[v3,1/3] ethdev: enable direct rearm with separate API

Message ID 20230104073043.1120168-2-feifei.wang2@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series Direct re-arming of buffers on receive side |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-testing fail build patch failure

Commit Message

Feifei Wang Jan. 4, 2023, 7:30 a.m. UTC
  Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' APIs to direct rearm
mode to separate the Rx and Tx operations. This allows direct rearm
between different driver sources; for example, the Rx driver can be
ixgbe while the Tx driver is i40e.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/ethdev/ethdev_driver.h   |  10 ++
 lib/ethdev/ethdev_private.c  |   2 +
 lib/ethdev/rte_ethdev.c      |  52 +++++++++++
 lib/ethdev/rte_ethdev.h      | 174 +++++++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h |  11 +++
 lib/ethdev/version.map       |   6 ++
 6 files changed, 255 insertions(+)
  

Comments

Morten Brørup Jan. 4, 2023, 8:21 a.m. UTC | #1
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 4 January 2023 08.31
> 
> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> mode for separate Rx and Tx Operation. And this can support different
> multiple sources in direct rearm mode. For examples, Rx driver is
> ixgbe,
> and Tx driver is i40e.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---

This feature looks very promising for performance. I am pleased to see progress on it.

Please confirm that the fast path functions are still thread safe, i.e. one EAL thread may be calling rte_eth_rx_burst() while another EAL thread is calling rte_eth_tx_burst().

A few more comments below.

>  lib/ethdev/ethdev_driver.h   |  10 ++
>  lib/ethdev/ethdev_private.c  |   2 +
>  lib/ethdev/rte_ethdev.c      |  52 +++++++++++
>  lib/ethdev/rte_ethdev.h      | 174 +++++++++++++++++++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h |  11 +++
>  lib/ethdev/version.map       |   6 ++
>  6 files changed, 255 insertions(+)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..bc539ec862 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>  	eth_rx_descriptor_status_t rx_descriptor_status;
>  	/** Check the status of a Tx descriptor */
>  	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;

What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific term and/or implementation detail, e.g. by providing a conceptual implementation for a non-Intel PMD, e.g. mlx5.

Please note: I do not request the ability to rearm between drivers from different vendors, I only request that the public ethdev API uses generic terms and concepts, so any NIC vendor can implement the direct-rearm functions in their PMDs.

> +	/** Flush Rx descriptor in direct rearm mode */
> +	eth_rx_flush_descriptor_t rx_flush_descriptor;

descriptor -> descriptors. There is more than one, both in the comment and the function name.

[...]

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..381c3d535f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
>  	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
>  } __rte_cache_min_aligned;
> 
> +/**
> + * @internal
> + * Structure used to hold pointers to internal ethdev Rx rearm data.
> + * The main purpose is to load Rx queue rearm data in direct rearm
> mode.
> + */
> +struct rte_eth_rxq_rearm_data {
> +	void *rx_sw_ring;
> +	uint16_t *rearm_start;
> +	uint16_t *rearm_nb;
> +} __rte_cache_min_aligned;

Please add descriptions to the fields in this structure.
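
For illustration, the fields could be documented along these lines. This is only a sketch of my reading of their intent from the patch, not the author's definitions, and the cache-alignment attribute is omitted so the snippet stands alone:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: field names are from the patch, descriptions are guesses. */
struct rte_eth_rxq_rearm_data {
	/** The Rx queue's array of mbuf pointers (its "sw-ring") */
	void *rx_sw_ring;
	/** Points at the sw-ring index where rearming should begin */
	uint16_t *rearm_start;
	/** Points at the number of descriptors waiting to be rearmed */
	uint16_t *rearm_nb;
};

/* A consumer reads the live queue state through the exported pointers. */
static uint16_t rearm_budget(const struct rte_eth_rxq_rearm_data *d)
{
	return *d->rearm_nb;
}
```

Because the structure exports pointers into the queue's private state, the documentation should also spell out who is allowed to update them and when.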
  
Feifei Wang Jan. 4, 2023, 8:51 a.m. UTC | #2
Hi, Morten

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, January 4, 2023 4:22 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 4 January 2023 08.31
> >
> > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> > mode for separate Rx and Tx Operation. And this can support different
> > multiple sources in direct rearm mode. For examples, Rx driver is
> > ixgbe, and Tx driver is i40e.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> 
> This feature looks very promising for performance. I am pleased to see
> progress on it.
> 
Thank you very much for the review.

> Please confirm that the fast path functions are still thread safe, i.e. one EAL
> thread may be calling rte_eth_rx_burst() while another EAL thread is calling
> rte_eth_tx_burst().
> 
Regarding thread safety: as we say in the cover letter, direct rearm currently
supports Rx and Tx in the same thread. Supporting multiple threads, as in the
'pipeline model', would require adding a lock in the data path, which would
decrease performance.
Thus, as a first step we enable direct rearm in a single thread; later we will
consider enabling direct rearm across multiple threads and improving the performance.

> A few more comments below.
> 
> >  lib/ethdev/ethdev_driver.h   |  10 ++
> >  lib/ethdev/ethdev_private.c  |   2 +
> >  lib/ethdev/rte_ethdev.c      |  52 +++++++++++
> >  lib/ethdev/rte_ethdev.h      | 174
> +++++++++++++++++++++++++++++++++++
> >  lib/ethdev/rte_ethdev_core.h |  11 +++
> >  lib/ethdev/version.map       |   6 ++
> >  6 files changed, 255 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 6a550cfc83..bc539ec862 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> >  	eth_rx_descriptor_status_t rx_descriptor_status;
> >  	/** Check the status of a Tx descriptor */
> >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> 
> What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific
> term and/or implementation detail, e.g. by providing a conceptual
> implementation for a non-Intel PMD, e.g. mlx5.
The Rx sw_ring is used to store mbufs in the Intel PMDs. It is the same concept
as 'rxq->elts' in mlx5. Agreed that we need to provide a conceptual
implementation for all PMDs.
> 
> Please note: I do not request the ability to rearm between drivers from
> different vendors, I only request that the public ethdev API uses generic
> terms and concepts, so any NIC vendor can implement the direct-rearm
> functions in their PMDs.
Agreed; this is also our intention.
Going forward, we plan to implement this function in different PMDs.

> 
> > +	/** Flush Rx descriptor in direct rearm mode */
> > +	eth_rx_flush_descriptor_t rx_flush_descriptor;
> 
> descriptor -> descriptors. There are more than one. Both in comment and
> function name.
Agree.
> 
> [...]
> 
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > c129ca1eaf..381c3d535f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> >  	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
> >  } __rte_cache_min_aligned;
> >
> > +/**
> > + * @internal
> > + * Structure used to hold pointers to internal ethdev Rx rearm data.
> > + * The main purpose is to load Rx queue rearm data in direct rearm
> > mode.
> > + */
> > +struct rte_eth_rxq_rearm_data {
> > +	void *rx_sw_ring;
> > +	uint16_t *rearm_start;
> > +	uint16_t *rearm_nb;
> > +} __rte_cache_min_aligned;
> 
> Please add descriptions to the fields in this structure.
Agree.
  
Morten Brørup Jan. 4, 2023, 10:11 a.m. UTC | #3
> From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> Sent: Wednesday, 4 January 2023 09.51
> 
> Hi, Morten
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, January 4, 2023 4:22 PM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 4 January 2023 08.31
> > >
> > > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> rearm
> > > mode for separate Rx and Tx Operation. And this can support
> different
> > > multiple sources in direct rearm mode. For examples, Rx driver is
> > > ixgbe, and Tx driver is i40e.
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> >
> > This feature looks very promising for performance. I am pleased to
> see
> > progress on it.
> >
> Thanks very much for your reviewing.
> 
> > Please confirm that the fast path functions are still thread safe,
> i.e. one EAL
> > thread may be calling rte_eth_rx_burst() while another EAL thread is
> calling
> > rte_eth_tx_burst().
> >
> For the multiple threads safe, like we say in cover letter, current
> direct-rearm support
> Rx and Tx in the same thread. If we consider multiple threads like
> 'pipeline model', there
> need to add 'lock' in the data path which can decrease the performance.
> Thus, the first step we do is try to enable direct-rearm in the single
> thread, and then we will consider
> to enable direct rearm in multiple threads and improve the performance.

OK, doing it in steps is a good idea for a feature like this - makes it easier to understand and review.

When proceeding to add support for the "pipeline model", perhaps the lockless principles from the rte_ring can be used in this feature too.

From a high level perspective, I'm somewhat worried that releasing a "work-in-progress" version of this feature in some DPDK version will cause API/ABI breakage discussions when progressing to the next steps of the implementation to make the feature more complete. Not only support for thread safety across simultaneous RX and TX, but also support for multiple mbuf pools per RX queue [1]. Marking the functions experimental should alleviate such discussions, but there is a risk of pushback to not break the API/ABI anyway.

[1]: https://elixir.bootlin.com/dpdk/v22.11.1/source/lib/ethdev/rte_ethdev.h#L1105

[...]

> > > --- a/lib/ethdev/ethdev_driver.h
> > > +++ b/lib/ethdev/ethdev_driver.h
> > > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > >  	/** Check the status of a Tx descriptor */
> > >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > > +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > > +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> >
> > What is "Rx sw-ring"? Please confirm that this is not an Intel PMD
> specific
> > term and/or implementation detail, e.g. by providing a conceptual
> > implementation for a non-Intel PMD, e.g. mlx5.
> Rx sw_ring is used  to store mbufs in intel PMD. This is the same as
> 'rxq->elts'
> in mlx5. 

Sounds good.

Then all we need is consensus on a generic name for this, unless "Rx sw-ring" already is the generic name. (I'm not a PMD developer, so I might be completely off track here.) Naming is often debatable, so I'll stop talking about it now - I only wanted to highlight that we should avoid vendor-specific terms in public APIs intended to be implemented by multiple vendors. On the other hand... if no other vendors raise their voices before merging into the DPDK main repository, they forfeit their right to complain about it. ;-)

> Agree with that we need to providing a conceptual
> implementation for all PMDs.

My main point is that we should ensure that the feature is not too tightly coupled with the way Intel PMDs implement mbuf handling. Providing a conceptual implementation for a non-Intel PMD is one way of checking this.

The actual implementation in other PMDs could be left up to the various NIC vendors.
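
To make the genericity question concrete, here is a purely conceptual, vendor-neutral sketch of the two hooks. All names are hypothetical; the point is only that any PMD keeping an array of Rx buffer pointers plus a count of empty slots, whatever it calls that array internally, could implement the same contract:

```c
#include <assert.h>
#include <stdint.h>

/* Conceptual, vendor-neutral sketch; every name here is hypothetical. */
struct generic_rxq {
	void **buf_ring;     /* Rx buffer pointer array ("sw-ring"/"elts") */
	uint16_t free_slots; /* slots waiting for fresh buffers */
	uint16_t published;  /* descriptors made visible to hardware */
};

/* "Fill" hook: hand completed Tx buffers straight to the Rx side. */
static uint16_t generic_fill(struct generic_rxq *rxq, void **done, uint16_t n)
{
	uint16_t i, m = (n < rxq->free_slots) ? n : rxq->free_slots;

	for (i = 0; i < m; i++)
		rxq->buf_ring[i] = done[i];
	rxq->free_slots -= m;
	return m;
}

/* "Flush" hook: publish the corresponding descriptors to hardware. */
static void generic_flush(struct generic_rxq *rxq, uint16_t n)
{
	rxq->published += n;
}
```

A real PMD would of course write hardware descriptors in the flush step; the sketch only shows that the split does not depend on any Intel-specific data layout.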
  
Konstantin Ananyev Feb. 2, 2023, 2:33 p.m. UTC | #4
Hi Feifei,

> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> mode for separate Rx and Tx Operation. And this can support different
> multiple sources in direct rearm mode. For examples, Rx driver is ixgbe,
> and Tx driver is i40e.


Thanks for your effort, and thanks for taking the comments provided into 
consideration.
This approach looks much better than the previous ones.
A few nits below.
Konstantin

> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   lib/ethdev/ethdev_driver.h   |  10 ++
>   lib/ethdev/ethdev_private.c  |   2 +
>   lib/ethdev/rte_ethdev.c      |  52 +++++++++++
>   lib/ethdev/rte_ethdev.h      | 174 +++++++++++++++++++++++++++++++++++
>   lib/ethdev/rte_ethdev_core.h |  11 +++
>   lib/ethdev/version.map       |   6 ++
>   6 files changed, 255 insertions(+)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..bc539ec862 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>   	eth_rx_descriptor_status_t rx_descriptor_status;
>   	/** Check the status of a Tx descriptor */
>   	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> +	/** Flush Rx descriptor in direct rearm mode */
> +	eth_rx_flush_descriptor_t rx_flush_descriptor;
>   
>   	/**
>   	 * Device data that is shared between primary and secondary processes
> @@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
>   typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
>   	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
>   
> +/**< @internal Get rearm data for a receive queue of an Ethernet device. */
> +typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev *dev,
> +	uint16_t tx_queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
>   typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
>   	uint16_t queue_id, struct rte_eth_burst_mode *mode);
>   
> @@ -1215,6 +1223,8 @@ struct eth_dev_ops {
>   	eth_rxq_info_get_t         rxq_info_get;
>   	/** Retrieve Tx queue information */
>   	eth_txq_info_get_t         txq_info_get;
> +	/** Get Rx queue rearm data */
> +	eth_rxq_rearm_data_get_t   rxq_rearm_data_get;
>   	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst mode */
>   	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst mode */
>   	eth_fw_version_get_t       fw_version_get; /**< Get firmware version */
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..c5dd5e30f6 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -276,6 +276,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>   	fpo->rx_queue_count = dev->rx_queue_count;
>   	fpo->rx_descriptor_status = dev->rx_descriptor_status;
>   	fpo->tx_descriptor_status = dev->tx_descriptor_status;
> +	fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
> +	fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
>   
>   	fpo->rxq.data = dev->data->rx_queues;
>   	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 5d5e18db1e..2af5cb42fe 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -3282,6 +3282,21 @@ rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t rx_queue_id,
>   						stat_idx, STAT_QMAP_RX));
>   }
>   
> +int
> +rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_rx_queue_id,
> +		struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> +	int nb_rearm = 0;
> +
> +	nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_rx_queue_id, rxq_rearm_data);
> +
> +	if (nb_rearm > 0)
> +		return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id, nb_rearm);
> +
> +	return 0;
> +}
> +
>   int
>   rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t fw_size)
>   {
> @@ -5323,6 +5338,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>   	return 0;
>   }
>   
> +int
> +rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> +		struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (queue_id >= dev->data->nb_rx_queues) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
> +		return -EINVAL;
> +	}
> +
> +	if (rxq_rearm_data == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx queue %u rearm data to NULL\n",
> +			port_id, queue_id);
> +		return -EINVAL;
> +	}
> +
> +	if (dev->data->rx_queues == NULL ||
> +			dev->data->rx_queues[queue_id] == NULL) {
> +		RTE_ETHDEV_LOG(ERR,
> +			   "Rx queue %"PRIu16" of device with port_id=%"
> +			   PRIu16" has not been setup\n",
> +			   queue_id, port_id);
> +		return -EINVAL;
> +	}
> +
> +	if (*dev->dev_ops->rxq_rearm_data_get == NULL)
> +		return -ENOTSUP;
> +
> +	dev->dev_ops->rxq_rearm_data_get(dev, queue_id, rxq_rearm_data);
> +
> +	return 0;
> +}
> +
>   int
>   rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>   			  struct rte_eth_burst_mode *mode)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..381c3d535f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
>   	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
>   } __rte_cache_min_aligned;
>   
> +/**
> + * @internal
> + * Structure used to hold pointers to internal ethdev Rx rearm data.
> + * The main purpose is to load Rx queue rearm data in direct rearm mode.
> + */

I think this structure needs a lot more explanation:
what each field is supposed to do, what the constraints are, etc.

In general, more documentation will be needed to explain this feature.
> +struct rte_eth_rxq_rearm_data {
> +	void *rx_sw_ring;

That's misleading; we are always supposed to store mbuf pointers here,
so why not be direct:
struct rte_mbuf **rx_sw_ring;

> +	uint16_t *rearm_start;
> +	uint16_t *rearm_nb;

I know that uint16_t is sufficient for Intel NICs, but I wonder whether
it always would be for other vendors.
Another thing to consider: what happens when the ring position wraps?
Again, I know that handling it is not required for Intel NICs, but would
the API be sufficient if it is supposed to be general?
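
For what it's worth, a toy sketch of what wrap-around handling in a fill routine could look like. This is hypothetical code, not the patch's implementation; a real PMD would likely use a power-of-two mask rather than '%':

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: copy up to n buffer pointers into the sw-ring,
 * starting at *rearm_start and wrapping at ring_size, then advance
 * the rearm bookkeeping. Names and layout are hypothetical. */
static uint16_t
fill_with_wrap(void **sw_ring, uint16_t ring_size,
	       uint16_t *rearm_start, uint16_t *rearm_nb,
	       void **bufs, uint16_t n)
{
	uint16_t i, idx = *rearm_start;

	if (n > *rearm_nb)
		n = *rearm_nb;                    /* never overfill the queue */
	for (i = 0; i < n; i++) {
		sw_ring[idx] = bufs[i];
		idx = (uint16_t)((idx + 1) % ring_size); /* wrap position */
	}
	*rearm_start = idx;
	*rearm_nb = (uint16_t)(*rearm_nb - n);
	return n;                                 /* buffers actually placed */
}
```

With 16-bit index fields, a ring larger than 65535 entries could not be described at all, which is the genericity question above in code form.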


> +} __rte_cache_min_aligned;
> +
>   /* Generic Burst mode flag definition, values can be ORed. */
>   
>   /**
> @@ -3184,6 +3195,34 @@ int rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
>   					   uint16_t rx_queue_id,
>   					   uint8_t stat_idx);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Directly put Tx freed buffers into Rx sw-ring and flush desc.
> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.
> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_rearm_data
> + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + * @return
> + *   - 0: Success
> + */
> +__rte_experimental
> +int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id,
> +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);


I think we need one more parameter for that function, 'uint16_t offset'
or so, so that _rearm_ will start to populate rx_sw_ring from the
*rearm_start + offset position. That way we can support populating from
different sources.
Or should 'offset' be part of struct rte_eth_rxq_rearm_data?
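
The index arithmetic the 'offset' idea implies could be as small as this hypothetical helper (not proposed API, just an illustration): each source starts 'offset' entries past the queue's rearm position, wrapping as needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: first sw-ring slot for a source whose buffers
 * should land 'offset' entries past the queue's current rearm position. */
static uint16_t rearm_slot(uint16_t rearm_start, uint16_t offset,
			   uint16_t ring_size)
{
	return (uint16_t)((rearm_start + offset) % ring_size);
}
```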

> +
>   /**
>    * Retrieve the Ethernet address of an Ethernet device.
>    *
> @@ -4782,6 +4821,27 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>   int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>   	struct rte_eth_txq_info *qinfo);
>   
> +/**
> + * Get rearm data about given ports's Rx queue.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The Rx queue on the Ethernet device for which rearm data
> + *   will be got.
> + * @param rxq_rearm_data
> + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + *
> + * @return
> + *   - 0: Success
> + *   - -ENODEV:  If *port_id* is invalid.
> + *   - -ENOTSUP: routine is not supported by the device PMD.
> + *   - -EINVAL:  The queue_id is out of range.
> + */
> +__rte_experimental
> +int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
>   /**
>    * Retrieve information about the Rx packet burst mode.
>    *
> @@ -6103,6 +6163,120 @@ static inline int rte_eth_tx_descriptor_status(uint16_t port_id,
>   	return (*p->tx_descriptor_status)(qd, offset);
>   }
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Fill Rx sw-ring with Tx buffers in direct rearm mode.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_rearm_data
> + *   A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled with
> + *   the rearm data of a receive queue.
> + * @return
> + *   - The number buffers correct to be filled in the Rx sw-ring.
> + *   - (-EINVAL) bad port or queue.
> + *   - (-ENODEV) bad port.
> + *   - (-ENOTSUP) if the device does not support this function.
> + *
> + */
> +__rte_experimental
> +static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
> +	uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		return -EINVAL;
> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->txq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		return -ENODEV;
> +	}
> +#endif
> +
> +	if (p->tx_fill_sw_ring == NULL)
> +		return -ENOTSUP;
> +
> +	return p->tx_fill_sw_ring(qd, rxq_rearm_data);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + *  Flush Rx descriptor in direct rearm mode.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the receive queue.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + *@param nb_rearm
> + *   The number of Rx sw-ring buffers need to be flushed.
> + * @return
> + *   - (0) if successful.
> + *   - (-EINVAL) bad port or queue.
> + *   - (-ENODEV) bad port.
> + *   - (-ENOTSUP) if the device does not support this function.
> + */
> +__rte_experimental
> +static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
> +	uint16_t queue_id, uint16_t nb_rearm)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		return -EINVAL;
> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->rxq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		return -ENODEV;
> +	}
> +#endif
> +
> +	if (p->rx_flush_descriptor == NULL)
> +		return -ENOTSUP;
> +
> +	return p->rx_flush_descriptor(qd, nb_rearm);
> +}
> +
>   /**
>    * @internal
>    * Helper routine for rte_eth_tx_burst().
> diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> index dcf8adab92..5ecb57f6f0 100644
> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
>   /** @internal Check the status of a Tx descriptor */
>   typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
>   
> +/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
> +typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
> +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
> +/** @internal Flush Rx descriptor in direct rearm mode */
> +typedef int (*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
> +
>   /**
>    * @internal
>    * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> @@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
>   	eth_rx_queue_count_t rx_queue_count;
>   	/** Check the status of a Rx descriptor. */
>   	eth_rx_descriptor_status_t rx_descriptor_status;
> +	/** Flush Rx descriptor in direct rearm mode */
> +	eth_rx_flush_descriptor_t rx_flush_descriptor;
>   	/** Rx queues data. */
>   	struct rte_ethdev_qdata rxq;
>   	uintptr_t reserved1[3];
> @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
>   	eth_tx_prep_t tx_pkt_prepare;
>   	/** Check the status of a Tx descriptor. */
>   	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
>   	/** Tx queues data. */
>   	struct rte_ethdev_qdata txq;
>   	uintptr_t reserved2[3];
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 17201fbe0f..f39f02a69b 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -298,6 +298,12 @@ EXPERIMENTAL {
>   	rte_flow_get_q_aged_flows;
>   	rte_mtr_meter_policy_get;
>   	rte_mtr_meter_profile_get;
> +
> +	# added in 23.03
> +	rte_eth_dev_direct_rearm;
> +	rte_eth_rx_flush_descriptor;
> +	rte_eth_rx_queue_rearm_data_get;
> +	rte_eth_tx_fill_sw_ring;
>   };
>   
>   INTERNAL {
  
Feifei Wang Feb. 24, 2023, 8:55 a.m. UTC | #5
Sorry for my delayed reply.

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, January 4, 2023 6:11 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> > From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> > Sent: Wednesday, 4 January 2023 09.51
> >
> > Hi, Morten
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Wednesday, January 4, 2023 4:22 PM
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Wednesday, 4 January 2023 08.31
> > > >
> > > > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > rearm
> > > > mode for separate Rx and Tx Operation. And this can support
> > different
> > > > multiple sources in direct rearm mode. For examples, Rx driver is
> > > > ixgbe, and Tx driver is i40e.
> > > >
> > > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > >
> > > This feature looks very promising for performance. I am pleased to
> > see
> > > progress on it.
> > >
> > Thanks very much for your reviewing.
> >
> > > Please confirm that the fast path functions are still thread safe,
> > i.e. one EAL
> > > thread may be calling rte_eth_rx_burst() while another EAL thread is
> > calling
> > > rte_eth_tx_burst().
> > >
> > For the multiple threads safe, like we say in cover letter, current
> > direct-rearm support Rx and Tx in the same thread. If we consider
> > multiple threads like 'pipeline model', there need to add 'lock' in
> > the data path which can decrease the performance.
> > Thus, the first step we do is try to enable direct-rearm in the single
> > thread, and then we will consider to enable direct rearm in multiple
> > threads and improve the performance.
> 
> OK, doing it in steps is a good idea for a feature like this - makes it easier to
> understand and review.
> 
> When proceeding to add support for the "pipeline model", perhaps the
> lockless principles from the rte_ring can be used in this feature too.
> 
> From a high level perspective, I'm somewhat worried that releasing a "work-
> in-progress" version of this feature in some DPDK version will cause API/ABI
> breakage discussions when progressing to the next steps of the
> implementation to make the feature more complete. Not only support for
> thread safety across simultaneous RX and TX, but also support for multiple
> mbuf pools per RX queue [1]. Marking the functions experimental should
> alleviate such discussions, but there is a risk of pushback to not break the
> API/ABI anyway.
> 
> [1]:
> https://elixir.bootlin.com/dpdk/v22.11.1/source/lib/ethdev/rte_ethdev.h#L1
> 105
> 

[Feifei] I do not think the subsequent upgrades will significantly damage the stability
of the API we currently define.

For thread safety across simultaneous Rx and Tx, the future lockless changes will
happen in the PMD layer, for example CAS load/store on the PMD's Rx queue index.
Thus, this cannot affect the stability of the upper API.

For multiple mbuf pools per Rx queue, direct rearm just puts Tx buffers into Rx
buffers, and it does not care which mempool a buffer comes from.
Buffers from different mempools are eventually freed back into their respective
sources in the non-FAST_FREE path.
I think this is a mistake in the cover letter. The previous direct rearm could only
support FAST_FREE, which constrained the buffers to come from the same mempool.
The latest version also supports the non-FAST_FREE path, but we forgot to update
the cover letter.
> [...]
> 
> > > > --- a/lib/ethdev/ethdev_driver.h
> > > > +++ b/lib/ethdev/ethdev_driver.h
> > > > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > > >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > > >  	/** Check the status of a Tx descriptor */
> > > >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > > > +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > > > +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> > >
> > > What is "Rx sw-ring"? Please confirm that this is not an Intel PMD
> > specific
> > > term and/or implementation detail, e.g. by providing a conceptual
> > > implementation for a non-Intel PMD, e.g. mlx5.
> > Rx sw_ring is used  to store mbufs in intel PMD. This is the same as
> > 'rxq->elts'
> > in mlx5.
> 
> Sounds good.
> 
> Then all we need is consensus on a generic name for this, unless "Rx sw-ring"
> already is the generic name. (I'm not a PMD developer, so I might be
> completely off track here.) Naming is often debatable, so I'll stop talking
> about it now - I only wanted to highlight that we should avoid vendor-
> specific terms in public APIs intended to be implemented by multiple vendors.
> On the other hand... if no other vendors raise their voices before merging
> into the DPDK main repository, they forfeit their right to complain about it. ;-)
> 
> > Agree with that we need to providing a conceptual implementation for
> > all PMDs.
> 
> My main point is that we should ensure that the feature is not too tightly
> coupled with the way Intel PMDs implement mbuf handling. Providing a
> conceptual implementation for a non-Intel PMD is one way of checking this.
> 
> The actual implementation in other PMDs could be left up to the various NIC
> vendors.

Yes. And we will rename our API to make it suitable for all vendors:
rte_eth_direct_rearm  ->  rte_eth_buf_cycle   (upper-level API for direct rearm)
rte_eth_tx_fill_sw_ring  ->  rte_eth_tx_buf_stash   (Tx queue fills the Rx buffer ring)
rte_eth_rx_flush_descriptor  ->  rte_eth_rx_descriptors_refill   (Rx queue refills its descriptors)

rte_eth_rxq_rearm_data {
	void *rx_sw_ring;
	uint16_t *rearm_start;
	uint16_t *rearm_nb;
}

->

struct rxq_recycle_info {
	struct rte_mbuf **buf_ring;
	uint16_t *offset = (uint16_t *)(&rq->ci);
	uint16_t *end;
	uint16_t ring_size;

}
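To make the intended flow concrete, here is a minimal sketch of how a buffer-recycle round could use this structure. The names, the mbuf stand-in, and the two helper functions are illustrative assumptions, not the final API:

```c
#include <assert.h>
#include <stdint.h>

struct mbuf { int id; };  /* stand-in for struct rte_mbuf, for illustration */

struct rxq_recycle_info {
	struct mbuf **buf_ring;  /* Rx buffer ring to refill */
	uint16_t *offset;        /* rearm-start index (consumer side) */
	uint16_t *end;           /* rearm-end index (producer side) */
	uint16_t ring_size;      /* assumed to be a power of two */
};

/* Tx side: stash freed Tx buffers directly into the Rx buffer ring,
 * starting at *offset. Masking with (ring_size - 1) maps a possibly
 * free-running index onto a ring slot. */
static uint16_t
tx_buf_stash(struct rxq_recycle_info *r, struct mbuf **tx_free, uint16_t n)
{
	uint16_t mask = r->ring_size - 1;
	uint16_t start = *r->offset;
	uint16_t i;

	for (i = 0; i < n; i++)
		r->buf_ring[(uint16_t)(start + i) & mask] = tx_free[i];
	return n;
}

/* Rx side: account for the refilled range by advancing the producer
 * index; a real PMD would also rewrite the HW descriptors here. */
static void
rx_descriptors_refill(struct rxq_recycle_info *r, uint16_t n)
{
	*r->end = (uint16_t)(*r->end + n);
}
```

An rte_eth_buf_cycle()-style wrapper would then simply call the stash step on the Tx queue followed by the refill step on the Rx queue.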
  
Feifei Wang Feb. 24, 2023, 9:45 a.m. UTC | #6
Hi, Konstantin

Thanks for your reviewing and sorry for my delayed response.
For your comments, we put forward several improvement plans below.

Best Regards
Feifei

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Thursday, February 2, 2023 10:33 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> Hi Feifei,
> 
> > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> > mode for separate Rx and Tx Operation. And this can support different
> > multiple sources in direct rearm mode. For examples, Rx driver is
> > ixgbe, and Tx driver is i40e.
> 
> 
> Thanks for your effort and thanks for taking comments provided into
> consideration.
> That approach looks much better then previous ones.
> Few nits below.
> Konstantin
> 
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >   lib/ethdev/ethdev_driver.h   |  10 ++
> >   lib/ethdev/ethdev_private.c  |   2 +
> >   lib/ethdev/rte_ethdev.c      |  52 +++++++++++
> >   lib/ethdev/rte_ethdev.h      | 174
> +++++++++++++++++++++++++++++++++++
> >   lib/ethdev/rte_ethdev_core.h |  11 +++
> >   lib/ethdev/version.map       |   6 ++
> >   6 files changed, 255 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 6a550cfc83..bc539ec862 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> >   	eth_rx_descriptor_status_t rx_descriptor_status;
> >   	/** Check the status of a Tx descriptor */
> >   	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> > +	/** Flush Rx descriptor in direct rearm mode */
> > +	eth_rx_flush_descriptor_t rx_flush_descriptor;
> >
> >   	/**
> >   	 * Device data that is shared between primary and secondary
> > processes @@ -504,6 +508,10 @@ typedef void
> (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
> >   typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
> >   	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
> >
> > +/**< @internal Get rearm data for a receive queue of an Ethernet
> > +device. */ typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev
> *dev,
> > +	uint16_t tx_queue_id, struct rte_eth_rxq_rearm_data
> > +*rxq_rearm_data);
> > +
> >   typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
> >   	uint16_t queue_id, struct rte_eth_burst_mode *mode);
> >
> > @@ -1215,6 +1223,8 @@ struct eth_dev_ops {
> >   	eth_rxq_info_get_t         rxq_info_get;
> >   	/** Retrieve Tx queue information */
> >   	eth_txq_info_get_t         txq_info_get;
> > +	/** Get Rx queue rearm data */
> > +	eth_rxq_rearm_data_get_t   rxq_rearm_data_get;
> >   	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst
> mode */
> >   	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst
> mode */
> >   	eth_fw_version_get_t       fw_version_get; /**< Get firmware version
> */
> > diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> > index 48090c879a..c5dd5e30f6 100644
> > --- a/lib/ethdev/ethdev_private.c
> > +++ b/lib/ethdev/ethdev_private.c
> > @@ -276,6 +276,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
> >   	fpo->rx_queue_count = dev->rx_queue_count;
> >   	fpo->rx_descriptor_status = dev->rx_descriptor_status;
> >   	fpo->tx_descriptor_status = dev->tx_descriptor_status;
> > +	fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
> > +	fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
> >
> >   	fpo->rxq.data = dev->data->rx_queues;
> >   	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 5d5e18db1e..2af5cb42fe 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -3282,6 +3282,21 @@
> rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t
> rx_queue_id,
> >   						stat_idx, STAT_QMAP_RX));
> >   }
> >
> > +int
> > +rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_rx_queue_id,
> > +		struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > +	int nb_rearm = 0;
> > +
> > +	nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_rx_queue_id,
> > +rxq_rearm_data);
> > +
> > +	if (nb_rearm > 0)
> > +		return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id,
> > +nb_rearm);
> > +
> > +	return 0;
> > +}
> > +
> >   int
> >   rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t
> fw_size)
> >   {
> > @@ -5323,6 +5338,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id,
> uint16_t queue_id,
> >   	return 0;
> >   }
> >
> > +int
> > +rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> > +		struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> > +	dev = &rte_eth_devices[port_id];
> > +
> > +	if (queue_id >= dev->data->nb_rx_queues) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n",
> queue_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (rxq_rearm_data == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx
> queue %u rearm data to NULL\n",
> > +			port_id, queue_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (dev->data->rx_queues == NULL ||
> > +			dev->data->rx_queues[queue_id] == NULL) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			   "Rx queue %"PRIu16" of device with port_id=%"
> > +			   PRIu16" has not been setup\n",
> > +			   queue_id, port_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (*dev->dev_ops->rxq_rearm_data_get == NULL)
> > +		return -ENOTSUP;
> > +
> > +	dev->dev_ops->rxq_rearm_data_get(dev, queue_id,
> rxq_rearm_data);
> > +
> > +	return 0;
> > +}
> > +
> >   int
> >   rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
> >   			  struct rte_eth_burst_mode *mode) diff --git
> > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > c129ca1eaf..381c3d535f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> >   	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
> >   } __rte_cache_min_aligned;
> >
> > +/**
> > + * @internal
> > + * Structure used to hold pointers to internal ethdev Rx rearm data.
> > + * The main purpose is to load Rx queue rearm data in direct rearm mode.
> > + */
> 
> I think this structure owes a lot more expalantion.
> What each fields suppose to do and what are the constraints, etc.
> 
> In general, more doc will be needed to explain that feature.

You are right. We will add more explanation for it and also update the documentation
to explain the feature.

> > +struct rte_eth_rxq_rearm_data {
> > +	void *rx_sw_ring;
> 
> That's misleading, we always suppose to store mbufs ptrs here, so why not
> be direct:
> struct rte_mbuf **rx_sw_ring;
> 
Agree.
> > +	uint16_t *rearm_start;
> > +	uint16_t *rearm_nb;
> 
> I know that for Intel NICs uint16_t is sufficient, wonder would it always be
> for other vendors?
> Another thing to consider the case when ring position wrapping?
> Again I know that it is not required for Intel NICs, but would it be sufficient
> for API that supposed to be general?
> 
For this, we re-define this structure:
rte_eth_rxq_rearm_data {
	void *rx_sw_ring;
	uint16_t *rearm_start;
	uint16_t *rearm_nb;
}
->
struct rxq_recycle_info {
	struct rte_mbuf **buf_ring;
	uint16_t *offset = (uint16_t *)(&rq->ci);
	uint16_t *end;
	uint16_t ring_size;

}
For the new structure, *offset is a pointer to the rearm-start index of the
Rx buffer ring (the consumer index), and *end is a pointer to the rearm-end
index of the Rx buffer ring (the producer index).

1. We looked at different PMDs: some use 'uint16_t' as the index size, like the Intel PMDs,
and some use 'uint32_t', like the MLX5 or ThunderX PMDs.
For a PMD using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size - 1)]', and 'uint16_t'
is sufficient for the ring size.

2. Good question. In the general path, there is a constraint that 'nb_rearm < ring_size - rq->ci',
which ensures no ring wrapping during rearm. In direct-rearm, we will refer to this constraint
to handle ring wrapping.
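As an illustrative sketch of that constraint (assuming a power-of-two ring and a free-running consumer index; this is not PMD code, and the function name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Cap a rearm burst at the distance from the rearm-start slot to the
 * physical end of the ring, so a single burst never wraps. */
static uint16_t
max_rearm_no_wrap(uint32_t ci, uint16_t ring_size, uint16_t nb_rearm)
{
	uint16_t pos = (uint16_t)(ci & (ring_size - 1)); /* rearm-start slot */
	uint16_t room = (uint16_t)(ring_size - pos);     /* slots until wrap */

	return nb_rearm < room ? nb_rearm : room;
}
```

A caller that wants to rearm more than 'room' buffers could simply invoke the refill twice, splitting the burst at the ring boundary.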

> 
> > +} __rte_cache_min_aligned;
> > +
> >   /* Generic Burst mode flag definition, values can be ORed. */
> >
> >   /**
> > @@ -3184,6 +3195,34 @@ int
> rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
> >   					   uint16_t rx_queue_id,
> >   					   uint8_t stat_idx);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Directly put Tx freed buffers into Rx sw-ring and flush desc.
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_rearm_data
> > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + * @return
> > + *   - 0: Success
> > + */
> > +__rte_experimental
> > +int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> 
> 
> I think we need one more parameter for that function 'uint16_t offset'
> or so.
> So _rearm_ will start to populate rx_sw_ring from *rearm_start + offset
> position. That way we can support populating from different sources.
> Or should 'offset' be part of truct rte_eth_rxq_rearm_data?
> 
Agreed. Please see above; we will make this change in the structure.
> > +
> >   /**
> >    * Retrieve the Ethernet address of an Ethernet device.
> >    *
> > @@ -4782,6 +4821,27 @@ int rte_eth_rx_queue_info_get(uint16_t
> port_id, uint16_t queue_id,
> >   int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> >   	struct rte_eth_txq_info *qinfo);
> >
> > +/**
> > + * Get rearm data about given ports's Rx queue.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The Rx queue on the Ethernet device for which rearm data
> > + *   will be got.
> > + * @param rxq_rearm_data
> > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + *
> > + * @return
> > + *   - 0: Success
> > + *   - -ENODEV:  If *port_id* is invalid.
> > + *   - -ENOTSUP: routine is not supported by the device PMD.
> > + *   - -EINVAL:  The queue_id is out of range.
> > + */
> > +__rte_experimental
> > +int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t
> queue_id,
> > +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> > +
> >   /**
> >    * Retrieve information about the Rx packet burst mode.
> >    *
> > @@ -6103,6 +6163,120 @@ static inline int
> rte_eth_tx_descriptor_status(uint16_t port_id,
> >   	return (*p->tx_descriptor_status)(qd, offset);
> >   }
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Fill Rx sw-ring with Tx buffers in direct rearm mode.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_rearm_data
> > + *   A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled
> with
> > + *   the rearm data of a receive queue.
> > + * @return
> > + *   - The number buffers correct to be filled in the Rx sw-ring.
> > + *   - (-EINVAL) bad port or queue.
> > + *   - (-ENODEV) bad port.
> > + *   - (-ENOTSUP) if the device does not support this function.
> > + *
> > + */
> > +__rte_experimental
> > +static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
> > +	uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> {
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		return -EINVAL;
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->txq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		return -ENODEV;
> > +	}
> > +#endif
> > +
> > +	if (p->tx_fill_sw_ring == NULL)
> > +		return -ENOTSUP;
> > +
> > +	return p->tx_fill_sw_ring(qd, rxq_rearm_data); }
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + *  Flush Rx descriptor in direct rearm mode.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the receive queue.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + *@param nb_rearm
> > + *   The number of Rx sw-ring buffers need to be flushed.
> > + * @return
> > + *   - (0) if successful.
> > + *   - (-EINVAL) bad port or queue.
> > + *   - (-ENODEV) bad port.
> > + *   - (-ENOTSUP) if the device does not support this function.
> > + */
> > +__rte_experimental
> > +static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
> > +	uint16_t queue_id, uint16_t nb_rearm) {
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		return -EINVAL;
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->rxq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		return -ENODEV;
> > +	}
> > +#endif
> > +
> > +	if (p->rx_flush_descriptor == NULL)
> > +		return -ENOTSUP;
> > +
> > +	return p->rx_flush_descriptor(qd, nb_rearm); }
> > +
> >   /**
> >    * @internal
> >    * Helper routine for rte_eth_tx_burst().
> > diff --git a/lib/ethdev/rte_ethdev_core.h
> > b/lib/ethdev/rte_ethdev_core.h index dcf8adab92..5ecb57f6f0 100644
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq,
> uint16_t offset);
> >   /** @internal Check the status of a Tx descriptor */
> >   typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t
> > offset);
> >
> > +/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > +typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
> > +		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> > +
> > +/** @internal Flush Rx descriptor in direct rearm mode */ typedef int
> > +(*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
> > +
> >   /**
> >    * @internal
> >    * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> > @@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
> >   	eth_rx_queue_count_t rx_queue_count;
> >   	/** Check the status of a Rx descriptor. */
> >   	eth_rx_descriptor_status_t rx_descriptor_status;
> > +	/** Flush Rx descriptor in direct rearm mode */
> > +	eth_rx_flush_descriptor_t rx_flush_descriptor;
> >   	/** Rx queues data. */
> >   	struct rte_ethdev_qdata rxq;
> >   	uintptr_t reserved1[3];
> > @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
> >   	eth_tx_prep_t tx_pkt_prepare;
> >   	/** Check the status of a Tx descriptor. */
> >   	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> >   	/** Tx queues data. */
> >   	struct rte_ethdev_qdata txq;
> >   	uintptr_t reserved2[3];
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 17201fbe0f..f39f02a69b 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -298,6 +298,12 @@ EXPERIMENTAL {
> >   	rte_flow_get_q_aged_flows;
> >   	rte_mtr_meter_policy_get;
> >   	rte_mtr_meter_profile_get;
> > +
> > +	# added in 23.03
> > +	rte_eth_dev_direct_rearm;
> > +	rte_eth_rx_flush_descriptor;
> > +	rte_eth_rx_queue_rearm_data_get;
> > +	rte_eth_tx_fill_sw_ring;
> >   };
> >
> >   INTERNAL {
  
Konstantin Ananyev Feb. 27, 2023, 7:31 p.m. UTC | #7
Hi Feifei ,


> > > +	uint16_t *rearm_start;
> > > +	uint16_t *rearm_nb;
> >
> > I know that for Intel NICs uint16_t is sufficient, wonder would it always be
> > for other vendors?
> > Another thing to consider the case when ring position wrapping?
> > Again I know that it is not required for Intel NICs, but would it be sufficient
> > for API that supposed to be general?
> >
> For this, we re-define this structure:
> rte_eth_rxq_rearm_data {
> 	void *rx_sw_ring;
> 	uint16_t *rearm_start;
> 	uint16_t *rearm_nb;
> }
> ->
> struct *rxq_recycle_info {
> 	rte_mbuf **buf_ring;
> 	uint16_t *offset = (uint16 *)(&rq->ci);
> 	uint16_t *end;
> 	uint16_t ring_size;
> 
> }
> For the new structure, *offset is a pointer for rearm-start index of
> Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> Of Rx buffer ring (producer index).
> 
> 1. we look up different pmds,  some pmds using 'uint_16t' as index size like intel PMD,
> some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -1)]', and 'uint16_t'
> is enough for ring size.

Sounds like a smart idea to me. 
 

> 
> 2. Good question. In general path, there is a constraint that  'nb_rearm < ring_size - rq->ci',
> This can ensure no ring wrapping in rearm. Thus in direct-rearm, we will refer to this to
> solve ring wrapping.

Should work, I think...
Just need not to forget to document it :)
  
Feifei Wang Feb. 28, 2023, 2:16 a.m. UTC | #8
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, February 28, 2023 3:32 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; thomas@monjalon.net; Ferruh Yigit
> <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> 
> Hi Feifei ,
> 
> 
> > > > +	uint16_t *rearm_start;
> > > > +	uint16_t *rearm_nb;
> > >
> > > I know that for Intel NICs uint16_t is sufficient, wonder would it
> > > always be for other vendors?
> > > Another thing to consider the case when ring position wrapping?
> > > Again I know that it is not required for Intel NICs, but would it be
> > > sufficient for API that supposed to be general?
> > >
> > For this, we re-define this structure:
> > rte_eth_rxq_rearm_data {
> > 	void *rx_sw_ring;
> > 	uint16_t *rearm_start;
> > 	uint16_t *rearm_nb;
> > }
> > ->
> > struct *rxq_recycle_info {
> > 	rte_mbuf **buf_ring;
> > 	uint16_t *offset = (uint16 *)(&rq->ci);
> > 	uint16_t *end;
> > 	uint16_t ring_size;
> >
> > }
> > For the new structure, *offset is a pointer for rearm-start index of
> > Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> > Of Rx buffer ring (producer index).
> >
> > 1. we look up different pmds,  some pmds using 'uint_16t' as index
> > size like intel PMD, some pmds using 'uint32_t' as index size like MLX5 or
> thunderx PMD.
> > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -1)]',
> and 'uint16_t'
> > is enough for ring size.
> 
> Sounds like a smart idea to me.
> 
> 
> >
> > 2. Good question. In general path, there is a constraint that
> > 'nb_rearm < ring_size - rq->ci', This can ensure no ring wrapping in
> > rearm. Thus in direct-rearm, we will refer to this to solve ring wrapping.
> 
> Should work, I think...
> Just need not to forget to document it :)
Agreed, we need to document this.
  
Morten Brørup Feb. 28, 2023, 8:09 a.m. UTC | #9
Hi Feifei,

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 27 February 2023 20.32
> 
> Hi Feifei ,
> 
> 
> > > > +	uint16_t *rearm_start;
> > > > +	uint16_t *rearm_nb;
> > >
> > > I know that for Intel NICs uint16_t is sufficient, wonder would it always
> be
> > > for other vendors?
> > > Another thing to consider the case when ring position wrapping?
> > > Again I know that it is not required for Intel NICs, but would it be
> sufficient
> > > for API that supposed to be general?
> > >
> > For this, we re-define this structure:
> > rte_eth_rxq_rearm_data {
> > 	void *rx_sw_ring;
> > 	uint16_t *rearm_start;
> > 	uint16_t *rearm_nb;
> > }
> > ->
> > struct *rxq_recycle_info {
> > 	rte_mbuf **buf_ring;
> > 	uint16_t *offset = (uint16 *)(&rq->ci);
> > 	uint16_t *end;
> > 	uint16_t ring_size;
> >
> > }
> > For the new structure, *offset is a pointer for rearm-start index of
> > Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> > Of Rx buffer ring (producer index).
> >
> > 1. we look up different pmds,  some pmds using 'uint_16t' as index size like
> intel PMD,
> > some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -
> 1)]', and 'uint16_t'
> > is enough for ring size.
> 
> Sounds like a smart idea to me.

When configuring an Ethernet device queue, the nb_rx/tx_desc parameter to rte_eth_rx/tx_queue_setup() is uint16_t, so I agree that uint16_t should suffice here too.

I had the following thought, but am not sure. So please take this comment for consideration only:

I think the "& (ring_size -1)" is superfluous, unless a PMD allows its index pointer to exceed the ring size, and performs the same "& (ring_size -1)" when using the index pointer to access its ring.

And if a PMD uses the index pointer like that (i.e. exceeding the ring size), you would need the same wrap protection for a 16 bit index pointer.

> 
> 
> >
> > 2. Good question. In general path, there is a constraint that  'nb_rearm <
> ring_size - rq->ci',
> > This can ensure no ring wrapping in rearm. Thus in direct-rearm, we will
> refer to this to
> > solve ring wrapping.
> 
> Should work, I think...
> Just need not to forget to document it :)

It is this constraint (the guarantee that there is no ring wrapping in a rearm burst) that makes me think that the "& (ring_size -1)" is superfluous.
  
Feifei Wang March 1, 2023, 7:34 a.m. UTC | #10
Hi, Morten

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, February 28, 2023 4:09 PM
> To: Konstantin Ananyev <konstantin.ananyev@huawei.com>; Feifei
> Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; thomas@monjalon.net; Ferruh Yigit
> <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> Hi Feifei,
> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Monday, 27 February 2023 20.32
> >
> > Hi Feifei ,
> >
> >
> > > > > +	uint16_t *rearm_start;
> > > > > +	uint16_t *rearm_nb;
> > > >
> > > > I know that for Intel NICs uint16_t is sufficient, wonder would it
> > > > always
> > be
> > > > for other vendors?
> > > > Another thing to consider the case when ring position wrapping?
> > > > Again I know that it is not required for Intel NICs, but would it
> > > > be
> > sufficient
> > > > for API that supposed to be general?
> > > >
> > > For this, we re-define this structure:
> > > rte_eth_rxq_rearm_data {
> > > 	void *rx_sw_ring;
> > > 	uint16_t *rearm_start;
> > > 	uint16_t *rearm_nb;
> > > }
> > > ->
> > > struct *rxq_recycle_info {
> > > 	rte_mbuf **buf_ring;
> > > 	uint16_t *offset = (uint16 *)(&rq->ci);
> > > 	uint16_t *end;
> > > 	uint16_t ring_size;
> > >
> > > }
> > > For the new structure, *offset is a pointer for rearm-start index of
> > > Rx buffer ring (consumer index). *end is a pointer for rearm-end
> > > index Of Rx buffer ring (producer index).
> > >
> > > 1. we look up different pmds,  some pmds using 'uint_16t' as index
> > > size like
> > intel PMD,
> > > some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> > > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset &
> > > (ring_size -
> > 1)]', and 'uint16_t'
> > > is enough for ring size.
> >
> > Sounds like a smart idea to me.
> 
> When configuring an Ethernet device queue, the nb_rx/tx_desc parameter
> to rte_eth_rx/tx_queue_setup() is uint16_t, so I agree that uint16_t should
> suffice here too.
> 
> I had the following thought, but am not sure. So please take this comment
> for consideration only:
> 
> I think the "& (ring_size -1)" is superfluous, unless a PMD allows its index
> pointer to exceed the ring size, and performs the same "& (ring_size -1)"
> when using the index pointer to access its ring.
> 
> And if a PMD uses the index pointer like that (i.e. exceeding the ring size),
> you would need the same wrap protection for a 16 bit index pointer.
> 
> >
> >
> > >
> > > 2. Good question. In general path, there is a constraint that
> > > 'nb_rearm <
> > ring_size - rq->ci',
> > > This can ensure no ring wrapping in rearm. Thus in direct-rearm, we
> > > will
> > refer to this to
> > > solve ring wrapping.
> >
> > Should work, I think...
> > Just need not to forget to document it :)
> 
> It is this constraint (the guarantee that there is no ring wrapping in a rearm
> burst) that makes me think that the "& (ring_size -1)" is superfluous.

Actually, there is something misleading here; let me explain.

The '& (ring_size - 1)' is for PMDs whose index can exceed the ring_size, such as MLX5:
with 'uint16_t *offset = (uint16_t *)&mlx5_rxq->rq_ci', rq_ci is an index that can exceed the ring_size.
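A sketch of why a 16-bit pointer can still cover such an index (assuming a little-endian target, which is what a cast like the one above relies on in practice; the names here are illustrative): the low 16 bits of the 32-bit rq_ci live at the same address, and the '& (ring_size - 1)' only ever consumes those low bits, so both views select the same ring slot.

```c
#include <assert.h>
#include <stdint.h>

struct rxq_stub {
	uint32_t rq_ci;  /* free-running 32-bit index, may exceed ring_size */
};

/* Select the rearm slot through a 16-bit view of the 32-bit index.
 * On little-endian targets the cast aliases the low half-word, so
 * masking through it gives the same slot as masking the full index. */
static uint16_t
rearm_slot(const uint16_t *offset, uint16_t ring_size)
{
	return (uint16_t)(*offset & (ring_size - 1));
}
```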
  
Ferruh Yigit March 6, 2023, 12:49 p.m. UTC | #11
On 1/4/2023 8:21 AM, Morten Brørup wrote:
>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>> Sent: Wednesday, 4 January 2023 08.31
>>
>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
>> mode for separate Rx and Tx Operation. And this can support different
>> multiple sources in direct rearm mode. For examples, Rx driver is
>> ixgbe,
>> and Tx driver is i40e.
>>
>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>> ---
> 
> This feature looks very promising for performance. I am pleased to see progress on it.
> 

Hi Morten,

Yes, it brings some performance gain, but not for the generic use case, only for a
specific and constrained use case.

And the changes are relatively invasive compared to the use case they support:
they add two new inline datapath functions and a new dev_ops.

I am worried that the unnecessary complexity, and possible regressions in the
fundamental and simple parts of the project, introduced with the good intention of
gaining a few percent of performance in a specific use case, could hurt the
project.


I can see this is compared to the MBUF_FAST_FREE feature, but MBUF_FAST_FREE
is just an offload benefiting from the existing offload infrastructure; it
requires only a small update and logical change in the application, and is
simple to implement in the drivers. So, they are not the same from a
complexity perspective.

Briefly, I am not comfortable with this change; I would like to see explicit
approval and a code review from the techboard before proceeding.


> Please confirm that the fast path functions are still thread safe, i.e. one EAL thread may be calling rte_eth_rx_burst() while another EAL thread is calling rte_eth_tx_burst().
> 
> A few more comments below.
> 
>>  lib/ethdev/ethdev_driver.h   |  10 ++
>>  lib/ethdev/ethdev_private.c  |   2 +
>>  lib/ethdev/rte_ethdev.c      |  52 +++++++++++
>>  lib/ethdev/rte_ethdev.h      | 174 +++++++++++++++++++++++++++++++++++
>>  lib/ethdev/rte_ethdev_core.h |  11 +++
>>  lib/ethdev/version.map       |   6 ++
>>  6 files changed, 255 insertions(+)
>>
>> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
>> index 6a550cfc83..bc539ec862 100644
>> --- a/lib/ethdev/ethdev_driver.h
>> +++ b/lib/ethdev/ethdev_driver.h
>> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>>  	eth_rx_descriptor_status_t rx_descriptor_status;
>>  	/** Check the status of a Tx descriptor */
>>  	eth_tx_descriptor_status_t tx_descriptor_status;
>> +	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
>> +	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> 
> What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific term and/or implementation detail, e.g. by providing a conceptual implementation for a non-Intel PMD, e.g. mlx5.
> 
> Please note: I do not request the ability to rearm between drivers from different vendors, I only request that the public ethdev API uses generic terms and concepts, so any NIC vendor can implement the direct-rearm functions in their PMDs.
> 
>> +	/** Flush Rx descriptor in direct rearm mode */
>> +	eth_rx_flush_descriptor_t rx_flush_descriptor;
> 
> descriptor -> descriptors. There are more than one. Both in comment and function name.
> 
> [...]
> 
>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>> index c129ca1eaf..381c3d535f 100644
>> --- a/lib/ethdev/rte_ethdev.h
>> +++ b/lib/ethdev/rte_ethdev.h
>> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
>>  	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
>>  } __rte_cache_min_aligned;
>>
>> +/**
>> + * @internal
>> + * Structure used to hold pointers to internal ethdev Rx rearm data.
>> + * The main purpose is to load Rx queue rearm data in direct rearm
>> mode.
>> + */
>> +struct rte_eth_rxq_rearm_data {
>> +	void *rx_sw_ring;
>> +	uint16_t *rearm_start;
>> +	uint16_t *rearm_nb;
>> +} __rte_cache_min_aligned;
> 
> Please add descriptions to the fields in this structure.
>
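
For illustration, a documented version of the quoted structure might look like the sketch below. The field descriptions are inferred from the patch context and the surrounding discussion; they are not part of the submitted patch (the DPDK-specific `__rte_cache_min_aligned` attribute is omitted so the sketch stands alone):

```c
#include <stdint.h>

/*
 * Sketch of rte_eth_rxq_rearm_data with field descriptions.
 * The comments are inferred, NOT taken from the patch itself.
 */
struct rte_eth_rxq_rearm_data {
	/** Rx queue software ring holding the mbuf pointers to refill. */
	void *rx_sw_ring;
	/** Pointer to the index in the sw-ring where rearming starts. */
	uint16_t *rearm_start;
	/** Pointer to the number of descriptors waiting to be rearmed. */
	uint16_t *rearm_nb;
};
```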
  
Morten Brørup March 6, 2023, 1:26 p.m. UTC | #12
> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> Sent: Monday, 6 March 2023 13.49
> 
> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >> Sent: Wednesday, 4 January 2023 08.31
> >>
> >> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> >> mode for separate Rx and Tx Operation. And this can support different
> >> multiple sources in direct rearm mode. For examples, Rx driver is
> >> ixgbe,
> >> and Tx driver is i40e.
> >>
> >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >> ---
> >
> > This feature looks very promising for performance. I am pleased to see
> progress on it.
> >
> 
> Hi Morten,
> 
> Yes it brings some performance, but not to generic use case, only to
> specific and constraint use case.

I got the impression that the supported use case is a prominent and important use case.

This is the primary argument for considering such a complex non-generic feature.

> 
> And changes are relatively invasive comparing the usecase it supports,
> like it adds new two inline datapath functions and a new dev_ops.
> 
> I am worried the unnecessary complexity and possible regressions in the
> fundamental and simple parts of the project, with a good intention to
> gain a few percentage performance in a specific usecase, can hurt the
> project.
> 
> 
> I can see this is compared to MBUF_FAST_FREE feature, but MBUF_FAST_FREE
> is just an offload benefiting from existing offload infrastructure,
> which requires very small update and logically change in application and
> simple to implement in the drivers. So, they are not same from
> complexity perspective.
> 
> Briefly, I am not comfortable with this change, I would like to see an
> explicit approval and code review from techboard to proceed.

I agree that the complexity is very high, and thus requires extra consideration. Your suggested techboard review and approval process seems like a good solution.

And the performance benefit of direct rearm should be compared to the performance using the new zero-copy mempool API.

-Morten
  
Feifei Wang March 6, 2023, 2:53 p.m. UTC | #13
Hi, Morten, Ferruh

Thanks very much for your reviewing.

Whether they are worries about direct rearm or comments to improve it,
these urge us to learn more and think more. I think that is beneficial and
a good outcome of this exploration.

I will update to the latest version for the techboard code review. In the
meantime, I need some time to run performance tests on the latest version,
so it will not be ready for this week's meeting but will be done before the
techboard meeting in two weeks.

Best Regards
Feifei

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Monday, March 6, 2023 9:26 PM
> To: Ferruh Yigit <ferruh.yigit@amd.com>; Feifei Wang
> <Feifei.Wang2@arm.com>; thomas@monjalon.net; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; techboard@dpdk.org
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > Sent: Monday, 6 March 2023 13.49
> >
> > On 1/4/2023 8:21 AM, Morten Brørup wrote:
> > >> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > >> Sent: Wednesday, 4 January 2023 08.31
> > >>
> > >> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > >> rearm mode for separate Rx and Tx Operation. And this can support
> > >> different multiple sources in direct rearm mode. For examples, Rx
> > >> driver is ixgbe, and Tx driver is i40e.
> > >>
> > >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > >> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >> ---
> > >
> > > This feature looks very promising for performance. I am pleased to
> > > see
> > progress on it.
> > >
> >
> > Hi Morten,
> >
> > Yes it brings some performance, but not to generic use case, only to
> > specific and constraint use case.
> 
> I got the impression that the supported use case is a prominent and
> important use case.
> 
> This is the primary argument for considering such a complex non-generic
> feature.
> 
> >
> > And changes are relatively invasive comparing the usecase it supports,
> > like it adds new two inline datapath functions and a new dev_ops.
> >
> > I am worried the unnecessary complexity and possible regressions in
> > the fundamental and simple parts of the project, with a good intention
> > to gain a few percentage performance in a specific usecase, can hurt
> > the project.
> >
> >
> > I can see this is compared to MBUF_FAST_FREE feature, but
> > MBUF_FAST_FREE is just an offload benefiting from existing offload
> > infrastructure, which requires very small update and logically change
> > in application and simple to implement in the drivers. So, they are
> > not same from complexity perspective.
> >
> > Briefly, I am not comfortable with this change, I would like to see an
> > explicit approval and code review from techboard to proceed.
> 
> I agree that the complexity is very high, and thus requires extra consideration.
> Your suggested techboard review and approval process seems like a good
> solution.
> 
> And the performance benefit of direct rearm should be compared to the
> performance using the new zero-copy mempool API.
> 
> -Morten
  
Ferruh Yigit March 6, 2023, 3:02 p.m. UTC | #14
On 3/6/2023 1:26 PM, Morten Brørup wrote:
>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>> Sent: Monday, 6 March 2023 13.49
>>
>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>>> Sent: Wednesday, 4 January 2023 08.31
>>>>
>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
>>>> mode for separate Rx and Tx Operation. And this can support different
>>>> multiple sources in direct rearm mode. For examples, Rx driver is
>>>> ixgbe,
>>>> and Tx driver is i40e.
>>>>
>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> ---
>>>
>>> This feature looks very promising for performance. I am pleased to see
>> progress on it.
>>>
>>
>> Hi Morten,
>>
>> Yes it brings some performance, but not to generic use case, only to
>> specific and constraint use case.
> 
> I got the impression that the supported use case is a prominent and important use case.
> 

Can you please give real life samples for this use case, other than just
showing better performance number in the test bench? This helps to
understand the reasoning better.

> This is the primary argument for considering such a complex non-generic feature.
> 
>>
>> And changes are relatively invasive comparing the usecase it supports,
>> like it adds new two inline datapath functions and a new dev_ops.
>>
>> I am worried the unnecessary complexity and possible regressions in the
>> fundamental and simple parts of the project, with a good intention to
>> gain a few percentage performance in a specific usecase, can hurt the
>> project.
>>
>>
>> I can see this is compared to MBUF_FAST_FREE feature, but MBUF_FAST_FREE
>> is just an offload benefiting from existing offload infrastructure,
>> which requires very small update and logically change in application and
>> simple to implement in the drivers. So, they are not same from
>> complexity perspective.
>>
>> Briefly, I am not comfortable with this change, I would like to see an
>> explicit approval and code review from techboard to proceed.
> 
> I agree that the complexity is very high, and thus requires extra consideration. Your suggested techboard review and approval process seems like a good solution.
> 
> And the performance benefit of direct rearm should be compared to the performance using the new zero-copy mempool API.
> 
> -Morten
>
  
Honnappa Nagarahalli March 7, 2023, 6:12 a.m. UTC | #15
<snip>

> 
> On 3/6/2023 1:26 PM, Morten Brørup wrote:
> >> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >> Sent: Monday, 6 March 2023 13.49
> >>
> >> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>>> Sent: Wednesday, 4 January 2023 08.31
> >>>>
> >>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> >>>> rearm mode for separate Rx and Tx Operation. And this can support
> >>>> different multiple sources in direct rearm mode. For examples, Rx
> >>>> driver is ixgbe, and Tx driver is i40e.
> >>>>
> >>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>> ---
> >>>
> >>> This feature looks very promising for performance. I am pleased to
> >>> see
> >> progress on it.
> >>>
> >>
> >> Hi Morten,
> >>
> >> Yes it brings some performance, but not to generic use case, only to
> >> specific and constraint use case.
> >
> > I got the impression that the supported use case is a prominent and important
> use case.
> >
> 
> Can you please give real life samples for this use case, other than just showing
> better performance number in the test bench? This helps to understand the
> reasoning better.
The very first patch started off with a constrained but prominent use case. Still, DPU-based PCIe cards running DPDK applications with one or at most two ports in use, deployed across many data centers, are not a secret anymore, and that is not a small use case that can be ignored.
However, the design of the patch has changed significantly since then. Now the solution can be applied to any generic use case that uses the run-to-completion model of DPDK, i.e. the mapping of the Rx and Tx ports can be done dynamically in the data plane threads. There is no need for static configuration from the control plane.

Regarding the test bench, we need to make up our minds. When we see improvements, we say it is just a test bench. On other occasions, when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does not show any improvements.

> 
> > This is the primary argument for considering such a complex non-generic
> feature.
I am not sure what the complexity is here; can you please elaborate?
I see other patches/designs (e.g. proactive error recovery) that are far more complex to understand and comprehend.

> >
> >>
> >> And changes are relatively invasive comparing the usecase it
> >> supports, like it adds new two inline datapath functions and a new dev_ops.
> >>
> >> I am worried the unnecessary complexity and possible regressions in
> >> the fundamental and simple parts of the project, with a good
> >> intention to gain a few percentage performance in a specific usecase,
> >> can hurt the project.
I agree that we are touching some fundamental parts of the project. But we also need to realize that those fundamental parts were not developed with the architectures that joined the project much later in mind. Similarly, the use cases have evolved significantly from the originally intended ones. We cannot hold on to those fundamental designs if they hurt performance on other architectures while prominent new use cases need to be addressed.
Please note that this patch does not break any existing features or affect their performance in any negative way. The generic and originally intended use cases can benefit from this feature.

> >>
> >>
> >> I can see this is compared to MBUF_FAST_FREE feature, but
> >> MBUF_FAST_FREE is just an offload benefiting from existing offload
> >> infrastructure, which requires very small update and logically change
> >> in application and simple to implement in the drivers. So, they are
> >> not same from complexity perspective.
> >>
> >> Briefly, I am not comfortable with this change, I would like to see
> >> an explicit approval and code review from techboard to proceed.
> >
> > I agree that the complexity is very high, and thus requires extra consideration.
> Your suggested techboard review and approval process seems like a good
> solution.
We can add it to the agenda for the next techboard meeting.

> >
> > And the performance benefit of direct rearm should be compared to the
> performance using the new zero-copy mempool API.
> >
> > -Morten
> >
  
Konstantin Ananyev March 7, 2023, 10:52 a.m. UTC | #16
> > On 3/6/2023 1:26 PM, Morten Brørup wrote:
> > >> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > >> Sent: Monday, 6 March 2023 13.49
> > >>
> > >> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> > >>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > >>>> Sent: Wednesday, 4 January 2023 08.31
> > >>>>
> > >>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > >>>> rearm mode for separate Rx and Tx Operation. And this can support
> > >>>> different multiple sources in direct rearm mode. For examples, Rx
> > >>>> driver is ixgbe, and Tx driver is i40e.
> > >>>>
> > >>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > >>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >>>> ---
> > >>>
> > >>> This feature looks very promising for performance. I am pleased to
> > >>> see
> > >> progress on it.
> > >>>
> > >>
> > >> Hi Morten,
> > >>
> > >> Yes it brings some performance, but not to generic use case, only to
> > >> specific and constraint use case.
> > >
> > > I got the impression that the supported use case is a prominent and important
> > use case.
> > >
> >
> > Can you please give real life samples for this use case, other than just showing
> > better performance number in the test bench? This helps to understand the
> > reasoning better.
> The very first patch started off with a constrained but prominent use case. Though, DPU based PCIe cards running DPDK applications
> with 1 or max 2 ports being used in tons of data centers is not a secret anymore and not a small use case that can be ignored.
> However, the design of the patch has changed significantly from then. Now the solution can be applied to any generic use case that
> uses run-to-completion model of DPDK. i.e. the mapping of the RX and TX ports can be done dynamically in the data plane threads.
> There is no need of static configuration from control plane.

+1 to this statement.
I think the authors did a good job making it generic enough that it can be used for many different cases, and AFAIU it doesn't introduce new implicit restrictions for the user.
Again, this feature is totally optional, so users are free to ignore it.
Personally, I do not see any good reason why we shouldn't accept this feature into DPDK.
Of course, with more code reviews, extra testing, doc updates, etc.

> 
> On the test bench, we need to make up our mind. When we see improvements, we say it is just a test bench. On other occasions
> when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does
> not show any improvements.
> 
> >
> > > This is the primary argument for considering such a complex non-generic
> > feature.
> I am not sure what is the complexity here, can you please elaborate?
> I see other patches/designs (ex: proactive error recovery) which are way more complex to understand and comprehend.
> 
> > >
> > >>
> > >> And changes are relatively invasive comparing the usecase it
> > >> supports, like it adds new two inline datapath functions and a new dev_ops.
> > >>
> > >> I am worried the unnecessary complexity and possible regressions in
> > >> the fundamental and simple parts of the project, with a good
> > >> intention to gain a few percentage performance in a specific usecase,
> > >> can hurt the project.
> I agree that we are touching some fundamental parts of the project. But, we also need to realize that those fundamental parts were
> not developed on architectures that have joined the project way later. Similarly, the use cases have evolved significantly from the
> original intended use cases. We cannot hold on to those fundamental designs if they affect the performance on other architectures
> while addressing prominent new use cases.
> Please note that this patch does not break any existing features or affect their performance in any negative way. The generic and
> originally intended use cases can benefit from this feature.
> 
> > >>
> > >>
> > >> I can see this is compared to MBUF_FAST_FREE feature, but
> > >> MBUF_FAST_FREE is just an offload benefiting from existing offload
> > >> infrastructure, which requires very small update and logically change
> > >> in application and simple to implement in the drivers. So, they are
> > >> not same from complexity perspective.
> > >>
> > >> Briefly, I am not comfortable with this change, I would like to see
> > >> an explicit approval and code review from techboard to proceed.
> > >
> > > I agree that the complexity is very high, and thus requires extra consideration.
> > Your suggested techboard review and approval process seems like a good
> > solution.
> We can add to the agenda for the next Techboard meeting.
> 
> > >
> > > And the performance benefit of direct rearm should be compared to the
> > performance using the new zero-copy mempool API.
> > >
> > > -Morten
> > >
  
Ferruh Yigit March 7, 2023, 8:41 p.m. UTC | #17
On 3/7/2023 6:12 AM, Honnappa Nagarahalli wrote:
> <snip>
> 
>>
>> On 3/6/2023 1:26 PM, Morten Brørup wrote:
>>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>>>> Sent: Monday, 6 March 2023 13.49
>>>>
>>>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
>>>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>>>>> Sent: Wednesday, 4 January 2023 08.31
>>>>>>
>>>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
>>>>>> rearm mode for separate Rx and Tx Operation. And this can support
>>>>>> different multiple sources in direct rearm mode. For examples, Rx
>>>>>> driver is ixgbe, and Tx driver is i40e.
>>>>>>
>>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>>> ---
>>>>>
>>>>> This feature looks very promising for performance. I am pleased to
>>>>> see
>>>> progress on it.
>>>>>
>>>>
>>>> Hi Morten,
>>>>
>>>> Yes it brings some performance, but not to generic use case, only to
>>>> specific and constraint use case.
>>>
>>> I got the impression that the supported use case is a prominent and important
>> use case.
>>>
>>
>> Can you please give real life samples for this use case, other than just showing
>> better performance number in the test bench? This helps to understand the
>> reasoning better.
> The very first patch started off with a constrained but prominent use case. Though, DPU based PCIe cards running DPDK applications with 1 or max 2 ports being used in tons of data centers is not a secret anymore and not a small use case that can be ignored.
> However, the design of the patch has changed significantly from then. Now the solution can be applied to any generic use case that uses run-to-completion model of DPDK. i.e. the mapping of the RX and TX ports can be done dynamically in the data plane threads. There is no need of static configuration from control plane.
> 
> On the test bench, we need to make up our mind. When we see improvements, we say it is just a test bench. On other occasions when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does not show any improvements.
> 
>>
>>> This is the primary argument for considering such a complex non-generic
>> feature.
> I am not sure what is the complexity here, can you please elaborate?

I am considering this from the user's perspective.

OK, DPDK is already low level, but ethdev has only a handful of datapath
APIs (6 of them), and main ones are easy to comprehend:
rte_eth_rx_burst(port_id, queue_id, rx_pkts, nb_pkts);
rte_eth_tx_burst(port_id, queue_id, tx_pkts, nb_pkts);

They (magically) Rx/Tx buffers, easy to grasp.
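
To illustrate how small the existing datapath surface is, a typical run-to-completion forwarder only needs those two calls. The sketch below uses local stubs standing in for the real rte_eth_rx_burst()/rte_eth_tx_burst() so it is self-contained and compiles without DPDK:

```c
#include <stdint.h>

#define BURST 32

/* Stub standing in for rte_eth_rx_burst(); hands out fake mbuf
 * pointers until "remaining" packets have been delivered. */
uint16_t stub_rx_burst(void **pkts, uint16_t n, uint32_t *remaining)
{
	uint16_t nb = (*remaining < n) ? (uint16_t)*remaining : n;

	for (uint16_t i = 0; i < nb; i++)
		pkts[i] = (void *)(uintptr_t)(i + 1); /* fake mbufs */
	*remaining -= nb;
	return nb;
}

/* Stub standing in for rte_eth_tx_burst(); pretends every packet
 * was queued for transmission. */
uint16_t stub_tx_burst(void **pkts, uint16_t n)
{
	(void)pkts;
	return n;
}

/* The whole datapath of a run-to-completion forwarder:
 * receive a burst, process it, transmit it, repeat. */
uint32_t run_forwarding_loop(uint32_t npkts)
{
	void *burst[BURST];
	uint32_t total = 0, remaining = npkts;

	for (;;) {
		uint16_t nb_rx = stub_rx_burst(burst, BURST, &remaining);
		if (nb_rx == 0)
			break;
		/* ... packet processing would happen here ... */
		total += stub_tx_burst(burst, nb_rx);
	}
	return total;
}
```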

Maybe rte_eth_tx_prepare() is a little less obvious (why/when to use
it), but still I believe simple.

Whoever looks to these APIs can figure out how to use in the application.

The other three are related to the descriptors; I am not sure about their
use case, and I assume they are mostly useful for debugging.


But now we are adding new datapath APIs:
rte_eth_tx_fill_sw_ring(port_id, queue_id, rxq_rearm_data);
rte_eth_rx_flush_descriptor(port_id, queue_id, nb_rearm);

When you talk about the SW ring and re-arming descriptors, I believe you
will lose most of the users already; driver developers will know what it
is, you will know what it is, but people who are not close to the Ethernet
HW won't.
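
For readers who are not close to the hardware, the idea can be modeled in plain C: instead of Tx freeing completed mbufs back into the mempool and Rx allocating fresh ones from it, completed Tx buffers are placed straight into the Rx queue's software ring. This is a conceptual model only, not the driver implementation, and the names are illustrative:

```c
#include <stdint.h>

#define RING_SIZE 64

/* Conceptual model of the Rx queue rearm state (mirroring the fields
 * of the patch's rte_eth_rxq_rearm_data; not driver code). */
struct rearm_model {
	void *sw_ring[RING_SIZE]; /* Rx software ring of mbuf pointers   */
	uint16_t rearm_start;     /* first slot that needs a new buffer  */
	uint16_t rearm_nb;        /* number of slots needing new buffers */
};

/*
 * Model of the direct-rearm idea: buffers just completed by Tx are
 * moved straight into the Rx sw-ring, bypassing the mempool free +
 * alloc round trip. Returns how many buffers were moved.
 */
uint16_t direct_rearm(struct rearm_model *rx, void **tx_done,
		      uint16_t nb_done)
{
	uint16_t nb = (nb_done < rx->rearm_nb) ? nb_done : rx->rearm_nb;

	for (uint16_t i = 0; i < nb; i++)
		rx->sw_ring[(rx->rearm_start + i) % RING_SIZE] = tx_done[i];

	rx->rearm_start = (uint16_t)((rx->rearm_start + nb) % RING_SIZE);
	rx->rearm_nb = (uint16_t)(rx->rearm_nb - nb);
	/* A real driver would now also write the new buffer addresses
	 * into the hardware Rx descriptors (the "flush" step). */
	return nb;
}
```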

And these APIs will be very visible, not like one of many control plane
dev_ops. So this can confuse users who are not familiar with details.

Usage of these APIs comes with restrictions, and it is possible that some
percentage of users will miss these restrictions or misunderstand them and
will have issues.

Or many may be intimidated by them and stay away from using these APIs,
leaving them as a burden to maintain, to test, to fix. That is why I think
a real-life use case is needed; in that case, at least we will know that
some consumers will fix them, or let us know, when they get broken.

It may be possible to hide the details inside the driver so that the user
only sets an offload flag, similar to FAST_FREE, but in that case the
feature will lose flexibility and become even more specific, perhaps making
it less useful.


> I see other patches/designs (ex: proactive error recovery) which are way more complex to understand and comprehend.
> 
>>>
>>>>
>>>> And changes are relatively invasive comparing the usecase it
>>>> supports, like it adds new two inline datapath functions and a new dev_ops.
>>>>
>>>> I am worried the unnecessary complexity and possible regressions in
>>>> the fundamental and simple parts of the project, with a good
>>>> intention to gain a few percentage performance in a specific usecase,
>>>> can hurt the project.
> I agree that we are touching some fundamental parts of the project. But, we also need to realize that those fundamental parts were not developed on architectures that have joined the project way later. Similarly, the use cases have evolved significantly from the original intended use cases. We cannot hold on to those fundamental designs if they affect the performance on other architectures while addressing prominent new use cases.
> Please note that this patch does not break any existing features or affect their performance in any negative way. The generic and originally intended use cases can benefit from this feature.
> 
>>>>
>>>>
>>>> I can see this is compared to MBUF_FAST_FREE feature, but
>>>> MBUF_FAST_FREE is just an offload benefiting from existing offload
>>>> infrastructure, which requires very small update and logically change
>>>> in application and simple to implement in the drivers. So, they are
>>>> not same from complexity perspective.
>>>>
>>>> Briefly, I am not comfortable with this change, I would like to see
>>>> an explicit approval and code review from techboard to proceed.
>>>
>>> I agree that the complexity is very high, and thus requires extra consideration.
>> Your suggested techboard review and approval process seems like a good
>> solution.
> We can add to the agenda for the next Techboard meeting.
> 
>>>
>>> And the performance benefit of direct rearm should be compared to the
>> performance using the new zero-copy mempool API.
>>>
>>> -Morten
>>>
  
Honnappa Nagarahalli March 22, 2023, 2:43 p.m. UTC | #18
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Tuesday, March 7, 2023 2:41 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Morten Brørup
> <mb@smartsharesystems.com>; Feifei Wang <Feifei.Wang2@arm.com>;
> thomas@monjalon.net; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; techboard@dpdk.org
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
> 
> On 3/7/2023 6:12 AM, Honnappa Nagarahalli wrote:
> > <snip>
> >
> >>
> >> On 3/6/2023 1:26 PM, Morten Brørup wrote:
> >>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >>>> Sent: Monday, 6 March 2023 13.49
> >>>>
> >>>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >>>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>>>>> Sent: Wednesday, 4 January 2023 08.31
> >>>>>>
> >>>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> >>>>>> rearm mode for separate Rx and Tx Operation. And this can support
> >>>>>> different multiple sources in direct rearm mode. For examples, Rx
> >>>>>> driver is ixgbe, and Tx driver is i40e.
> >>>>>>
> >>>>>> Suggested-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> >>>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>>> Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> >>>>>> ---
> >>>>>
> >>>>> This feature looks very promising for performance. I am pleased to
> >>>>> see
> >>>> progress on it.
> >>>>>
> >>>>
> >>>> Hi Morten,
> >>>>
> >>>> Yes it brings some performance, but not to generic use case, only
> >>>> to specific and constraint use case.
> >>>
> >>> I got the impression that the supported use case is a prominent and
> >>> important
> >> use case.
> >>>
> >>
> >> Can you please give real life samples for this use case, other than
> >> just showing better performance number in the test bench? This helps
> >> to understand the reasoning better.
> > The very first patch started off with a constrained but prominent use case.
> Though, DPU based PCIe cards running DPDK applications with 1 or max 2 ports
> being used in tons of data centers is not a secret anymore and not a small use
> case that can be ignored.
> > However, the design of the patch has changed significantly from then. Now
> the solution can be applied to any generic use case that uses run-to-completion
> model of DPDK. i.e. the mapping of the RX and TX ports can be done
> dynamically in the data plane threads. There is no need of static configuration
> from control plane.
> >
> > On the test bench, we need to make up our mind. When we see
> improvements, we say it is just a test bench. On other occasions when the test
> bench does not show any improvements (but improvements are shown by
> other metrics), we say the test bench does not show any improvements.
> >
> >>
> >>> This is the primary argument for considering such a complex
> >>> non-generic
> >> feature.
> > I am not sure what is the complexity here, can you please elaborate?
> 
> I am considering from user perspective.
Thanks for clarifying, Ferruh.

> 
> OK, DPDK is already low level, but ethdev has only a handful of datapath APIs (6
> of them), and main ones are easy to comprehend:
> rte_eth_rx_burst(port_id, queue_id, rx_pkts, nb_pkts);
> rte_eth_tx_burst(port_id, queue_id, tx_pkts, nb_pkts);
> 
> They (magically) Rx/Tx buffers, easy to grasp.
I think the pktmbuf pool part is missing here. The user needs to create a pktmbuf pool by calling rte_pktmbuf_pool_create() and has to pass the cache_size parameter.
This requires the user to understand what a cache is, why it is required, and how it affects performance.
There are further complexities associated with pktmbuf pools: creating a pool with external pinned memory, creating a pool with an ops name, etc.
So, practically, the user needs to be aware of more details than just the Rx and Tx functions.
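
The point about the cache can be made concrete with a toy model: a per-lcore cache in front of the shared pool absorbs almost all alloc/free operations, which is what the cache_size parameter controls. This is purely illustrative C; the real rte_mempool cache is more sophisticated:

```c
#include <stdint.h>

/* Toy per-lcore cache in front of a shared pool. It only counts how
 * often the shared pool must be touched; not the real rte_mempool. */
struct toy_cache {
	uint32_t len;      /* objects currently held in the cache     */
	uint32_t size;     /* the configured cache_size               */
	uint32_t pool_ops; /* operations that reached the shared pool */
};

void toy_free(struct toy_cache *c)
{
	if (c->len == c->size) { /* cache full: spill half to the pool */
		c->pool_ops++;
		c->len = c->size / 2;
	}
	c->len++;
}

void toy_alloc(struct toy_cache *c)
{
	if (c->len == 0) {       /* cache empty: refill from the pool  */
		c->pool_ops++;
		c->len = c->size / 2;
	}
	c->len--;
}
```

In this model, a thread doing alternating alloc/free operations touches the shared pool only for the initial refill, no matter how many packets it handles; without the cache every operation would hit the shared (contended) pool.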

> 
> Maybe rte_eth_tx_prepare() is a little less obvious (why/when to use it), but still
> I believe simple.
> 
> Whoever looks to these APIs can figure out how to use in the application.
> 
> The other three is related to the descriptors and I am not sure about their use-
> case, I assume they are mostly good for debugging.
> 
> 
> But now we are adding new datapath APIs:
> rte_eth_tx_fill_sw_ring(port_id, queue_id, rxq_rearm_data);
> rte_eth_rx_flush_descriptor(port_id, queue_id, nb_rearm);
> 
> When you talk about SW ring and re-arming descriptors I believe you will loose
> most of the users already, driver developers will know what it is, you will know
> what that is, but people who are not close to the Ethernet HW won't.
Agree, the names could be better. I personally did not want to separate these two APIs, as I do not think a use case (receiving and transmitting packets across NICs of different types) exists that requires keeping them separate. But we did so based on feedback, and to maintain a cleaner separation between the Rx and Tx paths.
We will try to propose new names for these.

> 
> And these APIs will be very visible, not like one of many control plane dev_ops.
> So this can confuse users who are not familiar with details.
> 
> Usage of these APIs comes with restrictions; it is possible that some
> percentage of users will miss these restrictions or misunderstand them and will
> have issues.
Agreed, there are several features already with restrictions.

> 
> Or many may be intimidated by them and stay away from using these APIs,
> leaving them as a burden to maintain, to test, to fix. That is why I think a real-life
> use case is needed; in that case at least we will know some consumers will fix or
> let us know when they get broken.
> 
> It may be possible to hide the details in the driver and have the user only set an offload flag,
> similar to FAST_FREE, but in that case the feature will lose flexibility and it will be
> even more specific, perhaps making it less useful.
Agree.
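For comparison, the existing MBUF_FAST_FREE model needs only an offload flag at queue setup and no new datapath calls; a sketch, assuming dev_info was previously filled by rte_eth_dev_info_get() and nb_txd is the configured ring size:

```c
/* Single-flag model: the driver hides all recycling details. */
struct rte_eth_txconf txconf = dev_info.default_txconf;

txconf.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
rte_eth_tx_queue_setup(port_id, queue_id, nb_txd,
		rte_eth_dev_socket_id(port_id), &txconf);
```

The trade-off discussed above is exactly this: an offload flag is simple but fixes the Rx/Tx pairing inside one driver, whereas the explicit APIs keep cross-PMD flexibility at the cost of user-visible complexity.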

> 
> 
> > I see other patches/designs (ex: proactive error recovery) which are way more
> complex to understand and comprehend.
> >
> >>>
> >>>>
> >>>> And changes are relatively invasive comparing the usecase it
> >>>> supports, like it adds new two inline datapath functions and a new
> dev_ops.
> >>>>
> >>>> I am worried the unnecessary complexity and possible regressions in
> >>>> the fundamental and simple parts of the project, with a good
> >>>> intention to gain a few percentage performance in a specific
> >>>> usecase, can hurt the project.
> > I agree that we are touching some fundamental parts of the project. But, we
> also need to realize that those fundamental parts were not developed on
> architectures that have joined the project way later. Similarly, the use cases
> have evolved significantly from the original intended use cases. We cannot hold
> on to those fundamental designs if they affect the performance on other
> architectures while addressing prominent new use cases.
> > Please note that this patch does not break any existing features or affect their
> performance in any negative way. The generic and originally intended use cases
> can benefit from this feature.
> >
> >>>>
> >>>>
> >>>> I can see this is compared to MBUF_FAST_FREE feature, but
> >>>> MBUF_FAST_FREE is just an offload benefiting from existing offload
> >>>> infrastructure, which requires very small update and logically
> >>>> change in application and simple to implement in the drivers. So,
> >>>> they are not same from complexity perspective.
> >>>>
> >>>> Briefly, I am not comfortable with this change, I would like to see
> >>>> an explicit approval and code review from techboard to proceed.
> >>>
> >>> I agree that the complexity is very high, and thus requires extra
> consideration.
> >> Your suggested techboard review and approval process seems like a
> >> good solution.
> > We can add to the agenda for the next Techboard meeting.
> >
> >>>
> >>> And the performance benefit of direct rearm should be compared to
> >>> the
> >> performance using the new zero-copy mempool API.
> >>>
> >>> -Morten
> >>>
  

Patch

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 6a550cfc83..bc539ec862 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@  struct rte_eth_dev {
 	eth_rx_descriptor_status_t rx_descriptor_status;
 	/** Check the status of a Tx descriptor */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
+	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
+	/** Flush Rx descriptor in direct rearm mode */
+	eth_rx_flush_descriptor_t rx_flush_descriptor;
 
 	/**
 	 * Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@  typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
 	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
 
+/** @internal Get rearm data for a receive queue of an Ethernet device. */
+typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev *dev,
+	uint16_t rx_queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
 	uint16_t queue_id, struct rte_eth_burst_mode *mode);
 
@@ -1215,6 +1223,8 @@  struct eth_dev_ops {
 	eth_rxq_info_get_t         rxq_info_get;
 	/** Retrieve Tx queue information */
 	eth_txq_info_get_t         txq_info_get;
+	/** Get Rx queue rearm data */
+	eth_rxq_rearm_data_get_t   rxq_rearm_data_get;
 	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst mode */
 	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst mode */
 	eth_fw_version_get_t       fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 48090c879a..c5dd5e30f6 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -276,6 +276,8 @@  eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
 	fpo->rx_queue_count = dev->rx_queue_count;
 	fpo->rx_descriptor_status = dev->rx_descriptor_status;
 	fpo->tx_descriptor_status = dev->tx_descriptor_status;
+	fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
+	fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
 
 	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 5d5e18db1e..2af5cb42fe 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -3282,6 +3282,21 @@  rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t rx_queue_id,
 						stat_idx, STAT_QMAP_RX));
 }
 
+int
+rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id,
+		struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+	int nb_rearm = 0;
+
+	nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_queue_id, rxq_rearm_data);
+
+	if (nb_rearm > 0)
+		return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id, nb_rearm);
+
+	return 0;
+}
+
 int
 rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t fw_size)
 {
@@ -5323,6 +5338,43 @@  rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	return 0;
 }
 
+int
+rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (rxq_rearm_data == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx queue %u rearm data to NULL\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (dev->data->rx_queues == NULL ||
+			dev->data->rx_queues[queue_id] == NULL) {
+		RTE_ETHDEV_LOG(ERR,
+			   "Rx queue %"PRIu16" of device with port_id=%"
+			   PRIu16" has not been setup\n",
+			   queue_id, port_id);
+		return -EINVAL;
+	}
+
+	if (*dev->dev_ops->rxq_rearm_data_get == NULL)
+		return -ENOTSUP;
+
+	dev->dev_ops->rxq_rearm_data_get(dev, queue_id, rxq_rearm_data);
+
+	return 0;
+}
+
 int
 rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 			  struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index c129ca1eaf..381c3d535f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1818,6 +1818,17 @@  struct rte_eth_txq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 } __rte_cache_min_aligned;
 
+/**
+ * @internal
+ * Structure used to hold pointers to internal ethdev Rx rearm data.
+ * The main purpose is to load Rx queue rearm data in direct rearm mode.
+ */
+struct rte_eth_rxq_rearm_data {
+	void *rx_sw_ring;
+	uint16_t *rearm_start;
+	uint16_t *rearm_nb;
+} __rte_cache_min_aligned;
+
 /* Generic Burst mode flag definition, values can be ORed. */
 
 /**
@@ -3184,6 +3195,34 @@  int rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
 					   uint16_t rx_queue_id,
 					   uint8_t stat_idx);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Directly put freed Tx buffers into the Rx sw-ring and flush the descriptors.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_rearm_data
+ *   A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled.
+ * @return
+ *   - 0: Success
+ */
+__rte_experimental
+int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id,
+		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
 /**
  * Retrieve the Ethernet address of an Ethernet device.
  *
@@ -4782,6 +4821,27 @@  int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+/**
+ * Get rearm data for a given port's Rx queue.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which rearm data
+ *   will be retrieved.
+ * @param rxq_rearm_data
+ *   A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled.
+ *
+ * @return
+ *   - 0: Success
+ *   - -ENODEV:  If *port_id* is invalid.
+ *   - -ENOTSUP: routine is not supported by the device PMD.
+ *   - -EINVAL:  The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
 /**
  * Retrieve information about the Rx packet burst mode.
  *
@@ -6103,6 +6163,120 @@  static inline int rte_eth_tx_descriptor_status(uint16_t port_id,
 	return (*p->tx_descriptor_status)(qd, offset);
 }
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Fill Rx sw-ring with Tx buffers in direct rearm mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the transmit queue.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_rearm_data
+ *   A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled with
+ *   the rearm data of a receive queue.
+ * @return
+ *   - The number of buffers that can be filled into the Rx sw-ring.
+ *   - (-EINVAL) bad port or queue.
+ *   - (-ENODEV) bad port.
+ *   - (-ENOTSUP) if the device does not support this function.
+ *
+ */
+__rte_experimental
+static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
+	uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->txq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		return -ENODEV;
+	}
+#endif
+
+	if (p->tx_fill_sw_ring == NULL)
+		return -ENOTSUP;
+
+	return p->tx_fill_sw_ring(qd, rxq_rearm_data);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Flush Rx descriptors in direct rearm mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the receive queue.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param nb_rearm
+ *   The number of Rx sw-ring buffers that need to be flushed.
+ * @return
+ *   - (0) if successful.
+ *   - (-EINVAL) bad port or queue.
+ *   - (-ENODEV) bad port.
+ *   - (-ENOTSUP) if the device does not support this function.
+ */
+__rte_experimental
+static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
+	uint16_t queue_id, uint16_t nb_rearm)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->rxq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		return -ENODEV;
+	}
+#endif
+
+	if (p->rx_flush_descriptor == NULL)
+		return -ENOTSUP;
+
+	return p->rx_flush_descriptor(qd, nb_rearm);
+}
+
 /**
  * @internal
  * Helper routine for rte_eth_tx_burst().
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..5ecb57f6f0 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@  typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
 /** @internal Check the status of a Tx descriptor */
 typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
 
+/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
+typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
+		struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
+/** @internal Flush Rx descriptor in direct rearm mode */
+typedef int (*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
+
 /**
  * @internal
  * Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,6 +97,8 @@  struct rte_eth_fp_ops {
 	eth_rx_queue_count_t rx_queue_count;
 	/** Check the status of a Rx descriptor. */
 	eth_rx_descriptor_status_t rx_descriptor_status;
+	/** Flush Rx descriptor in direct rearm mode */
+	eth_rx_flush_descriptor_t rx_flush_descriptor;
 	/** Rx queues data. */
 	struct rte_ethdev_qdata rxq;
 	uintptr_t reserved1[3];
@@ -106,6 +115,8 @@  struct rte_eth_fp_ops {
 	eth_tx_prep_t tx_pkt_prepare;
 	/** Check the status of a Tx descriptor. */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Fill Rx sw-ring with Tx buffers in direct rearm mode */
+	eth_tx_fill_sw_ring_t tx_fill_sw_ring;
 	/** Tx queues data. */
 	struct rte_ethdev_qdata txq;
 	uintptr_t reserved2[3];
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 17201fbe0f..f39f02a69b 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -298,6 +298,12 @@  EXPERIMENTAL {
 	rte_flow_get_q_aged_flows;
 	rte_mtr_meter_policy_get;
 	rte_mtr_meter_profile_get;
+
+	# added in 23.03
+	rte_eth_dev_direct_rearm;
+	rte_eth_rx_flush_descriptor;
+	rte_eth_rx_queue_rearm_data_get;
+	rte_eth_tx_fill_sw_ring;
 };
 
 INTERNAL {
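For completeness, the combined helper rte_eth_dev_direct_rearm() added in rte_ethdev.c above wraps the two per-burst calls into one; a minimal usage sketch (port/queue identifiers are placeholders):

```c
/* Setup once, then call the combined helper per burst. */
struct rte_eth_rxq_rearm_data rearm_data;

if (rte_eth_rx_queue_rearm_data_get(rx_port, rx_queue, &rearm_data) == 0)
	rte_eth_dev_direct_rearm(rx_port, rx_queue,
			tx_port, tx_queue, &rearm_data);
```

Note that rte_eth_dev_direct_rearm() is a plain function (not an inline fast-path helper), so the separate inline APIs remain the performance-oriented path.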