[RFC,0/7] hide eth dev related structures

Message ID 20210820162834.12544-1-konstantin.ananyev@intel.com (mailing list archive)

Message

Ananyev, Konstantin Aug. 20, 2021, 4:28 p.m. UTC
NOTE: This is just an RFC to start further discussion and collect feedback.
Due to the significant amount of work required, the changes are applied to
only two PMDs so far: net/i40e and net/ice.
So to build it you'll need to add:
-Denable_drivers='common/*,mempool/*,net/ice,net/i40e'
to your config options. 
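E.g., assuming a fresh build directory named 'build', something like:

meson setup -Denable_drivers='common/*,mempool/*,net/ice,net/i40e' build
ninja -C build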

The aim of this patch series is to make the rte_ethdev core data structures
(rte_eth_dev, rte_eth_dev_data, rte_eth_rxtx_callback, etc.) internal to DPDK
and not visible to the user.
That should allow future changes to the core ethdev structures to be
transparent to the user and help improve ABI/API stability.
Note that the current ethdev API is preserved, though this is certainly
an ABI break.

The work is based on previous discussion at:
https://www.mail-archive.com/dev@dpdk.org/msg211405.html
and consists of the following main points:
1. Move public 'fast' function pointers (rx_pkt_burst(), etc.) from
   rte_eth_dev into a separate flat array. We keep it public to still be
   able to use inline functions for these 'fast' calls
   (like rte_eth_rx_burst(), etc.) to avoid/minimize slowdown.
2. Change the prototype of these 'fast' functions within PMDs
   (pkt_rx_burst(), etc.) to accept a <port_id, queue_id> pair
   instead of a queue pointer.
3. Some mechanical changes in function prologue/epilogue code are also
   required - basically, to avoid an extra level of indirection, the PMD
   has to do some preliminary checks and data retrieval that are currently
   done at the user level by the inline rte_eth* functions.
4. Special _rte_eth_*_prolog(/epilog) inline functions and helper macros
   are provided to make these changes inside PMDs as straightforward
   as possible.
5. Change the implementation of the 'fast' ethdev functions
   (rte_eth_rx_burst(), etc.) to use the new public flat array
   (see the sketch after this list).
6. Move rte_eth_dev, rte_eth_dev_data, rte_eth_rxtx_callback and related
   definitions into the internal header <ethdev_driver.h>.
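
To illustrate points 1, 2 and 5, here is a rough sketch of the idea
(identifiers below are illustrative only, not necessarily the ones used
in the actual patches):

#include <stdint.h>
#include <rte_config.h>
#include <rte_mbuf.h>

/* point 1: public flat array holding just the 'fast' entry points */
struct rte_eth_burst_sketch {
        /* point 2: PMD fast-path functions now take <port_id, queue_id> */
        uint16_t (*rx_pkt_burst)(uint16_t port_id, uint16_t queue_id,
                        struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
        uint16_t (*tx_pkt_burst)(uint16_t port_id, uint16_t queue_id,
                        struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
        /* ... tx_pkt_prepare, descriptor status, queue count, etc. ... */
};

extern struct rte_eth_burst_sketch rte_eth_burst_sketch[RTE_MAX_ETHPORTS];

/* point 5: the inline wrapper indexes the flat array instead of
 * dereferencing the (now internal) struct rte_eth_dev */
static inline uint16_t
rte_eth_rx_burst_sketch(uint16_t port_id, uint16_t queue_id,
                struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
        return rte_eth_burst_sketch[port_id].rx_pkt_burst(port_id, queue_id,
                        rx_pkts, nb_pkts);
}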

That approach was selected to avoid, or at least minimize, possible
performance losses.
 
So far I have done only a limited amount of functional and performance testing.
I didn't spot any functional problems, and performance numbers
remain the same before and after the patch on my box (testpmd, macswap fwd).

Remaining items:
================
- implement the required changes for all PMDs in drivers/net.
  So far I have made the changes for only two drivers, and would definitely
  appreciate some help from other PMD maintainers. The required changes are
  mechanical, but we have a lot of drivers these days.
- <rte_bus_pci.h> references an rte_eth_dev field via the
  RTE_ETH_DEV_TO_PCI(eth_dev) macro.
  This macro needs to be moved into some internal header.
- Extra testing
- checkpatch warnings
- docs update

Konstantin Ananyev (7):
  eth: move ethdev 'burst' API into separate structure
  eth: make drivers to use new API for Rx
  eth: make drivers to use new API for Tx
  eth: make drivers to use new API for Tx prepare
  eth: make drivers to use new API to obtain descriptor status
  eth: make drivers to use new API for Rx queue count
  eth: hide eth dev related structures

 app/test-pmd/config.c                         |  23 +-
 app/test/virtual_pmd.c                        |  27 +-
 drivers/common/octeontx2/otx2_sec_idev.c      |   2 +-
 drivers/crypto/octeontx2/otx2_cryptodev_ops.c |   2 +-
 drivers/net/i40e/i40e_ethdev.c                |  15 +-
 drivers/net/i40e/i40e_ethdev_vf.c             |  15 +-
 drivers/net/i40e/i40e_rxtx.c                  | 243 ++++---
 drivers/net/i40e/i40e_rxtx.h                  |  68 +-
 drivers/net/i40e/i40e_rxtx_vec_avx2.c         |  11 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c       |  12 +-
 drivers/net/i40e/i40e_rxtx_vec_sse.c          |   8 +-
 drivers/net/i40e/i40e_vf_representor.c        |  10 +-
 drivers/net/ice/ice_dcf_ethdev.c              |  10 +-
 drivers/net/ice/ice_dcf_vf_representor.c      |  10 +-
 drivers/net/ice/ice_ethdev.c                  |  15 +-
 drivers/net/ice/ice_rxtx.c                    | 236 ++++---
 drivers/net/ice/ice_rxtx.h                    |  73 +--
 drivers/net/ice/ice_rxtx_vec_avx2.c           |  24 +-
 drivers/net/ice/ice_rxtx_vec_avx512.c         |  24 +-
 drivers/net/ice/ice_rxtx_vec_common.h         |   7 +-
 drivers/net/ice/ice_rxtx_vec_sse.c            |  12 +-
 lib/ethdev/ethdev_driver.h                    | 601 ++++++++++++++++++
 lib/ethdev/ethdev_private.c                   |  74 +++
 lib/ethdev/ethdev_private.h                   |   3 +
 lib/ethdev/rte_ethdev.c                       | 176 ++++-
 lib/ethdev/rte_ethdev.h                       | 194 ++----
 lib/ethdev/rte_ethdev_core.h                  | 182 ++----
 lib/ethdev/version.map                        |  16 +
 lib/eventdev/rte_event_eth_rx_adapter.c       |   2 +-
 lib/eventdev/rte_event_eth_tx_adapter.c       |   2 +-
 lib/eventdev/rte_eventdev.c                   |   2 +-
 31 files changed, 1488 insertions(+), 611 deletions(-)
  

Comments

Jerin Jacob Aug. 26, 2021, 12:37 p.m. UTC | #1
On Fri, Aug 20, 2021 at 9:59 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> NOTE: This is just an RFC to start further discussion and collect feedback.
> Due to the significant amount of work required, the changes are applied to
> only two PMDs so far: net/i40e and net/ice.
> So to build it you'll need to add:
> -Denable_drivers='common/*,mempool/*,net/ice,net/i40e'
> to your config options.

>
> That approach was selected to avoid(/minimize) possible performance losses.
>
> So far I have done only a limited amount of functional and performance testing.
> I didn't spot any functional problems, and performance numbers
> remain the same before and after the patch on my box (testpmd, macswap fwd).


Based on testing on octeontx2, we see some regression in testpmd and
a bit on l3fwd too.

Without patch: 73.5mpps/core in testpmd iofwd
With patch: 72.5mpps/core in testpmd iofwd

Based on my understanding, it is due to an additional indirection.

My suggestion to fix the problem: remove the additional `data` redirection,
pull the callback function pointers back, and keep the rest as opaque, as
done in the existing patch, like [1].

I don't believe this has any real implication on future ABI stability,
as we will not be adding any new items to rte_eth_fp in any way, since new
features can be added in the slow-path rte_eth_dev, as mentioned in the
patch.

[2] is a patch doing the same; I don't see any performance
regression after [2].


[1]
- struct rte_eth_burst_api {
+ struct rte_eth_fp {
+ void *data;
  rte_eth_rx_burst_t rx_pkt_burst;
  /**< PMD receive function. */
  rte_eth_tx_burst_t tx_pkt_burst;
@@ -85,8 +100,19 @@ struct rte_eth_burst_api {
  /**< Check the status of a Rx descriptor. */
  rte_eth_tx_descriptor_status_t tx_descriptor_status;
  /**< Check the status of a Tx descriptor. */
+ /**
+ * User-supplied functions called from rx_burst to post-process
+ * received packets before passing them to the user
+ */
+ struct rte_eth_rxtx_callback
+ *post_rx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
+ /**
+ * User-supplied functions called from tx_burst to pre-process
+ * received packets before passing them to the driver for transmission.
+ */
+ struct rte_eth_rxtx_callback *pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
  uintptr_t reserved[2];
-} __rte_cache_min_aligned;
+} __rte_cache_aligned;

[2]
https://pastebin.com/CuqkrCW4
  
Ferruh Yigit Sept. 6, 2021, 6:09 p.m. UTC | #2
On 8/26/2021 1:37 PM, Jerin Jacob wrote:
> On Fri, Aug 20, 2021 at 9:59 PM Konstantin Ananyev
> <konstantin.ananyev@intel.com> wrote:
>>
>> NOTE: This is just an RFC to start further discussion and collect feedback.
>> Due to the significant amount of work required, the changes are applied to
>> only two PMDs so far: net/i40e and net/ice.
>> So to build it you'll need to add:
>> -Denable_drivers='common/*,mempool/*,net/ice,net/i40e'
>> to your config options.
> 
>>
>> That approach was selected to avoid(/minimize) possible performance losses.
>>
>> So far I have done only a limited amount of functional and performance testing.
>> I didn't spot any functional problems, and performance numbers
>> remain the same before and after the patch on my box (testpmd, macswap fwd).
> 
> 
> Based on testing on octeontx2, we see some regression in testpmd and
> a bit on l3fwd too.
>
> Without patch: 73.5mpps/core in testpmd iofwd
> With patch: 72.5mpps/core in testpmd iofwd
> 

So roughly a 1.4% drop.

> Based on my understanding it is due to additional indirection.
> 

What is the additional indirection? As far as I can see, all the dereferences
are the same.

> My suggestion to fix the problem: remove the additional `data` redirection,
> pull the callback function pointers back, and keep the rest as opaque, as
> done in the existing patch, like [1].
>
> I don't believe this has any real implication on future ABI stability,
> as we will not be adding any new items to rte_eth_fp in any way, since new
> features can be added in the slow-path rte_eth_dev, as mentioned in the
> patch.
> 
> [2] is a patch doing the same; I don't see any performance
> regression after [2].
> 

There is only one additional check after Konstantin's patch (in both
'rte_eth_rx_burst()' & 'rte_eth_tx_burst()'):
"
if (port_id >= RTE_MAX_ETHPORTS)
        return -EINVAL;
"

I can see this check is removed again in your patch [2], which fixes the
regression. I wonder if that is what restores the performance; can you
please test the original patch with just the 'RTE_MAX_ETHPORTS' check
removed?

Thanks,
ferruh

  
Ananyev, Konstantin Sept. 14, 2021, 1:33 p.m. UTC | #3
Hi Jerin,

> > NOTE: This is just an RFC to start further discussion and collect feedback.
> > Due to the significant amount of work required, the changes are applied to
> > only two PMDs so far: net/i40e and net/ice.
> > So to build it you'll need to add:
> > -Denable_drivers='common/*,mempool/*,net/ice,net/i40e'
> > to your config options.
> 
> >
> > That approach was selected to avoid(/minimize) possible performance losses.
> >
> > So far I have done only a limited amount of functional and performance testing.
> > I didn't spot any functional problems, and performance numbers
> > remain the same before and after the patch on my box (testpmd, macswap fwd).
> 
> 
> Based on testing on octeontx2, we see some regression in testpmd and
> a bit on l3fwd too.
>
> Without patch: 73.5mpps/core in testpmd iofwd
> With patch: 72.5mpps/core in testpmd iofwd
> 
> Based on my understanding, it is due to an additional indirection.

From your patch below, it looks like it is not actually an additional
indirection, but an extra memory dereference - the func and dev pointers are
now stored in different places. Plus the fact that we now dereference
rte_eth_devices[] data inside the PMD function, which probably prevents the
compiler and CPU from loading rte_eth_devices[port_id].data and
rte_eth_devices[port_id].pre_tx_burst_cbs[queue_id] in advance, before
calling the actual RX/TX function.
About your approach: I don't mind adding an extra opaque 'void *data' pointer,
but would prefer not to expose the callback invocation code in an inline
function. The main reason is that I think it still needs to be reworked to
allow adding/removing callbacks without stopping the device - something
similar to what was done for cryptodev callbacks. To be able to do that in
the future without another ABI breakage, the callback-related part needs to
be kept internal.
Though what we probably can do: add two dynamic arrays of opaque pointers to
rte_eth_burst_api - one for rx/tx queue data pointers, and a second for
rx/tx callback pointers.
To be more specific, something like:

typedef uint16_t (*rte_eth_rx_burst_t)(void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts, void *cbs);
typedef uint16_t (*rte_eth_tx_burst_t)(void *txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts, void *cbs);
....

struct rte_eth_burst_api {
        rte_eth_rx_burst_t rx_pkt_burst;
        /**< PMD receive function. */
        rte_eth_tx_burst_t tx_pkt_burst;
        /**< PMD transmit function. */
        rte_eth_tx_prep_t tx_pkt_prepare;
        /**< PMD transmit prepare function. */
        rte_eth_rx_queue_count_t rx_queue_count;
        /**< Get the number of used RX descriptors. */
        rte_eth_rx_descriptor_status_t rx_descriptor_status;
        /**< Check the status of a Rx descriptor. */
        rte_eth_tx_descriptor_status_t tx_descriptor_status;
        /**< Check the status of a Tx descriptor. */
        struct {
                void **queue_data; /* points to rte_eth_devices[port_id].data->rx_queues */
                void **cbs;        /* points to rte_eth_devices[port_id].post_rx_burst_cbs */
        } rx_data, tx_data;
} __rte_cache_aligned;

static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
                 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
        struct rte_eth_burst_api *p;

        if (port_id >= RTE_MAX_ETHPORTS || queue_id >= RTE_MAX_QUEUES_PER_PORT)
                return 0;

        p = &rte_eth_burst_api[port_id];
        return p->rx_pkt_burst(p->rx_data.queue_data[queue_id],
                        rx_pkts, nb_pkts, p->rx_data.cbs[queue_id]);
}

Same for TX.
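I.e., a sketch assuming the same rte_eth_burst_api layout as above:

static inline uint16_t
rte_eth_tx_burst(uint16_t port_id, uint16_t queue_id,
                 struct rte_mbuf **tx_pkts, const uint16_t nb_pkts)
{
        struct rte_eth_burst_api *p;

        if (port_id >= RTE_MAX_ETHPORTS || queue_id >= RTE_MAX_QUEUES_PER_PORT)
                return 0;

        p = &rte_eth_burst_api[port_id];
        return p->tx_pkt_burst(p->tx_data.queue_data[queue_id],
                        tx_pkts, nb_pkts, p->tx_data.cbs[queue_id]);
}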

If that looks ok to everyone, I'll try to prepare the next version based on that.
In theory that should avoid the extra dereference problem and even reduce indirection.
As a drawback, data->rxq/txq would always have to be allocated for
RTE_MAX_QUEUES_PER_PORT entries, but I presume that's not a big deal.

As a side question - is there any reason why rte_ethdev_trace_rx_burst() is
invoked at the very last point, while rte_ethdev_trace_tx_burst() is invoked
after the CBs but before the actual tx_pkt_burst()?
It would make things simpler if tracing were always done either on entrance
or on exit of rx/tx_burst.

  
Jerin Jacob Sept. 15, 2021, 9:45 a.m. UTC | #4
On Tue, Sep 14, 2021 at 7:03 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> Hi Jerin,
>
> > Based on testing on octeontx2, we see some regression in testpmd and
> > a bit on l3fwd too.
> >
> > Without patch: 73.5mpps/core in testpmd iofwd
> > With patch: 72.5mpps/core in testpmd iofwd
> >
> > Based on my understanding, it is due to an additional indirection.
>
> From your patch below, it looks like it is not actually an additional
> indirection, but an extra memory dereference - the func and dev pointers
> are now stored in different places.

Yup. I meant the same. We are on the same page.

> Plus the fact that we now dereference rte_eth_devices[] data inside the
> PMD function, which probably prevents the compiler and CPU from loading
> rte_eth_devices[port_id].data and rte_eth_devices[port_id].pre_tx_burst_cbs[queue_id]
> in advance, before calling the actual RX/TX function.

Yes.

> About your approach: I don't mind adding an extra opaque 'void *data' pointer,
> but would prefer not to expose the callback invocation code in an inline
> function. The main reason is that I think it still needs to be reworked to
> allow adding/removing callbacks without stopping the device - something
> similar to what was done for cryptodev callbacks. To be able to do that in
> the future without another ABI breakage, the callback-related part needs to
> be kept internal.
> Though what we probably can do: add two dynamic arrays of opaque pointers to
> rte_eth_burst_api - one for rx/tx queue data pointers, and a second for
> rx/tx callback pointers.
> To be more specific, something like:
>
> typedef uint16_t (*rte_eth_rx_burst_t)( void *rxq, struct rte_mbuf **rx_pkts, uint16_t nb_pkts, void *cbs);
> typedef uint16_t (*rte_eth_tx_burst_t)(void *txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts, void *cbs);
> ....
>
> struct rte_eth_burst_api {
>         rte_eth_rx_burst_t rx_pkt_burst;
>         /**< PMD receive function. */
>         rte_eth_tx_burst_t tx_pkt_burst;
>         /**< PMD transmit function. */
>         rte_eth_tx_prep_t tx_pkt_prepare;
>         /**< PMD transmit prepare function. */
>         rte_eth_rx_queue_count_t rx_queue_count;
>         /**< Get the number of used RX descriptors. */
>         rte_eth_rx_descriptor_status_t rx_descriptor_status;
>         /**< Check the status of a Rx descriptor. */
>         rte_eth_tx_descriptor_status_t tx_descriptor_status;
>         /**< Check the status of a Tx descriptor. */
>         struct {
>                  void **queue_data;   /* point to rte_eth_devices[port_id].data-> rx_queues */
>                  void **cbs;                  /*  points to rte_eth_devices[port_id].post_rx_burst_cbs */
>        } rx_data, tx_data;
> } __rte_cache_aligned;
>
> static inline uint16_t
> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> {
>        struct rte_eth_burst_api *p;
>
>         if (port_id >= RTE_MAX_ETHPORTS || queue_id >= RTE_MAX_QUEUES_PER_PORT)
>                 return 0;
>
>       p =  &rte_eth_burst_api[port_id];
>       return p->rx_pkt_burst(p->rx_data.queue_data[queue_id], rx_pkts, nb_pkts, p->rx_data.cbs[queue_id]);



That works.


> }
>
> Same for TX.
>
> If that looks ok to everyone, I'll try to prepare the next version based on that.


Looks good to me.

> In theory that should avoid the extra dereference problem and even reduce indirection.
> As a drawback, data->rxq/txq would always have to be allocated for
> RTE_MAX_QUEUES_PER_PORT entries, but I presume that's not a big deal.
>
> As a side question - is there any reason why rte_ethdev_trace_rx_burst() is
> invoked at the very last point, while rte_ethdev_trace_tx_burst() is invoked
> after the CBs but before the actual tx_pkt_burst()?
> It would make things simpler if tracing were always done either on entrance
> or on exit of rx/tx_burst.

exit is fine.

>
> >
> > My suggestion to fix the problem: remove the additional `data` redirection,
> > pull the callback function pointers back, and keep the rest as opaque, as
> > done in the existing patch, like [1].
> >
> > I don't believe this has any real implication on future ABI stability,
> > as we will not be adding any new items to rte_eth_fp in any way, since new
> > features can be added in the slow-path rte_eth_dev, as mentioned in the
> > patch.

Ack

I will be happy to test again after the rework and report any performance
issues, if any.

Thanks for the great work :-)


  
Ananyev, Konstantin Sept. 22, 2021, 3:08 p.m. UTC | #5
> Ack
>
> I will be happy to test again after the rework and report any performance
> issues, if any.

Thanks Jerin, v2 is out:
https://patches.dpdk.org/project/dpdk/list/?series=19084
Please have a look when you get a chance.
  
Jerin Jacob Sept. 27, 2021, 4:14 p.m. UTC | #6
On Wed, Sep 22, 2021 at 8:38 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> Thanks Jerin, v2 is out:
> https://patches.dpdk.org/project/dpdk/list/?series=19084
> Please have a look when you get a chance.

Tested the series. Looks good, no performance issues.

  
Ananyev, Konstantin Sept. 28, 2021, 9:37 a.m. UTC | #7
> > Thanks Jerin, v2 is out:
> > https://patches.dpdk.org/project/dpdk/list/?series=19084
> > Please have a look when you get a chance.
>
> Tested the series. Looks good, no performance issues.

That's great news, thanks for testing it.
I plan to proceed with a proper v3 in the next few days.