[RFC] ethdev: fast path async flow API

Message ID 20231227105709.1951231-1-dsosnowski@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series [RFC] ethdev: fast path async flow API

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation fail ninja build failure

Commit Message

Dariusz Sosnowski Dec. 27, 2023, 10:57 a.m. UTC
  The goal of the proposed API changes is to reduce the library-level overhead
of performance-critical asynchronous flow API functions -
specifically, the functions which can be called while processing packets
received by the application in the data path.

The plan for the changes is as follows:

1. Fast path flow functions are defined in header file.
   This allows for inlining them into application code.
2. Fast path functions will call the relevant callbacks,
   which are cached in rte_flow_fp_ops array.
    - rte_flow_fp_ops is a flat array containing one entry per port.
    - Each entry holds the fast path callbacks provided by the PMD.
    - By default, each entry holds dummy callbacks defined in flow library,
      which return ENOSYS for all functions.
    - Each entry holds a per-port private context, provided by the PMD,
      which is passed to the callbacks.
    - It is assumed that a callback is always defined for each function -
      either the default callback provided by DPDK or one provided by the PMD.
3. Any API-level checks are compiled out unless
   RTE_FLOW_DEBUG macro is defined during compilation.
4. Each PMD which implements fast path functions must populate
   the rte_flow_fp_ops struct by calling rte_flow_fp_ops_register().
5. rte_flow_fp_ops is reset to default state on port cleanup
   by ethdev library.

As a result of these changes, the relevant callbacks can also be removed
from the rte_flow_ops struct. From now on, rte_flow_ops may be considered
reserved for defining slow path flow API functions.
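
For illustration, here is a minimal sketch of the intended fast path dispatch
(the flat rte_flow_fp_ops array and the RTE_FLOW_DEBUG behavior come from this
proposal; the exact callback prototype, with the per-port ctx as the first
argument, is an assumption):

#include <errno.h>
#include <rte_errno.h>
#include <rte_flow.h>

/* Sketch only - mirrors rte_flow_async_create(), not the actual patch code. */
static inline struct rte_flow *
sketch_async_create(uint16_t port_id, uint32_t queue_id,
                    const struct rte_flow_op_attr *op_attr,
                    struct rte_flow_template_table *template_table,
                    const struct rte_flow_item pattern[],
                    uint8_t pattern_template_index,
                    const struct rte_flow_action actions[],
                    uint8_t actions_template_index,
                    void *user_data,
                    struct rte_flow_error *error)
{
        const struct rte_flow_fp_ops *ops;

#ifdef RTE_FLOW_DEBUG
        /* API-level checks are compiled in only with RTE_FLOW_DEBUG. */
        if (port_id >= RTE_MAX_ETHPORTS) {
                rte_flow_error_set(error, ENODEV,
                                   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
                                   NULL, rte_strerror(ENODEV));
                return NULL;
        }
#endif
        /* One cache-aligned entry per port; it holds either callbacks
         * registered by the PMD via rte_flow_fp_ops_register() or the
         * ENOSYS defaults, so no per-callback NULL check is needed. */
        ops = &rte_flow_fp_ops[port_id];
        return ops->async_create(ops->ctx, queue_id, op_attr, template_table,
                                 pattern, pattern_template_index,
                                 actions, actions_template_index,
                                 user_data, error);
}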

The proposed changes apply to the following functions
(currently all are marked as experimental API):

- rte_flow_async_create()
- rte_flow_async_create_by_index()
- rte_flow_async_actions_update()
- rte_flow_async_destroy()
- rte_flow_push()
- rte_flow_pull()
- rte_flow_async_action_handle_create()
- rte_flow_async_action_handle_destroy()
- rte_flow_async_action_handle_update()
- rte_flow_async_action_handle_query()
- rte_flow_async_action_handle_query_update()
- rte_flow_async_action_list_handle_create()
- rte_flow_async_action_list_handle_destroy()
- rte_flow_async_action_list_handle_query_update()

In internal testing of the mlx5 PMD, with these API changes applied and
the PMD itself adapted, we saw a significant improvement
in single-threaded flow operation throughput:

Flow rule insertion: ~5570 vs ~6060 Kflows/sec -  +8.8%
Flow rule deletion:  ~8420 vs ~9680 Kflows/sec - +15.0%

NVIDIA plans to provide an implementation in mlx5 PMD,
based on these API changes, in 24.03.

**Future considerations**

A case can be made for converting some of the existing
stable API functions to fast path versions in future LTS releases.
I don't have any hard data on how such changes would affect the performance of
these APIs, but I assume the improvement would be noticeable.

However, at the moment I see one problem with this approach.
It would require DPDK to expose the rte_eth_dev struct definition,
because of the implied locking implemented in the flow API.
To be specific, I am referring to fts_enter() and fts_exit()
used in the flow library. As a result, we might not be able to fully move
the implementation of stable APIs to header files.

This applies to the following APIs:

- rte_flow_create()
- rte_flow_destroy()
- rte_flow_actions_update()
- rte_flow_query()
- rte_flow_get_aged_flows()
- rte_flow_action_handle_create()
- rte_flow_action_handle_destroy()
- rte_flow_action_handle_update()
- rte_flow_action_handle_query()
- rte_flow_action_list_handle_create()
- rte_flow_action_list_handle_destroy()
- rte_flow_action_list_handle_query_update()

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
 lib/ethdev/ethdev_driver.c   |   3 +
 lib/ethdev/rte_flow.c        | 637 +++++++++++++++--------------------
 lib/ethdev/rte_flow.h        | 550 +++++++++++++++++++++++++++---
 lib/ethdev/rte_flow_driver.h | 144 ++------
 4 files changed, 813 insertions(+), 521 deletions(-)
  

Comments

Stephen Hemminger Dec. 27, 2023, 5:39 p.m. UTC | #1
On Wed, 27 Dec 2023 12:57:09 +0200
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:

> **Future considerations**
> 
> A case can be made about converting some of the existing
> stable API functions to fast path versions in future LTS releases.
> I don't have any hard data on how such changes would affect performance of
> these APIs, but I assume that the improvement would be noticeable.

The problem is that inline functions create future ABI problems.
Usually, there are other ways to get the same performance benefit.
Often batching updates to hardware will do the trick.
  
Stephen Hemminger Dec. 27, 2023, 5:41 p.m. UTC | #2
On Wed, 27 Dec 2023 12:57:09 +0200
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:

> +/**
> + * @internal
> + *
> + * Fast path async flow functions and related data are held in a flat array, one entry per ethdev.
> + * It is assumed that each entry is read-only and cache aligned.
> + */
> +struct rte_flow_fp_ops {
> +	void *ctx;
> +	rte_flow_async_create_t async_create;
> +	rte_flow_async_create_by_index_t async_create_by_index;
> +	rte_flow_async_actions_update_t async_actions_update;
> +	rte_flow_async_destroy_t async_destroy;
> +	rte_flow_push_t push;
> +	rte_flow_pull_t pull;
> +	rte_flow_async_action_handle_create_t async_action_handle_create;
> +	rte_flow_async_action_handle_destroy_t async_action_handle_destroy;
> +	rte_flow_async_action_handle_update_t async_action_handle_update;
> +	rte_flow_async_action_handle_query_t async_action_handle_query;
> +	rte_flow_async_action_handle_query_update_t async_action_handle_query_update;
> +	rte_flow_async_action_list_handle_create_t async_action_list_handle_create;
> +	rte_flow_async_action_list_handle_destroy_t async_action_list_handle_destroy;
> +	rte_flow_async_action_list_handle_query_update_t async_action_list_handle_query_update;
> +} __rte_cache_aligned;

If you go ahead with this then all usage should be const pointer.
Still think it is bad idea and creates even more technical debt.
  
Dariusz Sosnowski Dec. 28, 2023, 1:53 p.m. UTC | #3
Hi Stephen,

> > +/**
> > + * @internal
> > + *
> > + * Fast path async flow functions and related data are held in a flat array,
> one entry per ethdev.
> > + * It is assumed that each entry is read-only and cache aligned.
> > + */
> > +struct rte_flow_fp_ops {
> > +     void *ctx;
> > +     rte_flow_async_create_t async_create;
> > +     rte_flow_async_create_by_index_t async_create_by_index;
> > +     rte_flow_async_actions_update_t async_actions_update;
> > +     rte_flow_async_destroy_t async_destroy;
> > +     rte_flow_push_t push;
> > +     rte_flow_pull_t pull;
> > +     rte_flow_async_action_handle_create_t async_action_handle_create;
> > +     rte_flow_async_action_handle_destroy_t async_action_handle_destroy;
> > +     rte_flow_async_action_handle_update_t async_action_handle_update;
> > +     rte_flow_async_action_handle_query_t async_action_handle_query;
> > +     rte_flow_async_action_handle_query_update_t
> async_action_handle_query_update;
> > +     rte_flow_async_action_list_handle_create_t
> async_action_list_handle_create;
> > +     rte_flow_async_action_list_handle_destroy_t
> async_action_list_handle_destroy;
> > +     rte_flow_async_action_list_handle_query_update_t
> async_action_list_handle_query_update;
> > +} __rte_cache_aligned;
> 
> If you go ahead with this then all usage should be const pointer.
> Still think it is bad idea and creates even more technical debt.
Could you please elaborate a bit more on why do you think it is a bad idea and why it creates technical debt? 

> > **Future considerations**
> >
> > A case can be made about converting some of the existing stable API 
> > functions to fast path versions in future LTS releases.
> > I don't have any hard data on how such changes would affect 
> > performance of these APIs, but I assume that the improvement would be noticeable.
> 
> The problem is that inline functions create future ABI problems.
Agreed, this is why such a change can only be introduced when a new major ABI version is released.
However, even though inlining could reduce function call overhead, I'm not sure if such a change is needed for synchronous flow API.
I mentioned it here as a thing to consider.

> Usually, there are other ways to get the same performance benefit.
> Often batching updates to hardware will do the trick.
This is true, but we still have some library-level overhead.
To elaborate more, the current state of flow API is as follows:
- With sync flow API:
  - Applications cannot batch flow operations.
- With async flow APIs:
  - Applications can enqueue multiple flow operations and PMDs can issue batches to HW.
  - But there is still one function call per enqueued flow operation.
    The overall API overhead in datapath may be nonnegligible if we consider that applications may enqueue a flow rule creation operation for every packet received in SW.
This proposal specifically aims at reducing API overhead for async flow API in a case mentioned above.

As a side note, we plan to push changes to mlx5 PMD in 24.03 which will reduce PMD overhead in such scenarios.
This proposal's goal is to reduce overhead of async flow API at library level.

Best regards,
Dariusz Sosnowski
  
Ivan Malov Dec. 28, 2023, 2:10 p.m. UTC | #4
Hi Dariusz,

I appreciate the proposal. You say that a reference PMD implementation
will be made available for 24.03 release. What about the applications?
Is this supposed to go to upstream code of some applications?

Thank you.

On Thu, 28 Dec 2023, Dariusz Sosnowski wrote:

> Hi Stephen,
>
>>> +/**
>>> + * @internal
>>> + *
>>> + * Fast path async flow functions and related data are held in a flat array,
>> one entry per ethdev.
>>> + * It is assumed that each entry is read-only and cache aligned.
>>> + */
>>> +struct rte_flow_fp_ops {
>>> +     void *ctx;
>>> +     rte_flow_async_create_t async_create;
>>> +     rte_flow_async_create_by_index_t async_create_by_index;
>>> +     rte_flow_async_actions_update_t async_actions_update;
>>> +     rte_flow_async_destroy_t async_destroy;
>>> +     rte_flow_push_t push;
>>> +     rte_flow_pull_t pull;
>>> +     rte_flow_async_action_handle_create_t async_action_handle_create;
>>> +     rte_flow_async_action_handle_destroy_t async_action_handle_destroy;
>>> +     rte_flow_async_action_handle_update_t async_action_handle_update;
>>> +     rte_flow_async_action_handle_query_t async_action_handle_query;
>>> +     rte_flow_async_action_handle_query_update_t
>> async_action_handle_query_update;
>>> +     rte_flow_async_action_list_handle_create_t
>> async_action_list_handle_create;
>>> +     rte_flow_async_action_list_handle_destroy_t
>> async_action_list_handle_destroy;
>>> +     rte_flow_async_action_list_handle_query_update_t
>> async_action_list_handle_query_update;
>>> +} __rte_cache_aligned;
>>
>> If you go ahead with this then all usage should be const pointer.
>> Still think it is bad idea and creates even more technical debt.
> Could you please elaborate a bit more on why do you think it is a bad idea and why it creates technical debt?
>
>>> **Future considerations**
>>>
>>> A case can be made about converting some of the existing stable API
>>> functions to fast path versions in future LTS releases.
>>> I don't have any hard data on how such changes would affect
>>> performance of these APIs, but I assume that the improvement would be noticeable.
>>
>> The problem is that inline functions create future ABI problems.
> Agreed, this is why such a change can only be introduced when a new major ABI version is released.
> However, even though inlining could reduce function call overhead, I'm not sure if such a change is needed for synchronous flow API.
> I mentioned it here as a thing to consider.
>
>> Usually, there are other ways to get the same performance benefit.
>> Often batching updates to hardware will do the trick.
> This is true, but we still have some library-level overhead.
> To elaborate more, the current state of flow API is as follows:
> - With sync flow API:
>  - Applications cannot batch flow operations.
> - With async flow APIs:
>  - Applications can enqueue multiple flow operations and PMDs can issue batches to HW.
>  - But there is still one function call per enqueued flow operation.
>    The overall API overhead in datapath may be nonnegligible if we consider that applications may enqueue a flow rule creation operation for every packet received in SW.
> This proposal specifically aims at reducing API overhead for async flow API in a case mentioned above.
>
> As a side note, we plan to push changes to mlx5 PMD in 24.03 which will reduce PMD overhead in such scenarios.
> This proposal's goal is to reduce overhead of async flow API at library level.
>
> Best regards,
> Dariusz Sosnowski
>
  
Stephen Hemminger Dec. 28, 2023, 5:16 p.m. UTC | #5
> However, at the moment I see one problem with this approach.
> It would require DPDK to expose the rte_eth_dev struct definition,
> because of implied locking implemented in the flow API.

This is a blocker, showstopper for me.

Have you considered having something like
   rte_flow_create_bulk()

or better yet a Linux iouring style API?

A ring style API would allow for better mixed operations across
the board and get rid of the I-cache overhead which is the root
cause of the needing inline.
  
Dariusz Sosnowski Jan. 3, 2024, 6:01 p.m. UTC | #6
Hi Ivan,

> Hi Dariusz,
> 
> I appreciate the proposal. You say that a reference PMD implementation will
> be made available for 24.03 release. What about the applications?
> Is this supposed to go to upstream code of some applications?
No source code changes are required in applications which already use async flow APIs.
API signatures are not changed in this proposal.

Only the PMD changes are required.
To be specific - callbacks for async flow APIs should not be put in rte_flow_ops,
but registered by calling rte_flow_fp_ops_register(). 

Best regards,
Dariusz Sosnowski
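
For illustration, a hypothetical PMD-side sketch of that registration could
look like this (the rte_flow_fp_ops_register() signature, the ctx-first
callback prototype, and the assumption that the ops struct is copied and that
unset callbacks keep the library defaults are not taken from the RFC):

#include <ethdev_driver.h>
#include <rte_common.h>
#include <rte_flow_driver.h>

/* Placeholder push callback; a real PMD would ring the HW doorbell here. */
static int
sketch_flow_push(void *ctx, uint32_t queue_id, struct rte_flow_error *error)
{
        RTE_SET_USED(ctx);
        RTE_SET_USED(queue_id);
        RTE_SET_USED(error);
        return 0;
}

/* Called somewhere in the PMD's port configuration path (assumption). */
static int
sketch_register_flow_fp_ops(struct rte_eth_dev *dev, void *priv_ctx)
{
        /* Assuming registration copies the struct into the per-port array
         * and callbacks left unset keep the ENOSYS defaults. */
        struct rte_flow_fp_ops ops = {
                .ctx = priv_ctx,
                .push = sketch_flow_push,
        };

        return rte_flow_fp_ops_register(dev->data->port_id, &ops);
}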
  
Ivan Malov Jan. 3, 2024, 6:29 p.m. UTC | #7
Hi Dariusz,

I appreciate your response. All to the point.

I have to confess my question was inspired by the 23.11 merge commit
in OVS mailing list. I first thought that an obvious consumer for
the async flow API could have been OVS but saw no usage of it in
the current code. It was my impression that there had been some
patches in OVS already, waiting either for approval/testing or
for this particular optimisation to be accepted first.

So far I've been mistaken -- there are no such patches,
hence my question. Do we have real-world examples of
the async flow usage? Should it be tested somehow...

(I apologise in case I'm asking for too many clarifications).

Thank you.

On Wed, 3 Jan 2024, Dariusz Sosnowski wrote:

> Hi Ivan,
>
>> Hi Dariusz,
>>
>> I appreciate the proposal. You say that a reference PMD implementation will
>> be made available for 24.03 release. What about the applications?
>> Is this supposed to go to upstream code of some applications?
> No source code changes are required in applications which already use async flow APIs.
> API signatures are not changed in this proposal.
>
> Only the PMD changes are required.
> To be specific - callbacks for async flow APIs should not be put in rte_flow_ops,
> but registered by calling rte_flow_fp_ops_register().
>
> Best regards,
> Dariusz Sosnowski
>
  
Dariusz Sosnowski Jan. 3, 2024, 7:14 p.m. UTC | #8
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, December 28, 2023 18:17
> > However, at the moment I see one problem with this approach.
> > It would require DPDK to expose the rte_eth_dev struct definition,
> > because of implied locking implemented in the flow API.
> 
> This is a blocker, showstopper for me.
+1

> Have you considered having something like
>    rte_flow_create_bulk()
> 
> or better yet a Linux iouring style API?
> 
> A ring style API would allow for better mixed operations across the board and
> get rid of the I-cache overhead which is the root cause of the needing inline.
Existing async flow API is somewhat close to the io_uring interface.
The difference being that queue is not directly exposed to the application.
Application interacts with the queue using rte_flow_async_* APIs (e.g., places operations in the queue, pushes them to the HW).
Such design has some benefits over a flow API which exposes the queue to the user:
- Easier to use - Applications do not manage the queue directly, they do it through exposed APIs.
- Consistent with other DPDK APIs - In other libraries, queues are manipulated through API, not directly by an application.
- Lower memory usage - only HW primitives are needed (e.g., HW queue on PMD side), no need to allocate separate application queues.

Bulking of flow operations is a tricky subject.
Compared to packet processing, where it is desired to keep the manipulation of raw packet data to the minimum (e.g., only packet headers are accessed),
during flow rule creation all items and actions must be processed by PMD to create a flow rule.
The amount of memory consumed by items and actions themselves during this process might be nonnegligible.
If flow rule operations were bulked, the size of working set of memory would increase, which could have negative consequences on the cache behavior.
So, it might be the case that by utilizing bulking the I-cache overhead is removed, but the D-cache overhead is added.
On the other hand, creating flow rule operations (or enqueuing flow rule operations) one by one enables applications to reuse the same memory for different flow rules.

In summary, in my opinion extending the async flow API with bulking capabilities or exposing the queue directly to the application is not desirable.
This proposal aims to reduce the I-cache overhead in async flow API by reusing the existing design pattern in DPDK - fast path functions are inlined to the application code and they call cached PMD callbacks.

Best regards,
Dariusz Sosnowski
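
To make the model above concrete, here is a minimal application-side sketch
using the existing API (the template table, pattern and actions are assumed
to be prepared elsewhere; template index 0 and the completion array size of
32 are arbitrary):

#include <rte_common.h>
#include <rte_flow.h>

/* Enqueue one rule creation, push the queue to HW, then poll for completions. */
static int
sketch_enqueue_one_rule(uint16_t port_id, uint32_t queue_id,
                        struct rte_flow_template_table *table,
                        const struct rte_flow_item pattern[],
                        const struct rte_flow_action actions[])
{
        const struct rte_flow_op_attr attr = { .postpone = 1 }; /* batch in SW */
        struct rte_flow_op_result results[32];
        struct rte_flow_error error;
        struct rte_flow *flow;
        int n;

        /* Place the operation on the flow queue; nothing reaches HW yet. */
        flow = rte_flow_async_create(port_id, queue_id, &attr, table,
                                     pattern, 0, actions, 0, NULL, &error);
        if (flow == NULL)
                return -1;

        /* Push all postponed operations on this queue to the HW. */
        if (rte_flow_push(port_id, queue_id, &error) < 0)
                return -1;

        /* Poll for completions of previously pushed operations. */
        n = rte_flow_pull(port_id, queue_id, results, RTE_DIM(results), &error);
        return n < 0 ? -1 : 0;
}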
  
Stephen Hemminger Jan. 4, 2024, 1:07 a.m. UTC | #9
On Wed, 3 Jan 2024 19:14:49 +0000
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:

> In summary, in my opinion extending the async flow API with bulking capabilities or exposing the queue directly to the application is not desirable.
> This proposal aims to reduce the I-cache overhead in async flow API by reusing the existing design pattern in DPDK - fast path functions are inlined to the application code and they call cached PMD callbacks.

Inline needs to be more discouraged in DPDK, because it only works if the application ends up building with DPDK from source.
It doesn't work for the Linux distro packaging model and symbol versioning, etc.
  
Konstantin Ananyev Jan. 4, 2024, 8:47 a.m. UTC | #10
> > This is a blocker, showstopper for me.
> +1
> 
> > Have you considered having something like
> >    rte_flow_create_bulk()
> >
> > or better yet a Linux iouring style API?
> >
> > A ring style API would allow for better mixed operations across the board and
> > get rid of the I-cache overhead which is the root cause of the needing inline.
> Existing async flow API is somewhat close to the io_uring interface.
> The difference being that queue is not directly exposed to the application.
> Application interacts with the queue using rte_flow_async_* APIs (e.g., places operations in the queue, pushes them to the HW).
> Such design has some benefits over a flow API which exposes the queue to the user:
> - Easier to use - Applications do not manage the queue directly, they do it through exposed APIs.
> - Consistent with other DPDK APIs - In other libraries, queues are manipulated through API, not directly by an application.
> - Lower memory usage - only HW primitives are needed (e.g., HW queue on PMD side), no need to allocate separate application
> queues.
> 
> Bulking of flow operations is a tricky subject.
> Compared to packet processing, where it is desired to keep the manipulation of raw packet data to the minimum (e.g., only packet
> headers are accessed),
> during flow rule creation all items and actions must be processed by PMD to create a flow rule.
> The amount of memory consumed by items and actions themselves during this process might be nonnegligible.
> If flow rule operations were bulked, the size of working set of memory would increase, which could have negative consequences on
> the cache behavior.
> So, it might be the case that by utilizing bulking the I-cache overhead is removed, but the D-cache overhead is added.

Is rte_flow struct really that big?
We do bulk processing for mbufs, crypto_ops, etc., and usually bulk processing improves performance not degrades it.
Of course bulk size has to be somewhat reasonable.

> On the other hand, creating flow rule operations (or enqueuing flow rule operations) one by one enables applications to reuse the
> same memory for different flow rules.
> 
> In summary, in my opinion extending the async flow API with bulking capabilities or exposing the queue directly to the application is
> not desirable.
> This proposal aims to reduce the I-cache overhead in async flow API by reusing the existing design pattern in DPDK - fast path
> functions are inlined to the application code and they call cached PMD callbacks.
> 
> Best regards,
> Dariusz Sosnowski
  
Dariusz Sosnowski Jan. 4, 2024, 1:13 p.m. UTC | #11
> -----Original Message-----
> From: Ivan Malov <ivan.malov@arknetworks.am>
> Sent: Wednesday, January 3, 2024 19:29
> Hi Dariusz,
> 
> I appreciate your response. All to the point.
> 
> I have to confess my question was inspired by the 23.11 merge commit in OVS
> mailing list. I first thought that an obvious consumer for the async flow API
> could have been OVS but saw no usage of it in the current code. It was my
> impression that there had been some patches in OVS already, waiting either
> for approval/testing or for this particular optimisation to be accepted first.
> 
> So far I've been mistaken -- there are no such patches, hence my question. Do
> we have real-world examples of the async flow usage? Should it be tested
> somehow...
> 
> (I apologise in case I'm asking for too many clarifications).
> 
> Thank you.
No need to apologize :)

Unfortunately, we are yet to see async flow API adoption in other open-source projects.
Until now, only direct NVIDIA customers use async flow API in their products.

Best regards,
Dariusz Sosnowski
  
Dariusz Sosnowski Jan. 4, 2024, 4:08 p.m. UTC | #12
Hi Konstantin,

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Thursday, January 4, 2024 09:47
> > > This is a blocker, showstopper for me.
> > +1
> >
> > > Have you considered having something like
> > >    rte_flow_create_bulk()
> > >
> > > or better yet a Linux iouring style API?
> > >
> > > A ring style API would allow for better mixed operations across the
> > > board and get rid of the I-cache overhead which is the root cause of the
> needing inline.
> > Existing async flow API is somewhat close to the io_uring interface.
> > The difference being that queue is not directly exposed to the application.
> > Application interacts with the queue using rte_flow_async_* APIs (e.g.,
> places operations in the queue, pushes them to the HW).
> > Such design has some benefits over a flow API which exposes the queue to
> the user:
> > - Easier to use - Applications do not manage the queue directly, they do it
> through exposed APIs.
> > - Consistent with other DPDK APIs - In other libraries, queues are
> manipulated through API, not directly by an application.
> > - Lower memory usage - only HW primitives are needed (e.g., HW queue
> > on PMD side), no need to allocate separate application queues.
> >
> > Bulking of flow operations is a tricky subject.
> > Compared to packet processing, where it is desired to keep the
> > manipulation of raw packet data to the minimum (e.g., only packet
> > headers are accessed), during flow rule creation all items and actions must
> be processed by PMD to create a flow rule.
> > The amount of memory consumed by items and actions themselves during
> this process might be nonnegligible.
> > If flow rule operations were bulked, the size of working set of memory
> > would increase, which could have negative consequences on the cache
> behavior.
> > So, it might be the case that by utilizing bulking the I-cache overhead is
> removed, but the D-cache overhead is added.
> 
> Is rte_flow struct really that big?
> We do bulk processing for mbufs, crypto_ops, etc., and usually bulk
> processing improves performance not degrades it.
> Of course bulk size has to be somewhat reasonable.
It does not really depend on rte_flow struct size itself (it's opaque to the user), but on sizes of items and actions which are the parameters for flow operations.
To create a flow through async flow API the following is needed:
- array of items and their spec,
- array of actions and their configuration,
- pointer to template table,
- indexes of pattern and actions templates to be used.
If we assume a simple case of ETH/IPV4/TCP/END match and COUNT/RSS/END actions, then we have at most:
- 4 items (32B each) + 3 specs (20B each) = 188B
- 3 actions (16B each) + 2 configurations (4B and 40B) = 92B
- 8B for table pointer
- 2B for template indexes
In total = 290B.
Bulk API can be designed in a way that single bulk operates on a single set of tables and templates - this would remove a few bytes.
Flow actions can be based on actions templates (so no need for conf), but items' specs are still needed.
This would leave us at 236B, so at least 4 cache lines (assuming everything is tightly packed) for a single flow and almost twice the size of the mbuf.
Depending on the bulk size it might be a much more significant chunk of the cache.

I don't want to dismiss the idea. I think it's worth of evaluation.
However, I'm not entirely confident if bulking API would introduce performance benefits.

Best regards,
Dariusz Sosnowski
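
To make the footprint above concrete, the per-operation parameters for that
example could look roughly as follows (a sketch; the sizes in the comments are
approximate for a typical 64-bit build, and the specs/configurations are left
zero-initialized):

#include <rte_flow.h>

/* Specs for the 3 matched items (~20 B each). */
static const struct rte_flow_item_eth eth_spec;
static const struct rte_flow_item_ipv4 ipv4_spec;
static const struct rte_flow_item_tcp tcp_spec;

/* 4 items, 32 B each on 64-bit (type + spec/last/mask pointers). */
static const struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_spec },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_spec },
        { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_spec },
        { .type = RTE_FLOW_ITEM_TYPE_END },
};

/* Configurations for COUNT (~4 B) and RSS (~40 B). */
static const struct rte_flow_action_count count_conf;
static const struct rte_flow_action_rss rss_conf;

/* 3 actions, 16 B each (type + conf pointer). */
static const struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_COUNT, .conf = &count_conf },
        { .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss_conf },
        { .type = RTE_FLOW_ACTION_TYPE_END },
};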
  
Dariusz Sosnowski Jan. 23, 2024, 11:37 a.m. UTC | #13
Hi Stephen,

Sorry for such a late response.

From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Thursday, January 4, 2024 02:08
> On Wed, 3 Jan 2024 19:14:49 +0000
> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> > In summary, in my opinion extending the async flow API with bulking
> capabilities or exposing the queue directly to the application is not desirable.
> > This proposal aims to reduce the I-cache overhead in async flow API by
> reusing the existing design pattern in DPDK - fast path functions are inlined to
> the application code and they call cached PMD callbacks.
> 
> Inline needs to more discouraged in DPDK, because it only works if application
> ends up building with DPDK from source.
> It doesn't work for the Linux distro packaging model and symbol versioning,
> etc.

I understand the problem. In that case, I have a proposal.
I had a chat with Thomas regarding this RFC, and he noticed that there are 2 separate changes proposed here:

1. Per-port callbacks for async flow API.
    - Moves specified callbacks to rte_flow_fp_ops struct and allow PMDs to register them dynamically.
    - Removes indirection at API level - no need to call rte_flow_ops_get().
    - Removes checking if callbacks are defined - either the default DPDK callback is used or the one provided by PMD.
2. Make async flow API functions inlineable.

Change (1) won't break ABI (existing callbacks in rte_flow_ops can be marked as deprecated for now and phased out later) and in my opinion is less controversial compared to change (2).

I'll rerun the benchmarks for both changes separately to compare their benefits in isolation.
This would allow us to decide if change (2) is worth the drawbacks it introduces.

What do you think?

Best regards,
Dariusz Sosnowski
  
Dariusz Sosnowski Jan. 29, 2024, 1:38 p.m. UTC | #14
Hi all,

> -----Original Message-----
> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Sent: Tuesday, January 23, 2024 12:37
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Ori Kam <orika@nvidia.com>;
> dev@dpdk.org
> Subject: RE: [RFC] ethdev: fast path async flow API
> 
> Hi Stephen,
> 
> Sorry for such a late response.
> 
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, January 4, 2024 02:08
> > On Wed, 3 Jan 2024 19:14:49 +0000
> > Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> > > In summary, in my opinion extending the async flow API with bulking
> > capabilities or exposing the queue directly to the application is not desirable.
> > > This proposal aims to reduce the I-cache overhead in async flow API
> > > by
> > reusing the existing design pattern in DPDK - fast path functions are
> > inlined to the application code and they call cached PMD callbacks.
> >
> > Inline needs to more discouraged in DPDK, because it only works if
> > application ends up building with DPDK from source.
> > It doesn't work for the Linux distro packaging model and symbol
> > versioning, etc.
> 
> I understand the problem. In that case, I have a proposal.
> I had a chat with Thomas regarding this RFC, and he noticed that there are 2
> separate changes proposed here:
> 
> 1. Per-port callbacks for async flow API.
>     - Moves specified callbacks to rte_flow_fp_ops struct and allow PMDs to
> register them dynamically.
>     - Removes indirection at API level - no need to call rte_flow_ops_get().
>     - Removes checking if callbacks are defined - either the default DPDK callback
> is used or the one provided by PMD.
> 2. Make async flow API functions inlineable.
> 
> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be marked
> as deprecated for now and phased out later) and in my opinion is less
> controversial compared to change (2).
> 
> I'll rerun the benchmarks for both changes separately to compare their
> benefits in isolation.
> This would allow us to decide if change (2) is worth the drawbacks it
> introduces.
> 
> What do you think?

I split the RFC into 2 parts:

1. Introduce per-port callbacks:
    - Introduce rte_flow_fp_ops struct - holds callbacks for fast path calls, for each port. PMD registers callbacks through rte_flow_fp_ops_register().
    - Relevant rte_flow_async_* functions just pass arguments to fast path callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is defined.
    - Biggest difference is the removal of callback access through rte_flow_get_ops().
2. Inline async flow API functions.
    - Relevant rte_flow_async_* functions definitions are moved to rte_flow.h and made inlineable.

Here are the results of the benchmark:

|                       | Avg Insertion | Diff over baseline | Avg Deletion | Diff over baseline |
|-----------------------|---------------|--------------------|--------------|--------------------|
| baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
| applied (1)           |     6384.2    |    +528.8 (+9%)    |    10054.2   |  +1027.4 (+11.4%)  |
| applied (2)           |     6434.6    |   +579.2 (+9.9%)   |    10011.4   |   +984.6 (+10.9%)  |

Results are in KFlows/sec.
The benchmark is continuously inserting and deleting 2M flow rules.
These rules match on IPv4 destination address and with a single action DROP.
Flow rules are inserted and deleted using a single flow queue.

Change (2) improves insertion rate performance by ~1% compared to (1), but decreases deletion rate by ~0.5%.
Based on these results, I think we can say that making rte_flow_async_*() calls inlineable may not be worth it compared to the issues it causes.

What do you all think about the results?

Best regards,
Dariusz Sosnowski
  
Ferruh Yigit Jan. 29, 2024, 5:36 p.m. UTC | #15
On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
> Hi all,
> 
>> -----Original Message-----
>> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
>> Sent: Tuesday, January 23, 2024 12:37
>> To: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
>> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
>> <andrew.rybchenko@oktetlabs.ru>; Ori Kam <orika@nvidia.com>;
>> dev@dpdk.org
>> Subject: RE: [RFC] ethdev: fast path async flow API
>>
>> Hi Stephen,
>>
>> Sorry for such a late response.
>>
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Sent: Thursday, January 4, 2024 02:08
>>> On Wed, 3 Jan 2024 19:14:49 +0000
>>> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
>>>> In summary, in my opinion extending the async flow API with bulking
>>> capabilities or exposing the queue directly to the application is not desirable.
>>>> This proposal aims to reduce the I-cache overhead in async flow API
>>>> by
>>> reusing the existing design pattern in DPDK - fast path functions are
>>> inlined to the application code and they call cached PMD callbacks.
>>>
>>> Inline needs to more discouraged in DPDK, because it only works if
>>> application ends up building with DPDK from source.
>>> It doesn't work for the Linux distro packaging model and symbol
>>> versioning, etc.
>>
>> I understand the problem. In that case, I have a proposal.
>> I had a chat with Thomas regarding this RFC, and he noticed that there are 2
>> separate changes proposed here:
>>
>> 1. Per-port callbacks for async flow API.
>>     - Moves specified callbacks to rte_flow_fp_ops struct and allow PMDs to
>> register them dynamically.
>>     - Removes indirection at API level - no need to call rte_flow_ops_get().
>>     - Removes checking if callbacks are defined - either the default DPDK callback
>> is used or the one provided by PMD.
>> 2. Make async flow API functions inlineable.
>>
>> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be marked
>> as deprecated for now and phased out later) and in my opinion is less
>> controversial compared to change (2).
>>
>> I'll rerun the benchmarks for both changes separately to compare their
>> benefits in isolation.
>> This would allow us to decide if change (2) is worth the drawbacks it
>> introduces.
>>
>> What do you think?
> 
> I split the RFC into 2 parts:
> 
> 1. Introduce per-port callbacks:
>     - Introduce rte_flow_fp_ops struct - holds callbacks for fast path calls, for each port. PMD registers callbacks through rte_flow_fp_ops_register().
>     - Relevant rte_flow_async_* functions just pass arguments to fast path callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is defined.
>     - Biggest difference is the removal of callback access through rte_flow_get_ops().
> 2. Inline async flow API functions.
>     - Relevant rte_flow_async_* functions definitions are moved to rte_flow.h and made inlineable.
> 
> Here are the results of the benchmark:
> 
> |                       | Avg Insertion | Diff over baseline | Avg Deletion | Diff over baseline |
> |-----------------------|---------------|--------------------|--------------|--------------------|
> | baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
> | applied (1)           |     6384.2    |    +528.8 (+9%)    |    10054.2   |  +1027.4 (+11.4%)  |
> | applied (2)           |     6434.6    |   +579.2 (+9.9%)   |    10011.4   |   +984.6 (+10.9%)  |
> 
> Results are in KFlows/sec.
> The benchmark is continuously inserting and deleting 2M flow rules.
> These rules match on IPv4 destination address and with a single action DROP.
> Flow rules are inserted and deleted using a single flow queue.
> 
> Change (2) improves insertion rate performance by ~1% compared to (1), but decreases deletion rate by ~0.5%.
> Based on these results, I think we can say that making rte_flow_async_*() calls inlineable may not be worth it compared to the issues it causes.
> 
> What do you all think about the results?
> 

Hi Dariusz,

As discussed before, converting APIs to static inline functions or
exposing 'rte_eth_dev' has cons, but applying only the first part above
seems to get rid of them with a reasonable performance gain, so I think we
can continue with this approach and continue with reviews.

Most of the 'rte_flow_async_*' functions are already missing the function
validity check, so having a default (dummy?) rte_flow_fp_ops for them even
seems like an improvement there.


Previously only 'struct rte_flow_ops' was coming from drivers, and the ethdev
layer didn't need to maintain anything.
But with the 'rte_flow_fp_ops' struct, ethdev needs to store another array
with a fixed ('RTE_MAX_ETHPORTS') size, which will be another blocker in
the future when converting these fixed arrays to dynamically allocated arrays.

For this, does it still help if we add a new field to "struct
rte_eth_dev", like "struct rte_flow_ops *flow_ops", similar to 'dev_ops'?
The indirection will still be there, but it eliminates the 'rte_flow_get_ops()'
call and the checks that come with it.
If it makes sense, is there a chance to experiment with this and get some
performance numbers for it?
  
Dariusz Sosnowski Jan. 30, 2024, 12:06 p.m. UTC | #16
Hi Ferruh,

> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Monday, January 29, 2024 18:36
> To: Dariusz Sosnowski <dsosnowski@nvidia.com>; Stephen Hemminger
> <stephen@networkplumber.org>
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ori Kam
> <orika@nvidia.com>; dev@dpdk.org
> Subject: Re: [RFC] ethdev: fast path async flow API
> 
> On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
> > Hi all,
> >
> >> -----Original Message-----
> >> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
> >> Sent: Tuesday, January 23, 2024 12:37
> >> To: Stephen Hemminger <stephen@networkplumber.org>
> >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> >> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> >> <andrew.rybchenko@oktetlabs.ru>; Ori Kam <orika@nvidia.com>;
> >> dev@dpdk.org
> >> Subject: RE: [RFC] ethdev: fast path async flow API
> >>
> >> Hi Stephen,
> >>
> >> Sorry for such a late response.
> >>
> >> From: Stephen Hemminger <stephen@networkplumber.org>
> >> Sent: Thursday, January 4, 2024 02:08
> >>> On Wed, 3 Jan 2024 19:14:49 +0000
> >>> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> >>>> In summary, in my opinion extending the async flow API with bulking
> >>> capabilities or exposing the queue directly to the application is not
> desirable.
> >>>> This proposal aims to reduce the I-cache overhead in async flow API
> >>>> by
> >>> reusing the existing design pattern in DPDK - fast path functions
> >>> are inlined to the application code and they call cached PMD callbacks.
> >>>
> >>> Inline needs to more discouraged in DPDK, because it only works if
> >>> application ends up building with DPDK from source.
> >>> It doesn't work for the Linux distro packaging model and symbol
> >>> versioning, etc.
> >>
> >> I understand the problem. In that case, I have a proposal.
> >> I had a chat with Thomas regarding this RFC, and he noticed that
> >> there are 2 separate changes proposed here:
> >>
> >> 1. Per-port callbacks for async flow API.
> >>     - Moves specified callbacks to rte_flow_fp_ops struct and allow
> >> PMDs to register them dynamically.
> >>     - Removes indirection at API level - no need to call rte_flow_ops_get().
> >>     - Removes checking if callbacks are defined - either the default
> >> DPDK callback is used or the one provided by PMD.
> >> 2. Make async flow API functions inlineable.
> >>
> >> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be
> >> marked as deprecated for now and phased out later) and in my opinion
> >> is less controversial compared to change (2).
> >>
> >> I'll rerun the benchmarks for both changes separately to compare
> >> their benefits in isolation.
> >> This would allow us to decide if change (2) is worth the drawbacks it
> >> introduces.
> >>
> >> What do you think?
> >
> > I split the RFC into 2 parts:
> >
> > 1. Introduce per-port callbacks:
> >     - Introduce rte_flow_fp_ops struct - holds callbacks for fast path calls, for
> each port. PMD registers callbacks through rte_flow_fp_ops_register().
> >     - Relevant rte_flow_async_* functions just pass arguments to fast path
> callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is
> defined.
> >     - Biggest difference is the removal of callback access through
> rte_flow_get_ops().
> > 2. Inline async flow API functions.
> >     - Relevant rte_flow_async_* functions definitions are moved to rte_flow.h
> and made inlineable.
> >
> > Here are the results of the benchmark:
> >
> > |                       | Avg Insertion | Diff over baseline | Avg
> > | Deletion | Diff over baseline |
> > |-----------------------|---------------|--------------------|--------------|-------------
> -------|
> > | baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
> > | applied (1)           |     6384.2    |    +528.8 (+9%)    |    10054.2   |  +1027.4
> (+11.4%)  |
> > | applied (2)           |     6434.6    |   +579.2 (+9.9%)   |    10011.4   |   +984.6
> (+10.9%)  |
> >
> > Results are in KFlows/sec.
> > The benchmark is continuously inserting and deleting 2M flow rules.
> > These rules match on IPv4 destination address and with a single action
> DROP.
> > Flow rules are inserted and deleted using a single flow queue.
> >
> > Change (2) improves insertion rate performance by ~1% compared to (1),
> but decreases deletion rate by ~0.5%.
> > Based on these results, I think we can say that making rte_flow_async_*()
> calls inlineable may not be worth it compared to the issues it causes.
> >
> > What do you all think about the results?
> >
> 
> Hi Dariusz,
> 
> As discussed before, converting APIs to static inline functions or exposing
> 'rte_eth_dev' has cons but with only applying first part above seems get rid of
> them with reasonable performance gain, so I think we can continue with this
> approach and continue to reviews.
> 
> Most of the 'rte_flow_async_*' are already missing the function validity check,
> so having a default (dummy?) rte_flow_fp_ops for them seems even an
> improvement there.
> 
> 
> Only previously 'struct rte_flow_ops' was coming from drivers, ethdev layer
> doesn't need to maintain anything.
> But with 'rte_flow_fp_ops' struct, ethdev needs to store another array with
> fixed ('RTE_MAX_ETHPORTS') size which will be another blocker in the future
> to convert this fixed arrays to dynamically allocated arrays.
> 
> For this, does it still help we add an a new field to "struct rte_eth_dev", like
> "struct rte_flow_ops *flow_ops", similar to 'dev_ops'?
> The indirection still will be there, but eliminate 'rte_flow_get_ops()'
> call and checks comes with it.
> If makes sense, is there a chance to experiment this and get some performance
> numbers with it?
Which option do you have in mind specifically?

1. Keeping only fast path callbacks in "dev->flow_ops", so "struct rte_eth_dev" will hold only a pointer to "struct rte_flow_fp_ops" as defined in RFC.
    - Only async flow API will use "dev->flow_ops->callback".
    - Other APIs will go through "rte_flow_ops_get()"
2. Keeping all flow callbacks in "dev->flow_ops", so "struct rte_flow_ops".
    - As a result, I think that "rte_flow_get_ops()" could be removed altogether assuming that all functions have default implementation,
      checking for port validity (ENODEV if port not valid) and returning ENOSYS.

Regardless of the version, I can experiment with additional indirection.

Best regards,
Dariusz Sosnowski
  
Ferruh Yigit Jan. 30, 2024, 12:17 p.m. UTC | #17
On 1/30/2024 12:06 PM, Dariusz Sosnowski wrote:
> Hi Ferruh,
> 
>> -----Original Message-----
>> From: Ferruh Yigit <ferruh.yigit@amd.com>
>> Sent: Monday, January 29, 2024 18:36
>> To: Dariusz Sosnowski <dsosnowski@nvidia.com>; Stephen Hemminger
>> <stephen@networkplumber.org>
>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
>> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ori Kam
>> <orika@nvidia.com>; dev@dpdk.org
>> Subject: Re: [RFC] ethdev: fast path async flow API
>>
>> On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
>>> Hi all,
>>>
>>>> -----Original Message-----
>>>> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
>>>> Sent: Tuesday, January 23, 2024 12:37
>>>> To: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
>>>> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
>>>> <andrew.rybchenko@oktetlabs.ru>; Ori Kam <orika@nvidia.com>;
>>>> dev@dpdk.org
>>>> Subject: RE: [RFC] ethdev: fast path async flow API
>>>>
>>>> Hi Stephen,
>>>>
>>>> Sorry for such a late response.
>>>>
>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>> Sent: Thursday, January 4, 2024 02:08
>>>>> On Wed, 3 Jan 2024 19:14:49 +0000
>>>>> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
>>>>>> In summary, in my opinion extending the async flow API with bulking
>>>>> capabilities or exposing the queue directly to the application is not
>> desirable.
>>>>>> This proposal aims to reduce the I-cache overhead in async flow API
>>>>>> by
>>>>> reusing the existing design pattern in DPDK - fast path functions
>>>>> are inlined to the application code and they call cached PMD callbacks.
>>>>>
>>>>> Inline needs to more discouraged in DPDK, because it only works if
>>>>> application ends up building with DPDK from source.
>>>>> It doesn't work for the Linux distro packaging model and symbol
>>>>> versioning, etc.
>>>>
>>>> I understand the problem. In that case, I have a proposal.
>>>> I had a chat with Thomas regarding this RFC, and he noticed that
>>>> there are 2 separate changes proposed here:
>>>>
>>>> 1. Per-port callbacks for async flow API.
>>>>     - Moves specified callbacks to rte_flow_fp_ops struct and allow
>>>> PMDs to register them dynamically.
>>>>     - Removes indirection at API level - no need to call rte_flow_ops_get().
>>>>     - Removes checking if callbacks are defined - either the default
>>>> DPDK callback is used or the one provided by PMD.
>>>> 2. Make async flow API functions inlineable.
>>>>
>>>> Change (1) won't break ABI (existing callbacks in rte_flow_ops can be
>>>> marked as deprecated for now and phased out later) and in my opinion
>>>> is less controversial compared to change (2).
>>>>
>>>> I'll rerun the benchmarks for both changes separately to compare
>>>> their benefits in isolation.
>>>> This would allow us to decide if change (2) is worth the drawbacks it
>>>> introduces.
>>>>
>>>> What do you think?
>>>
>>> I split the RFC into 2 parts:
>>>
>>> 1. Introduce per-port callbacks:
>>>     - Introduce rte_flow_fp_ops struct - holds callbacks for fast path calls, for
>> each port. PMD registers callbacks through rte_flow_fp_ops_register().
>>>     - Relevant rte_flow_async_* functions just pass arguments to fast path
>> callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is
>> defined.
>>>     - Biggest difference is the removal of callback access through
>> rte_flow_get_ops().
>>> 2. Inline async flow API functions.
>>>     - Relevant rte_flow_async_* functions definitions are moved to rte_flow.h
>> and made inlineable.
>>>
>>> Here are the results of the benchmark:
>>>
>>> |                       | Avg Insertion | Diff over baseline | Avg
>>> | Deletion | Diff over baseline |
>>> |-----------------------|---------------|--------------------|--------------|-------------
>> -------|
>>> | baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
>>> | applied (1)           |     6384.2    |    +528.8 (+9%)    |    10054.2   |  +1027.4
>> (+11.4%)  |
>>> | applied (2)           |     6434.6    |   +579.2 (+9.9%)   |    10011.4   |   +984.6
>> (+10.9%)  |
>>>
>>> Results are in KFlows/sec.
>>> The benchmark is continuously inserting and deleting 2M flow rules.
>>> These rules match on IPv4 destination address and with a single action
>> DROP.
>>> Flow rules are inserted and deleted using a single flow queue.
>>>
>>> Change (2) improves insertion rate performance by ~1% compared to (1),
>> but decreases deletion rate by ~0.5%.
>>> Based on these results, I think we can say that making rte_flow_async_*()
>> calls inlineable may not be worth it compared to the issues it causes.
>>>
>>> What do you all think about the results?
>>>
>>
>> Hi Dariusz,
>>
>> As discussed before, converting APIs to static inline functions or exposing
>> 'rte_eth_dev' has cons but with only applying first part above seems get rid of
>> them with reasonable performance gain, so I think we can continue with this
>> approach and continue to reviews.
>>
>> Most of the 'rte_flow_async_*' are already missing the function validity check,
>> so having a default (dummy?) rte_flow_fp_ops for them seems even an
>> improvement there.
>>
>>
>> Only previously 'struct rte_flow_ops' was coming from drivers, ethdev layer
>> doesn't need to maintain anything.
>> But with 'rte_flow_fp_ops' struct, ethdev needs to store another array with
>> fixed ('RTE_MAX_ETHPORTS') size which will be another blocker in the future
>> to convert this fixed arrays to dynamically allocated arrays.
>>
>> For this, does it still help we add an a new field to "struct rte_eth_dev", like
>> "struct rte_flow_ops *flow_ops", similar to 'dev_ops'?
>> The indirection still will be there, but eliminate 'rte_flow_get_ops()'
>> call and checks comes with it.
>> If makes sense, is there a chance to experiment this and get some performance
>> numbers with it?
> Which option do you have in mind specifically?
> 
> 1. Keeping only fast path callbacks in "dev->flow_ops", so "struct rte_eth_dev" will hold only a pointer to "struct rte_flow_fp_ops" as defined in RFC.
>     - Only async flow API will use "dev->flow_ops->callback".
>     - Other APIs will go through "rte_flow_ops_get()"
> 2. Keeping all flow callbacks in "dev->flow_ops", so "struct rte_flow_ops".
>     - As a result, I think that "rte_flow_get_ops()" could be removed altogether assuming that all functions have default implementation,
>       checking for port validity (ENODEV if port not valid) and returning ENOSYS.
> 
> Regardless of the version, I can experiment with additional indirection.
> 
> 

Both can work, and both have some cons.

I think "rte_flow_ops_get()" is the clearer approach and rte_flow APIs are
already using it, so no change is required, but using a separate struct
for the fast path will create two different structs drivers need to fill,
and each is populated in a slightly different way, which is not nice.

For this RFC I think we can go with option 1, and consider updating the rest
if there is more motivation for it.
  
Dariusz Sosnowski Jan. 30, 2024, 4:08 p.m. UTC | #18
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Tuesday, January 30, 2024 13:17
> To: Dariusz Sosnowski <dsosnowski@nvidia.com>; Stephen Hemminger
> <stephen@networkplumber.org>
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ori Kam
> <orika@nvidia.com>; dev@dpdk.org
> Subject: Re: [RFC] ethdev: fast path async flow API 
> 
> On 1/30/2024 12:06 PM, Dariusz Sosnowski wrote:
> > Hi Ferruh,
> >
> >> -----Original Message-----
> >> From: Ferruh Yigit <ferruh.yigit@amd.com>
> >> Sent: Monday, January 29, 2024 18:36
> >> To: Dariusz Sosnowski <dsosnowski@nvidia.com>; Stephen Hemminger
> >> <stephen@networkplumber.org>
> >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> >> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ori Kam
> >> <orika@nvidia.com>; dev@dpdk.org
> >> Subject: Re: [RFC] ethdev: fast path async flow API
> >>
> >> On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
> >>> Hi all,
> >>>
> >>>> -----Original Message-----
> >>>> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
> >>>> Sent: Tuesday, January 23, 2024 12:37
> >>>> To: Stephen Hemminger <stephen@networkplumber.org>
> >>>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL)
> <thomas@monjalon.net>;
> >>>> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> >>>> <andrew.rybchenko@oktetlabs.ru>; Ori Kam <orika@nvidia.com>;
> >>>> dev@dpdk.org
> >>>> Subject: RE: [RFC] ethdev: fast path async flow API
> >>>>
> >>>> Hi Stephen,
> >>>>
> >>>> Sorry for such a late response.
> >>>>
> >>>> From: Stephen Hemminger <stephen@networkplumber.org>
> >>>> Sent: Thursday, January 4, 2024 02:08
> >>>>> On Wed, 3 Jan 2024 19:14:49 +0000
> >>>>> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> >>>>>> In summary, in my opinion extending the async flow API with
> >>>>>> bulking
> >>>>> capabilities or exposing the queue directly to the application is
> >>>>> not
> >> desirable.
> >>>>>> This proposal aims to reduce the I-cache overhead in async flow
> >>>>>> API by
> >>>>> reusing the existing design pattern in DPDK - fast path functions
> >>>>> are inlined to the application code and they call cached PMD callbacks.
> >>>>>
> >>>>> Inline needs to more discouraged in DPDK, because it only works if
> >>>>> application ends up building with DPDK from source.
> >>>>> It doesn't work for the Linux distro packaging model and symbol
> >>>>> versioning, etc.
> >>>>
> >>>> I understand the problem. In that case, I have a proposal.
> >>>> I had a chat with Thomas regarding this RFC, and he noticed that
> >>>> there are 2 separate changes proposed here:
> >>>>
> >>>> 1. Per-port callbacks for async flow API.
> >>>>     - Moves specified callbacks to rte_flow_fp_ops struct and allow
> >>>> PMDs to register them dynamically.
> >>>>     - Removes indirection at API level - no need to call rte_flow_ops_get().
> >>>>     - Removes checking if callbacks are defined - either the
> >>>> default DPDK callback is used or the one provided by PMD.
> >>>> 2. Make async flow API functions inlineable.
> >>>>
> >>>> Change (1) won't break ABI (existing callbacks in rte_flow_ops can
> >>>> be marked as deprecated for now and phased out later) and in my
> >>>> opinion is less controversial compared to change (2).
> >>>>
> >>>> I'll rerun the benchmarks for both changes separately to compare
> >>>> their benefits in isolation.
> >>>> This would allow us to decide if change (2) is worth the drawbacks
> >>>> it introduces.
> >>>>
> >>>> What do you think?
> >>>
> >>> I split the RFC into 2 parts:
> >>>
> >>> 1. Introduce per-port callbacks:
> >>>     - Introduce rte_flow_fp_ops struct - holds callbacks for fast
> >>> path calls, for
> >> each port. PMD registers callbacks through rte_flow_fp_ops_register().
> >>>     - Relevant rte_flow_async_* functions just pass arguments to
> >>> fast path
> >> callbacks. Validation checks are done only if RTE_FLOW_DEBUG macro is
> >> defined.
> >>>     - Biggest difference is the removal of callback access through
> >> rte_flow_get_ops().
> >>> 2. Inline async flow API functions.
> >>>     - Relevant rte_flow_async_* functions definitions are moved to
> >>> rte_flow.h
> >> and made inlineable.
> >>>
> >>> Here are the results of the benchmark:
> >>>
> >>> |                       | Avg Insertion | Diff over baseline | Avg
> >>> | Deletion | Diff over baseline |
> >>> |-----------------------|---------------|--------------------|--------------|----------
> ---
> >> -------|
> >>> | baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
> >>> | applied (1)           |     6384.2    |    +528.8 (+9%)    |    10054.2   |  +1027.4
> >> (+11.4%)  |
> >>> | applied (2)           |     6434.6    |   +579.2 (+9.9%)   |    10011.4   |   +984.6
> >> (+10.9%)  |
> >>>
> >>> Results are in KFlows/sec.
> >>> The benchmark is continuously inserting and deleting 2M flow rules.
> >>> These rules match on IPv4 destination address and with a single
> >>> action
> >> DROP.
> >>> Flow rules are inserted and deleted using a single flow queue.
> >>>
> >>> Change (2) improves insertion rate performance by ~1% compared to
> >>> (1),
> >> but decreases deletion rate by ~0.5%.
> >>> Based on these results, I think we can say that making
> >>> rte_flow_async_*()
> >> calls inlineable may not be worth it compared to the issues it causes.
> >>>
> >>> What do you all think about the results?
> >>>
> >>
> >> Hi Dariusz,
> >>
> >> As discussed before, converting APIs to static inline functions or
> >> exposing 'rte_eth_dev' has cons, but applying only the first part
> >> above seems to get rid of them with a reasonable performance gain, so
> >> I think we can continue with this approach and proceed with reviews.
> >>
> >> Most of the 'rte_flow_async_*' functions are already missing the function
> >> validity check, so having a default (dummy?) rte_flow_fp_ops for them
> >> seems like an improvement there.
> >>
> >>
> >> Previously only 'struct rte_flow_ops' was coming from drivers, so the
> >> ethdev layer didn't need to maintain anything.
> >> But with the 'rte_flow_fp_ops' struct, ethdev needs to store another
> >> array of fixed ('RTE_MAX_ETHPORTS') size, which will be another
> >> blocker in the future for converting these fixed arrays to dynamically
> >> allocated arrays.
> >>
> >> For this, does it still help if we add a new field to "struct
> >> rte_eth_dev", like "struct rte_flow_ops *flow_ops", similar to 'dev_ops'?
> >> The indirection will still be there, but it eliminates the
> >> 'rte_flow_get_ops()' call and the checks that come with it.
> >> If it makes sense, is there a chance to experiment with this and get
> >> some performance numbers?
> > Which option do you have in mind specifically?
> >
> > 1. Keeping only fast path callbacks in "dev->flow_ops", so "struct rte_eth_dev"
> >    will hold only a pointer to "struct rte_flow_fp_ops" as defined in the RFC.
> >     - Only async flow API will use "dev->flow_ops->callback".
> >     - Other APIs will go through "rte_flow_ops_get()".
> > 2. Keeping all flow callbacks in "dev->flow_ops", i.e. the whole "struct rte_flow_ops".
> >     - As a result, I think that "rte_flow_get_ops()" could be removed altogether,
> >       assuming that all functions have a default implementation, checking for
> >       port validity (ENODEV if the port is not valid) and returning ENOSYS.
> >
> > Regardless of the version, I can experiment with additional indirection.
> >
> >
> 
> Both can work, but each has some cons.
> 
> I think "rte_flow_ops_get()" is more clear approach and rte_flow APIs already
> using it, so no change is required, but using separate struct for fast path will
> create two different structs drivers needs to fill, and both uses slightly different
> way to populate which is not nice.
> 
> For this RFC I think we can go with option 1, and consider updating the rest
> if there is more motivation for it.

Ok, sounds good to me.

I redid the benchmark with the "dev->flow_ops->callback()" method.
There's no statistically significant difference between this and the static array of rte_flow_fp_ops, so I think we can go with option 1.
I'll send the patch for review as soon as possible.
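
To make that concrete, below is a rough sketch of the option 1 dispatch that was benchmarked, assuming a "flow_ops" pointer of type "struct rte_flow_fp_ops *" is added to "struct rte_eth_dev" (the field name and the exact shape are illustrative only, not the final patch):

int
rte_flow_async_destroy(uint16_t port_id, uint32_t queue_id,
		       const struct rte_flow_op_attr *op_attr,
		       struct rte_flow *flow, void *user_data,
		       struct rte_flow_error *error)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
	/* Assumed field: per-port fast path ops registered by the PMD. */
	const struct rte_flow_fp_ops *ops = dev->flow_ops;

	/* Callback and per-port context are cached at registration time,
	 * so no rte_flow_ops_get() lookup or NULL checks are needed here. */
	return ops->async_destroy(ops->ctx, queue_id, op_attr, flow,
				  user_data, error);
}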

Best regards,
Dariusz Sosnowski
  

Patch

diff --git a/lib/ethdev/ethdev_driver.c b/lib/ethdev/ethdev_driver.c
index bd917a15fc..6c36bce8ab 100644
--- a/lib/ethdev/ethdev_driver.c
+++ b/lib/ethdev/ethdev_driver.c
@@ -10,6 +10,7 @@ 
 
 #include "ethdev_driver.h"
 #include "ethdev_private.h"
+#include "rte_flow_driver.h"
 
 /**
  * A set of values to describe the possible states of a switch domain.
@@ -245,6 +246,8 @@  rte_eth_dev_release_port(struct rte_eth_dev *eth_dev)
 
 	eth_dev_fp_ops_reset(rte_eth_fp_ops + eth_dev->data->port_id);
 
+	rte_flow_fp_ops_reset(eth_dev->data->port_id);
+
 	rte_spinlock_lock(rte_mcfg_ethdev_get_lock());
 
 	eth_dev->state = RTE_ETH_DEV_UNUSED;
diff --git a/lib/ethdev/rte_flow.c b/lib/ethdev/rte_flow.c
index f49d1d3767..a33185d8d2 100644
--- a/lib/ethdev/rte_flow.c
+++ b/lib/ethdev/rte_flow.c
@@ -26,6 +26,8 @@  int32_t rte_flow_dynf_metadata_offs = -1;
 /* Mbuf dynamic field flag bit number for metadata. */
 uint64_t rte_flow_dynf_metadata_mask;
 
+struct rte_flow_fp_ops rte_flow_fp_ops[RTE_MAX_ETHPORTS];
+
 /**
  * Flow elements description tables.
  */
@@ -2000,248 +2002,6 @@  rte_flow_group_set_miss_actions(uint16_t port_id,
 				  NULL, rte_strerror(ENOTSUP));
 }
 
-struct rte_flow *
-rte_flow_async_create(uint16_t port_id,
-		      uint32_t queue_id,
-		      const struct rte_flow_op_attr *op_attr,
-		      struct rte_flow_template_table *template_table,
-		      const struct rte_flow_item pattern[],
-		      uint8_t pattern_template_index,
-		      const struct rte_flow_action actions[],
-		      uint8_t actions_template_index,
-		      void *user_data,
-		      struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	struct rte_flow *flow;
-
-	flow = ops->async_create(dev, queue_id,
-				 op_attr, template_table,
-				 pattern, pattern_template_index,
-				 actions, actions_template_index,
-				 user_data, error);
-	if (flow == NULL)
-		flow_err(port_id, -rte_errno, error);
-
-	rte_flow_trace_async_create(port_id, queue_id, op_attr, template_table,
-				    pattern, pattern_template_index, actions,
-				    actions_template_index, user_data, flow);
-
-	return flow;
-}
-
-struct rte_flow *
-rte_flow_async_create_by_index(uint16_t port_id,
-			       uint32_t queue_id,
-			       const struct rte_flow_op_attr *op_attr,
-			       struct rte_flow_template_table *template_table,
-			       uint32_t rule_index,
-			       const struct rte_flow_action actions[],
-			       uint8_t actions_template_index,
-			       void *user_data,
-			       struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	struct rte_flow *flow;
-
-	flow = ops->async_create_by_index(dev, queue_id,
-					  op_attr, template_table, rule_index,
-					  actions, actions_template_index,
-					  user_data, error);
-	if (flow == NULL)
-		flow_err(port_id, -rte_errno, error);
-	return flow;
-}
-
-int
-rte_flow_async_destroy(uint16_t port_id,
-		       uint32_t queue_id,
-		       const struct rte_flow_op_attr *op_attr,
-		       struct rte_flow *flow,
-		       void *user_data,
-		       struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	ret = flow_err(port_id,
-		       ops->async_destroy(dev, queue_id,
-					  op_attr, flow,
-					  user_data, error),
-		       error);
-
-	rte_flow_trace_async_destroy(port_id, queue_id, op_attr, flow,
-				     user_data, ret);
-
-	return ret;
-}
-
-int
-rte_flow_async_actions_update(uint16_t port_id,
-			      uint32_t queue_id,
-			      const struct rte_flow_op_attr *op_attr,
-			      struct rte_flow *flow,
-			      const struct rte_flow_action actions[],
-			      uint8_t actions_template_index,
-			      void *user_data,
-			      struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	ret = flow_err(port_id,
-		       ops->async_actions_update(dev, queue_id, op_attr,
-						 flow, actions,
-						 actions_template_index,
-						 user_data, error),
-		       error);
-
-	rte_flow_trace_async_actions_update(port_id, queue_id, op_attr, flow,
-					    actions, actions_template_index,
-					    user_data, ret);
-
-	return ret;
-}
-
-int
-rte_flow_push(uint16_t port_id,
-	      uint32_t queue_id,
-	      struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	ret = flow_err(port_id,
-		       ops->push(dev, queue_id, error),
-		       error);
-
-	rte_flow_trace_push(port_id, queue_id, ret);
-
-	return ret;
-}
-
-int
-rte_flow_pull(uint16_t port_id,
-	      uint32_t queue_id,
-	      struct rte_flow_op_result res[],
-	      uint16_t n_res,
-	      struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-	int rc;
-
-	ret = ops->pull(dev, queue_id, res, n_res, error);
-	rc = ret ? ret : flow_err(port_id, ret, error);
-
-	rte_flow_trace_pull(port_id, queue_id, res, n_res, rc);
-
-	return rc;
-}
-
-struct rte_flow_action_handle *
-rte_flow_async_action_handle_create(uint16_t port_id,
-		uint32_t queue_id,
-		const struct rte_flow_op_attr *op_attr,
-		const struct rte_flow_indir_action_conf *indir_action_conf,
-		const struct rte_flow_action *action,
-		void *user_data,
-		struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	struct rte_flow_action_handle *handle;
-
-	handle = ops->async_action_handle_create(dev, queue_id, op_attr,
-					     indir_action_conf, action, user_data, error);
-	if (handle == NULL)
-		flow_err(port_id, -rte_errno, error);
-
-	rte_flow_trace_async_action_handle_create(port_id, queue_id, op_attr,
-						  indir_action_conf, action,
-						  user_data, handle);
-
-	return handle;
-}
-
-int
-rte_flow_async_action_handle_destroy(uint16_t port_id,
-		uint32_t queue_id,
-		const struct rte_flow_op_attr *op_attr,
-		struct rte_flow_action_handle *action_handle,
-		void *user_data,
-		struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	ret = ops->async_action_handle_destroy(dev, queue_id, op_attr,
-					   action_handle, user_data, error);
-	ret = flow_err(port_id, ret, error);
-
-	rte_flow_trace_async_action_handle_destroy(port_id, queue_id, op_attr,
-						   action_handle, user_data, ret);
-
-	return ret;
-}
-
-int
-rte_flow_async_action_handle_update(uint16_t port_id,
-		uint32_t queue_id,
-		const struct rte_flow_op_attr *op_attr,
-		struct rte_flow_action_handle *action_handle,
-		const void *update,
-		void *user_data,
-		struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	ret = ops->async_action_handle_update(dev, queue_id, op_attr,
-					  action_handle, update, user_data, error);
-	ret = flow_err(port_id, ret, error);
-
-	rte_flow_trace_async_action_handle_update(port_id, queue_id, op_attr,
-						  action_handle, update,
-						  user_data, ret);
-
-	return ret;
-}
-
-int
-rte_flow_async_action_handle_query(uint16_t port_id,
-		uint32_t queue_id,
-		const struct rte_flow_op_attr *op_attr,
-		const struct rte_flow_action_handle *action_handle,
-		void *data,
-		void *user_data,
-		struct rte_flow_error *error)
-{
-	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
-	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
-	int ret;
-
-	if (unlikely(!ops))
-		return -rte_errno;
-	ret = ops->async_action_handle_query(dev, queue_id, op_attr,
-					  action_handle, data, user_data, error);
-	ret = flow_err(port_id, ret, error);
-
-	rte_flow_trace_async_action_handle_query(port_id, queue_id, op_attr,
-						 action_handle, data, user_data,
-						 ret);
-
-	return ret;
-}
-
 int
 rte_flow_action_handle_query_update(uint16_t port_id,
 				    struct rte_flow_action_handle *handle,
@@ -2267,35 +2027,6 @@  rte_flow_action_handle_query_update(uint16_t port_id,
 	return flow_err(port_id, ret, error);
 }
 
-int
-rte_flow_async_action_handle_query_update(uint16_t port_id, uint32_t queue_id,
-					  const struct rte_flow_op_attr *attr,
-					  struct rte_flow_action_handle *handle,
-					  const void *update, void *query,
-					  enum rte_flow_query_update_mode mode,
-					  void *user_data,
-					  struct rte_flow_error *error)
-{
-	int ret;
-	struct rte_eth_dev *dev;
-	const struct rte_flow_ops *ops;
-
-	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
-	if (!handle)
-		return -EINVAL;
-	if (!update && !query)
-		return -EINVAL;
-	dev = &rte_eth_devices[port_id];
-	ops = rte_flow_ops_get(port_id, error);
-	if (!ops || !ops->async_action_handle_query_update)
-		return -ENOTSUP;
-	ret = ops->async_action_handle_query_update(dev, queue_id, attr,
-						    handle, update,
-						    query, mode,
-						    user_data, error);
-	return flow_err(port_id, ret, error);
-}
-
 struct rte_flow_action_list_handle *
 rte_flow_action_list_handle_create(uint16_t port_id,
 				   const
@@ -2345,64 +2076,6 @@  rte_flow_action_list_handle_destroy(uint16_t port_id,
 	return ret;
 }
 
-struct rte_flow_action_list_handle *
-rte_flow_async_action_list_handle_create(uint16_t port_id, uint32_t queue_id,
-					 const struct rte_flow_op_attr *attr,
-					 const struct rte_flow_indir_action_conf *conf,
-					 const struct rte_flow_action *actions,
-					 void *user_data,
-					 struct rte_flow_error *error)
-{
-	int ret;
-	struct rte_eth_dev *dev;
-	const struct rte_flow_ops *ops;
-	struct rte_flow_action_list_handle *handle;
-
-	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
-	ops = rte_flow_ops_get(port_id, error);
-	if (!ops || !ops->async_action_list_handle_create) {
-		rte_flow_error_set(error, ENOTSUP,
-				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-				   "action_list handle not supported");
-		return NULL;
-	}
-	dev = &rte_eth_devices[port_id];
-	handle = ops->async_action_list_handle_create(dev, queue_id, attr, conf,
-						      actions, user_data,
-						      error);
-	ret = flow_err(port_id, -rte_errno, error);
-	rte_flow_trace_async_action_list_handle_create(port_id, queue_id, attr,
-						       conf, actions, user_data,
-						       ret);
-	return handle;
-}
-
-int
-rte_flow_async_action_list_handle_destroy(uint16_t port_id, uint32_t queue_id,
-				 const struct rte_flow_op_attr *op_attr,
-				 struct rte_flow_action_list_handle *handle,
-				 void *user_data, struct rte_flow_error *error)
-{
-	int ret;
-	struct rte_eth_dev *dev;
-	const struct rte_flow_ops *ops;
-
-	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
-	ops = rte_flow_ops_get(port_id, error);
-	if (!ops || !ops->async_action_list_handle_destroy)
-		return rte_flow_error_set(error, ENOTSUP,
-					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-					  "async action_list handle not supported");
-	dev = &rte_eth_devices[port_id];
-	ret = ops->async_action_list_handle_destroy(dev, queue_id, op_attr,
-						    handle, user_data, error);
-	ret = flow_err(port_id, ret, error);
-	rte_flow_trace_async_action_list_handle_destroy(port_id, queue_id,
-							op_attr, handle,
-							user_data, ret);
-	return ret;
-}
-
 int
 rte_flow_action_list_handle_query_update(uint16_t port_id,
 			 const struct rte_flow_action_list_handle *handle,
@@ -2429,38 +2102,6 @@  rte_flow_action_list_handle_query_update(uint16_t port_id,
 	return ret;
 }
 
-int
-rte_flow_async_action_list_handle_query_update(uint16_t port_id, uint32_t queue_id,
-			 const struct rte_flow_op_attr *attr,
-			 const struct rte_flow_action_list_handle *handle,
-			 const void **update, void **query,
-			 enum rte_flow_query_update_mode mode,
-			 void *user_data, struct rte_flow_error *error)
-{
-	int ret;
-	struct rte_eth_dev *dev;
-	const struct rte_flow_ops *ops;
-
-	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
-	ops = rte_flow_ops_get(port_id, error);
-	if (!ops || !ops->async_action_list_handle_query_update)
-		return rte_flow_error_set(error, ENOTSUP,
-					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-					  "action_list async query_update not supported");
-	dev = &rte_eth_devices[port_id];
-	ret = ops->async_action_list_handle_query_update(dev, queue_id, attr,
-							 handle, update, query,
-							 mode, user_data,
-							 error);
-	ret = flow_err(port_id, ret, error);
-	rte_flow_trace_async_action_list_handle_query_update(port_id, queue_id,
-							     attr, handle,
-							     update, query,
-							     mode, user_data,
-							     ret);
-	return ret;
-}
-
 int
 rte_flow_calc_table_hash(uint16_t port_id, const struct rte_flow_template_table *table,
 			 const struct rte_flow_item pattern[], uint8_t pattern_template_index,
@@ -2481,3 +2122,277 @@  rte_flow_calc_table_hash(uint16_t port_id, const struct rte_flow_template_table
 					hash, error);
 	return flow_err(port_id, ret, error);
 }
+
+static struct rte_flow *
+dummy_async_create(void *ctx __rte_unused,
+		   uint32_t queue __rte_unused,
+		   const struct rte_flow_op_attr *attr __rte_unused,
+		   struct rte_flow_template_table *table __rte_unused,
+		   const struct rte_flow_item items[] __rte_unused,
+		   uint8_t pattern_template_index __rte_unused,
+		   const struct rte_flow_action actions[] __rte_unused,
+		   uint8_t action_template_index __rte_unused,
+		   void *user_data __rte_unused,
+		   struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			   rte_strerror(ENOSYS));
+	return NULL;
+}
+
+static struct rte_flow *
+dummy_async_create_by_index(void *ctx __rte_unused,
+			    uint32_t queue __rte_unused,
+			    const struct rte_flow_op_attr *attr __rte_unused,
+			    struct rte_flow_template_table *table __rte_unused,
+			    uint32_t rule_index __rte_unused,
+			    const struct rte_flow_action actions[] __rte_unused,
+			    uint8_t action_template_index __rte_unused,
+			    void *user_data __rte_unused,
+			    struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			   rte_strerror(ENOSYS));
+	return NULL;
+}
+
+static int
+dummy_async_actions_update(void *ctx __rte_unused,
+			   uint32_t queue_id __rte_unused,
+			   const struct rte_flow_op_attr *op_attr __rte_unused,
+			   struct rte_flow *flow __rte_unused,
+			   const struct rte_flow_action actions[] __rte_unused,
+			   uint8_t actions_template_index __rte_unused,
+			   void *user_data __rte_unused,
+			   struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_async_destroy(void *ctx __rte_unused,
+		    uint32_t queue_id __rte_unused,
+		    const struct rte_flow_op_attr *op_attr __rte_unused,
+		    struct rte_flow *flow __rte_unused,
+		    void *user_data __rte_unused,
+		    struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_push(void *ctx __rte_unused,
+	   uint32_t queue_id __rte_unused,
+	   struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_pull(void *ctx __rte_unused,
+	   uint32_t queue_id __rte_unused,
+	   struct rte_flow_op_result res[] __rte_unused,
+	   uint16_t n_res __rte_unused,
+	   struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static struct rte_flow_action_handle *
+dummy_async_action_handle_create(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *op_attr __rte_unused,
+	const struct rte_flow_indir_action_conf *indir_action_conf __rte_unused,
+	const struct rte_flow_action *action __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			   rte_strerror(ENOSYS));
+	return NULL;
+}
+
+static int
+dummy_async_action_handle_destroy(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *op_attr __rte_unused,
+	struct rte_flow_action_handle *action_handle __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_async_action_handle_update(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *op_attr __rte_unused,
+	struct rte_flow_action_handle *action_handle __rte_unused,
+	const void *update __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_async_action_handle_query(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *op_attr __rte_unused,
+	const struct rte_flow_action_handle *action_handle __rte_unused,
+	void *data __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_async_action_handle_query_update(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *attr __rte_unused,
+	struct rte_flow_action_handle *handle __rte_unused,
+	const void *update __rte_unused,
+	void *query __rte_unused,
+	enum rte_flow_query_update_mode mode __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static struct rte_flow_action_list_handle *
+dummy_async_action_list_handle_create(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *attr __rte_unused,
+	const struct rte_flow_indir_action_conf *conf __rte_unused,
+	const struct rte_flow_action *actions __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			   rte_strerror(ENOSYS));
+	return NULL;
+}
+
+static int
+dummy_async_action_list_handle_destroy(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *op_attr __rte_unused,
+	struct rte_flow_action_list_handle *handle __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+static int
+dummy_async_action_list_handle_query_update(
+	void *ctx __rte_unused,
+	uint32_t queue_id __rte_unused,
+	const struct rte_flow_op_attr *attr __rte_unused,
+	const struct rte_flow_action_list_handle *handle __rte_unused,
+	const void **update __rte_unused,
+	void **query __rte_unused,
+	enum rte_flow_query_update_mode mode __rte_unused,
+	void *user_data __rte_unused,
+	struct rte_flow_error *error)
+{
+	return rte_flow_error_set(error, ENOSYS, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				  rte_strerror(ENOSYS));
+}
+
+int
+rte_flow_fp_ops_register(uint16_t port_id, const struct rte_flow_fp_ops *ops)
+{
+	struct rte_flow_fp_ops *port_ops;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	port_ops = &rte_flow_fp_ops[port_id];
+
+	if (ops->ctx) {
+		port_ops->ctx = ops->ctx;
+	} else {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	if (ops->async_create)
+		port_ops->async_create = ops->async_create;
+	if (ops->async_create_by_index)
+		port_ops->async_create_by_index = ops->async_create_by_index;
+	if (ops->async_actions_update)
+		port_ops->async_actions_update = ops->async_actions_update;
+	if (ops->async_destroy)
+		port_ops->async_destroy = ops->async_destroy;
+	if (ops->push)
+		port_ops->push = ops->push;
+	if (ops->pull)
+		port_ops->pull = ops->pull;
+	if (ops->async_action_handle_create)
+		port_ops->async_action_handle_create = ops->async_action_handle_create;
+	if (ops->async_action_handle_destroy)
+		port_ops->async_action_handle_destroy = ops->async_action_handle_destroy;
+	if (ops->async_action_handle_update)
+		port_ops->async_action_handle_update = ops->async_action_handle_update;
+	if (ops->async_action_handle_query)
+		port_ops->async_action_handle_query = ops->async_action_handle_query;
+	if (ops->async_action_handle_query_update)
+		port_ops->async_action_handle_query_update = ops->async_action_handle_query_update;
+	if (ops->async_action_list_handle_create)
+		port_ops->async_action_list_handle_create = ops->async_action_list_handle_create;
+	if (ops->async_action_list_handle_destroy)
+		port_ops->async_action_list_handle_destroy = ops->async_action_list_handle_destroy;
+	if (ops->async_action_list_handle_query_update)
+		port_ops->async_action_list_handle_query_update =
+			ops->async_action_list_handle_query_update;
+
+	return 0;
+}
+
+void
+rte_flow_fp_ops_reset(uint16_t port_id)
+{
+	struct rte_flow_fp_ops *ops = &rte_flow_fp_ops[port_id];
+
+	ops->ctx = NULL;
+	ops->async_create = dummy_async_create;
+	ops->async_create_by_index = dummy_async_create_by_index;
+	ops->async_actions_update = dummy_async_actions_update;
+	ops->async_destroy = dummy_async_destroy;
+	ops->push = dummy_push;
+	ops->pull = dummy_pull;
+	ops->async_action_handle_create = dummy_async_action_handle_create;
+	ops->async_action_handle_destroy = dummy_async_action_handle_destroy;
+	ops->async_action_handle_update = dummy_async_action_handle_update;
+	ops->async_action_handle_query = dummy_async_action_handle_query;
+	ops->async_action_handle_query_update = dummy_async_action_handle_query_update;
+	ops->async_action_list_handle_create = dummy_async_action_list_handle_create;
+	ops->async_action_list_handle_destroy = dummy_async_action_list_handle_destroy;
+	ops->async_action_list_handle_query_update =
+		dummy_async_action_list_handle_query_update;
+}
+
+RTE_INIT(rte_flow_init_dummy_fp_ops) {
+	unsigned int port_id;
+
+	for (port_id = 0; port_id < RTE_MAX_ETHPORTS; ++port_id)
+		rte_flow_fp_ops_reset(port_id);
+}
diff --git a/lib/ethdev/rte_flow.h b/lib/ethdev/rte_flow.h
index 78b6bbb159..ccb37c70f1 100644
--- a/lib/ethdev/rte_flow.h
+++ b/lib/ethdev/rte_flow.h
@@ -5921,6 +5921,195 @@  rte_flow_group_set_miss_actions(uint16_t port_id,
 				const struct rte_flow_action actions[],
 				struct rte_flow_error *error);
 
+/* Forward declaration for callback typedef definition. */
+struct rte_flow_op_attr;
+
+/** @internal Enqueue rule creation operation. */
+typedef struct rte_flow *(*rte_flow_async_create_t)(void *ctx,
+						    uint32_t queue,
+						    const struct rte_flow_op_attr *attr,
+						    struct rte_flow_template_table *table,
+						    const struct rte_flow_item *items,
+						    uint8_t pattern_template_index,
+						    const struct rte_flow_action *actions,
+						    uint8_t action_template_index,
+						    void *user_data,
+						    struct rte_flow_error *error);
+
+/** @internal Enqueue rule creation by index operation. */
+typedef struct rte_flow *(*rte_flow_async_create_by_index_t)(void *ctx,
+							     uint32_t queue,
+							     const struct rte_flow_op_attr *attr,
+							     struct rte_flow_template_table *table,
+							     uint32_t rule_index,
+							     const struct rte_flow_action *actions,
+							     uint8_t action_template_index,
+							     void *user_data,
+							     struct rte_flow_error *error);
+
+/** @internal Enqueue rule update operation. */
+typedef int (*rte_flow_async_actions_update_t)(void *ctx,
+					       uint32_t queue_id,
+					       const struct rte_flow_op_attr *op_attr,
+					       struct rte_flow *flow,
+					       const struct rte_flow_action *actions,
+					       uint8_t actions_template_index,
+					       void *user_data,
+					       struct rte_flow_error *error);
+
+/** @internal Enqueue rule destruction operation. */
+typedef int (*rte_flow_async_destroy_t)(void *ctx,
+					uint32_t queue_id,
+					const struct rte_flow_op_attr *op_attr,
+					struct rte_flow *flow,
+					void *user_data,
+					struct rte_flow_error *error);
+
+/** @internal Push all internally stored rules to the HW. */
+typedef int (*rte_flow_push_t)(void *ctx,
+			       uint32_t queue_id,
+			       struct rte_flow_error *error);
+
+/* Forward declaration for callback typedef definition. */
+struct rte_flow_op_result;
+
+/** @internal Pull the flow rule operations results from the HW. */
+typedef int (*rte_flow_pull_t)(void *ctx,
+			       uint32_t queue_id,
+			       struct rte_flow_op_result *res,
+			       uint16_t n_res,
+			       struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action creation operation. */
+typedef struct rte_flow_action_handle *(*rte_flow_async_action_handle_create_t)(
+					void *ctx,
+					uint32_t queue_id,
+					const struct rte_flow_op_attr *op_attr,
+					const struct rte_flow_indir_action_conf *indir_action_conf,
+					const struct rte_flow_action *action,
+					void *user_data,
+					struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action destruction operation. */
+typedef int (*rte_flow_async_action_handle_destroy_t)(void *ctx,
+						      uint32_t queue_id,
+						      const struct rte_flow_op_attr *op_attr,
+						      struct rte_flow_action_handle *action_handle,
+						      void *user_data,
+						      struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action update operation. */
+typedef int (*rte_flow_async_action_handle_update_t)(void *ctx,
+						     uint32_t queue_id,
+						     const struct rte_flow_op_attr *op_attr,
+						     struct rte_flow_action_handle *action_handle,
+						     const void *update,
+						     void *user_data,
+						     struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action query operation. */
+typedef int (*rte_flow_async_action_handle_query_t)
+		(void *ctx,
+		 uint32_t queue_id,
+		 const struct rte_flow_op_attr *op_attr,
+		 const struct rte_flow_action_handle *action_handle,
+		 void *data,
+		 void *user_data,
+		 struct rte_flow_error *error);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query and update operational mode.
+ *
+ * @see rte_flow_action_handle_query_update()
+ * @see rte_flow_async_action_handle_query_update()
+ */
+enum rte_flow_query_update_mode {
+	/* Default query_update operational mode.
+	 * If both `update` and `query` parameters are not NULL
+	 * the call updates and queries action in default port order.
+	 * If `update` parameter is NULL the call queries action.
+	 * If `query` parameter is NULL the call updates action.
+	 */
+	RTE_FLOW_QU_DEFAULT,
+	/* Force port to query action before update. */
+	RTE_FLOW_QU_QUERY_FIRST,
+	/* Force port to update action before query. */
+	RTE_FLOW_QU_UPDATE_FIRST,
+};
+
+/** @internal Enqueue indirect action query and/or update operation. */
+typedef int (*rte_flow_async_action_handle_query_update_t)(void *ctx,
+							   uint32_t queue_id,
+							   const struct rte_flow_op_attr *attr,
+							   struct rte_flow_action_handle *handle,
+							   const void *update, void *query,
+							   enum rte_flow_query_update_mode mode,
+							   void *user_data,
+							   struct rte_flow_error *error);
+
+/* Forward declaration. */
+struct rte_flow_action_list_handle;
+
+/** @internal Enqueue indirect action list creation operation. */
+typedef struct rte_flow_action_list_handle *(*rte_flow_async_action_list_handle_create_t)(
+	void *ctx,
+	uint32_t queue_id,
+	const struct rte_flow_op_attr *attr,
+	const struct rte_flow_indir_action_conf *conf,
+	const struct rte_flow_action *actions,
+	void *user_data,
+	struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action list destruction operation. */
+typedef int (*rte_flow_async_action_list_handle_destroy_t)(
+	void *ctx,
+	uint32_t queue_id,
+	const struct rte_flow_op_attr *op_attr,
+	struct rte_flow_action_list_handle *handle,
+	void *user_data,
+	struct rte_flow_error *error);
+
+/** @internal Enqueue indirect action list query and/or update operation. */
+typedef int (*rte_flow_async_action_list_handle_query_update_t)(
+	void *ctx,
+	uint32_t queue_id,
+	const struct rte_flow_op_attr *attr,
+	const struct rte_flow_action_list_handle *handle,
+	const void **update,
+	void **query,
+	enum rte_flow_query_update_mode mode,
+	void *user_data,
+	struct rte_flow_error *error);
+
+/**
+ * @internal
+ *
+ * Fast path async flow functions and related data are held in a flat array, one entry per ethdev.
+ * It is assumed that each entry is read-only and cache aligned.
+ */
+struct rte_flow_fp_ops {
+	void *ctx;
+	rte_flow_async_create_t async_create;
+	rte_flow_async_create_by_index_t async_create_by_index;
+	rte_flow_async_actions_update_t async_actions_update;
+	rte_flow_async_destroy_t async_destroy;
+	rte_flow_push_t push;
+	rte_flow_pull_t pull;
+	rte_flow_async_action_handle_create_t async_action_handle_create;
+	rte_flow_async_action_handle_destroy_t async_action_handle_destroy;
+	rte_flow_async_action_handle_update_t async_action_handle_update;
+	rte_flow_async_action_handle_query_t async_action_handle_query;
+	rte_flow_async_action_handle_query_update_t async_action_handle_query_update;
+	rte_flow_async_action_list_handle_create_t async_action_list_handle_create;
+	rte_flow_async_action_list_handle_destroy_t async_action_list_handle_destroy;
+	rte_flow_async_action_list_handle_query_update_t async_action_list_handle_query_update;
+} __rte_cache_aligned;
+
+extern struct rte_flow_fp_ops rte_flow_fp_ops[RTE_MAX_ETHPORTS];
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice.
@@ -5973,7 +6162,7 @@  struct rte_flow_op_attr {
  *   Only completion result indicates that if there was success or failure.
  */
 __rte_experimental
-struct rte_flow *
+static inline struct rte_flow *
 rte_flow_async_create(uint16_t port_id,
 		      uint32_t queue_id,
 		      const struct rte_flow_op_attr *op_attr,
@@ -5983,7 +6172,31 @@  rte_flow_async_create(uint16_t port_id,
 		      const struct rte_flow_action actions[],
 		      uint8_t actions_template_index,
 		      void *user_data,
-		      struct rte_flow_error *error);
+		      struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	struct rte_flow *flow;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS) {
+		rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				   rte_strerror(ENODEV));
+		return NULL;
+	}
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	flow = ops->async_create(ops->ctx, queue_id,
+				 op_attr, template_table,
+				 pattern, pattern_template_index,
+				 actions, actions_template_index,
+				 user_data, error);
+
+	rte_flow_trace_async_create(port_id, queue_id, op_attr, template_table,
+				    pattern, pattern_template_index, actions,
+				    actions_template_index, user_data, flow);
+	return flow;
+}
 
 /**
  * @warning
@@ -6018,7 +6231,7 @@  rte_flow_async_create(uint16_t port_id,
  *   Only completion result indicates that if there was success or failure.
  */
 __rte_experimental
-struct rte_flow *
+static inline struct rte_flow *
 rte_flow_async_create_by_index(uint16_t port_id,
 			       uint32_t queue_id,
 			       const struct rte_flow_op_attr *op_attr,
@@ -6027,7 +6240,31 @@  rte_flow_async_create_by_index(uint16_t port_id,
 			       const struct rte_flow_action actions[],
 			       uint8_t actions_template_index,
 			       void *user_data,
-			       struct rte_flow_error *error);
+			       struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	struct rte_flow *flow;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS) {
+		rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				   rte_strerror(ENODEV));
+		return NULL;
+	}
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	flow = ops->async_create_by_index(ops->ctx, queue_id,
+					  op_attr, template_table,
+					  rule_index,
+					  actions, actions_template_index,
+					  user_data, error);
+
+	rte_flow_trace_async_create_by_index(port_id, queue_id, op_attr, template_table, rule_index,
+					     actions, actions_template_index, user_data, flow);
+
+	return flow;
+}
 
 /**
  * @warning
@@ -6059,13 +6296,33 @@  rte_flow_async_create_by_index(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_destroy(uint16_t port_id,
 		       uint32_t queue_id,
 		       const struct rte_flow_op_attr *op_attr,
 		       struct rte_flow *flow,
 		       void *user_data,
-		       struct rte_flow_error *error);
+		       struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_destroy(ops->ctx, queue_id,
+				 op_attr, flow,
+				 user_data, error);
+
+	rte_flow_trace_async_destroy(port_id, queue_id, op_attr, flow,
+				     user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6096,7 +6353,7 @@  rte_flow_async_destroy(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_actions_update(uint16_t port_id,
 			      uint32_t queue_id,
 			      const struct rte_flow_op_attr *op_attr,
@@ -6104,7 +6361,29 @@  rte_flow_async_actions_update(uint16_t port_id,
 			      const struct rte_flow_action actions[],
 			      uint8_t actions_template_index,
 			      void *user_data,
-			      struct rte_flow_error *error);
+			      struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_actions_update(ops->ctx, queue_id,
+					op_attr, flow,
+					actions, actions_template_index,
+					user_data, error);
+
+	rte_flow_trace_async_actions_update(port_id, queue_id, op_attr, flow,
+					    actions, actions_template_index,
+					    user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6127,10 +6406,27 @@  rte_flow_async_actions_update(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_push(uint16_t port_id,
 	      uint32_t queue_id,
-	      struct rte_flow_error *error);
+	      struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->push(ops->ctx, queue_id, error);
+
+	rte_flow_trace_push(port_id, queue_id, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6193,12 +6489,29 @@  struct rte_flow_op_result {
  *   a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_pull(uint16_t port_id,
 	      uint32_t queue_id,
 	      struct rte_flow_op_result res[],
 	      uint16_t n_res,
-	      struct rte_flow_error *error);
+	      struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->pull(ops->ctx, queue_id, res, n_res, error);
+
+	rte_flow_trace_pull(port_id, queue_id, res, n_res, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6227,14 +6540,37 @@  rte_flow_pull(uint16_t port_id,
  *   A valid handle in case of success, NULL otherwise and rte_errno is set.
  */
 __rte_experimental
-struct rte_flow_action_handle *
+static inline struct rte_flow_action_handle *
 rte_flow_async_action_handle_create(uint16_t port_id,
 		uint32_t queue_id,
 		const struct rte_flow_op_attr *op_attr,
 		const struct rte_flow_indir_action_conf *indir_action_conf,
 		const struct rte_flow_action *action,
 		void *user_data,
-		struct rte_flow_error *error);
+		struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	struct rte_flow_action_handle *handle;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS) {
+		rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				   rte_strerror(ENODEV));
+		return NULL;
+	}
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	handle = ops->async_action_handle_create(ops->ctx, queue_id,
+						 op_attr, indir_action_conf, action,
+						 user_data, error);
+
+	rte_flow_trace_async_action_handle_create(port_id, queue_id, op_attr,
+						  indir_action_conf, action,
+						  user_data, handle);
+
+	return handle;
+}
 
 /**
  * @warning
@@ -6262,13 +6598,32 @@  rte_flow_async_action_handle_create(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_handle_destroy(uint16_t port_id,
 		uint32_t queue_id,
 		const struct rte_flow_op_attr *op_attr,
 		struct rte_flow_action_handle *action_handle,
 		void *user_data,
-		struct rte_flow_error *error);
+		struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_handle_destroy(ops->ctx, queue_id, op_attr, action_handle,
+					       user_data, error);
+
+	rte_flow_trace_async_action_handle_destroy(port_id, queue_id, op_attr,
+						   action_handle, user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6301,14 +6656,34 @@  rte_flow_async_action_handle_destroy(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_handle_update(uint16_t port_id,
 		uint32_t queue_id,
 		const struct rte_flow_op_attr *op_attr,
 		struct rte_flow_action_handle *action_handle,
 		const void *update,
 		void *user_data,
-		struct rte_flow_error *error);
+		struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_handle_update(ops->ctx, queue_id, op_attr, action_handle, update,
+					      user_data, error);
+
+	rte_flow_trace_async_action_handle_update(port_id, queue_id, op_attr,
+						  action_handle, update,
+						  user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6345,28 +6720,34 @@  rte_flow_async_action_handle_update(uint16_t port_id,
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_handle_query(uint16_t port_id,
 		uint32_t queue_id,
 		const struct rte_flow_op_attr *op_attr,
 		const struct rte_flow_action_handle *action_handle,
 		void *data,
 		void *user_data,
-		struct rte_flow_error *error);
+		struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
 
-/**
- * @warning
- * @b EXPERIMENTAL: this API may change without prior notice.
- *
- * Query and update operational mode.
- *
- * @see rte_flow_action_handle_query_update()
- * @see rte_flow_async_action_handle_query_update()
- */
-enum rte_flow_query_update_mode {
-	RTE_FLOW_QU_QUERY_FIRST = 1,  /**< Query before update. */
-	RTE_FLOW_QU_UPDATE_FIRST,     /**< Query after  update. */
-};
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_handle_query(ops->ctx, queue_id, op_attr, action_handle, data,
+					      user_data, error);
+
+	rte_flow_trace_async_action_handle_query(port_id, queue_id, op_attr,
+						 action_handle, data, user_data,
+						 ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6445,16 +6826,35 @@  rte_flow_action_handle_query_update(uint16_t port_id,
  *             both *update* and *query* are NULL.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_handle_query_update(uint16_t port_id, uint32_t queue_id,
 					  const struct rte_flow_op_attr *attr,
 					  struct rte_flow_action_handle *handle,
 					  const void *update, void *query,
 					  enum rte_flow_query_update_mode mode,
 					  void *user_data,
-					  struct rte_flow_error *error);
+					  struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
 
-struct rte_flow_action_list_handle;
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_handle_query_update(ops->ctx, queue_id, attr, handle,
+						    update, query, mode,
+						    user_data, error);
+
+	rte_flow_trace_async_action_handle_query_update(port_id, queue_id, attr,
+							handle, update, query, mode,
+							user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6548,13 +6948,36 @@  rte_flow_action_list_handle_create(uint16_t port_id,
  *   - (-ENOTSUP) if *action* list element valid but unsupported.
  */
 __rte_experimental
-struct rte_flow_action_list_handle *
+static inline struct rte_flow_action_list_handle *
 rte_flow_async_action_list_handle_create(uint16_t port_id, uint32_t queue_id,
 					 const struct rte_flow_op_attr *attr,
 					 const struct rte_flow_indir_action_conf *conf,
 					 const struct rte_flow_action *actions,
 					 void *user_data,
-					 struct rte_flow_error *error);
+					 struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	struct rte_flow_action_list_handle *handle;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)  {
+		rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				   rte_strerror(ENODEV));
+		return NULL;
+	}
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	handle = ops->async_action_list_handle_create(ops->ctx, queue_id,
+						      attr, conf, actions,
+						      user_data, error);
+
+	rte_flow_trace_async_action_list_handle_create(port_id, queue_id, attr,
+						       conf, actions, user_data,
+						       handle);
+
+	return handle;
+}
 
 /**
  * @warning
@@ -6614,12 +7037,32 @@  rte_flow_action_list_handle_destroy(uint16_t port_id,
  *   - (-EBUSY) if actions list pointed by *action* handle still used
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_list_handle_destroy
 		(uint16_t port_id, uint32_t queue_id,
 		 const struct rte_flow_op_attr *op_attr,
 		 struct rte_flow_action_list_handle *handle,
-		 void *user_data, struct rte_flow_error *error);
+		 void *user_data, struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_list_handle_destroy(ops->ctx, queue_id, op_attr, handle,
+						    user_data, error);
+
+	rte_flow_trace_async_action_list_handle_destroy(port_id, queue_id,
+							op_attr, handle,
+							user_data, ret);
+
+	return ret;
+}
 
 /**
  * @warning
@@ -6709,14 +7152,37 @@  rte_flow_action_list_handle_query_update(uint16_t port_id,
  *             both *update* and *query* are NULL.
  */
 __rte_experimental
-int
+static inline int
 rte_flow_async_action_list_handle_query_update(uint16_t port_id, uint32_t queue_id,
 					  const struct rte_flow_op_attr *attr,
 					  const struct rte_flow_action_list_handle *handle,
 					  const void **update, void **query,
 					  enum rte_flow_query_update_mode mode,
 					  void *user_data,
-					  struct rte_flow_error *error);
+					  struct rte_flow_error *error)
+{
+	struct rte_flow_fp_ops *ops;
+	int ret;
+
+#ifdef RTE_FLOW_DEBUG
+	if (port_id >= RTE_MAX_ETHPORTS)
+		return rte_flow_error_set(error, ENODEV, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  rte_strerror(ENODEV));
+#endif
+
+	ops = &rte_flow_fp_ops[port_id];
+	ret = ops->async_action_list_handle_query_update(ops->ctx, queue_id, attr, handle,
+							 update, query, mode,
+							 user_data, error);
+
+	rte_flow_trace_async_action_list_handle_query_update(port_id, queue_id,
+							     attr, handle,
+							     update, query,
+							     mode, user_data,
+							     ret);
+
+	return ret;
+}
 
 /**
  * @warning
diff --git a/lib/ethdev/rte_flow_driver.h b/lib/ethdev/rte_flow_driver.h
index f35f659503..d537af1693 100644
--- a/lib/ethdev/rte_flow_driver.h
+++ b/lib/ethdev/rte_flow_driver.h
@@ -234,122 +234,12 @@  struct rte_flow_ops {
 		 const struct rte_flow_group_attr *attr,
 		 const struct rte_flow_action actions[],
 		 struct rte_flow_error *err);
-	/** See rte_flow_async_create() */
-	struct rte_flow *(*async_create)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_template_table *template_table,
-		 const struct rte_flow_item pattern[],
-		 uint8_t pattern_template_index,
-		 const struct rte_flow_action actions[],
-		 uint8_t actions_template_index,
-		 void *user_data,
-		 struct rte_flow_error *err);
-	/** See rte_flow_async_create_by_index() */
-	struct rte_flow *(*async_create_by_index)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_template_table *template_table,
-		 uint32_t rule_index,
-		 const struct rte_flow_action actions[],
-		 uint8_t actions_template_index,
-		 void *user_data,
-		 struct rte_flow_error *err);
-	/** See rte_flow_async_destroy() */
-	int (*async_destroy)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow *flow,
-		 void *user_data,
-		 struct rte_flow_error *err);
-	/** See rte_flow_push() */
-	int (*push)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 struct rte_flow_error *err);
-	/** See rte_flow_pull() */
-	int (*pull)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 struct rte_flow_op_result res[],
-		 uint16_t n_res,
-		 struct rte_flow_error *error);
-	/** See rte_flow_async_action_handle_create() */
-	struct rte_flow_action_handle *(*async_action_handle_create)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 const struct rte_flow_indir_action_conf *indir_action_conf,
-		 const struct rte_flow_action *action,
-		 void *user_data,
-		 struct rte_flow_error *err);
-	/** See rte_flow_async_action_handle_destroy() */
-	int (*async_action_handle_destroy)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_action_handle *action_handle,
-		 void *user_data,
-		 struct rte_flow_error *error);
-	/** See rte_flow_async_action_handle_update() */
-	int (*async_action_handle_update)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_action_handle *action_handle,
-		 const void *update,
-		 void *user_data,
-		 struct rte_flow_error *error);
-	/** See rte_flow_async_action_handle_query() */
-	int (*async_action_handle_query)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 const struct rte_flow_action_handle *action_handle,
-		 void *data,
-		 void *user_data,
-		 struct rte_flow_error *error);
-	/** See rte_flow_async_action_handle_query_update */
-	int (*async_action_handle_query_update)
-		(struct rte_eth_dev *dev, uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_action_handle *action_handle,
-		 const void *update, void *query,
-		 enum rte_flow_query_update_mode qu_mode,
-		 void *user_data, struct rte_flow_error *error);
 	/** See rte_flow_actions_update(). */
 	int (*actions_update)
 		(struct rte_eth_dev *dev,
 		 struct rte_flow *flow,
 		 const struct rte_flow_action actions[],
 		 struct rte_flow_error *error);
-	/** See rte_flow_async_actions_update() */
-	int (*async_actions_update)
-		(struct rte_eth_dev *dev,
-		 uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow *flow,
-		 const struct rte_flow_action actions[],
-		 uint8_t actions_template_index,
-		 void *user_data,
-		 struct rte_flow_error *error);
-	/** @see rte_flow_async_action_list_handle_create() */
-	struct rte_flow_action_list_handle *
-	(*async_action_list_handle_create)
-		(struct rte_eth_dev *dev, uint32_t queue_id,
-		 const struct rte_flow_op_attr *attr,
-		 const struct rte_flow_indir_action_conf *conf,
-		 const struct rte_flow_action *actions,
-		 void *user_data, struct rte_flow_error *error);
-	/** @see rte_flow_async_action_list_handle_destroy() */
-	int (*async_action_list_handle_destroy)
-		(struct rte_eth_dev *dev, uint32_t queue_id,
-		 const struct rte_flow_op_attr *op_attr,
-		 struct rte_flow_action_list_handle *action_handle,
-		 void *user_data, struct rte_flow_error *error);
 	/** @see rte_flow_action_list_handle_query_update() */
 	int (*action_list_handle_query_update)
 		(struct rte_eth_dev *dev,
@@ -357,14 +247,6 @@  struct rte_flow_ops {
 		 const void **update, void **query,
 		 enum rte_flow_query_update_mode mode,
 		 struct rte_flow_error *error);
-	/** @see rte_flow_async_action_list_handle_query_update() */
-	int (*async_action_list_handle_query_update)
-		(struct rte_eth_dev *dev, uint32_t queue_id,
-		 const struct rte_flow_op_attr *attr,
-		 const struct rte_flow_action_list_handle *handle,
-		 const void **update, void **query,
-		 enum rte_flow_query_update_mode mode,
-		 void *user_data, struct rte_flow_error *error);
 	/** @see rte_flow_calc_table_hash() */
 	int (*flow_calc_table_hash)
 		(struct rte_eth_dev *dev, const struct rte_flow_template_table *table,
@@ -394,6 +276,32 @@  rte_flow_ops_get(uint16_t port_id, struct rte_flow_error *error);
 int
 rte_flow_restore_info_dynflag_register(void);
 
+/**
+ * Register fast path flow operations for a given port.
+ *
+ * Whenever a new device is probed the PMD is responsible for calling this function
+ * to provide relevant fast path flow operations implementation.
+ *
+ * @param port_id
+ *   Port identifier.
+ * @param ops
+ *   Pointer to fast path flow operations container.
+ *
+ * @return
+ *   0 on success. Negative errno otherwise
+ */
+int
+rte_flow_fp_ops_register(uint16_t port_id, const struct rte_flow_fp_ops *ops);
+
+/**
+ * Provides dummy callbacks for fast path flow API functions.
+ *
+ * @param port_id
+ *   Port identifier.
+ */
+void
+rte_flow_fp_ops_reset(uint16_t port_id);
+
 #ifdef __cplusplus
 }
 #endif
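
For reference, a minimal sketch of how a PMD might hook into the registration API above during device probe. Everything prefixed "mypmd_" and the "struct mypmd_priv" type are hypothetical placeholders; only rte_flow_fp_ops_register(), the rte_flow_fp_ops layout and the ENOSYS dummy behaviour come from the patch:

#include <rte_flow_driver.h>

struct mypmd_priv {
	uint16_t port_id;	/* hypothetical per-port private data */
};

static int
mypmd_async_destroy(void *ctx, uint32_t queue_id,
		    const struct rte_flow_op_attr *op_attr __rte_unused,
		    struct rte_flow *flow __rte_unused,
		    void *user_data __rte_unused,
		    struct rte_flow_error *error __rte_unused)
{
	struct mypmd_priv *priv = ctx;	/* per-port context registered below */

	/* Hypothetical body: enqueue the destroy operation on queue_id. */
	RTE_SET_USED(priv);
	RTE_SET_USED(queue_id);
	return 0;
}

/* Called from the probe path once the port and its private data exist. */
static int
mypmd_register_flow_fp_ops(uint16_t port_id, struct mypmd_priv *priv)
{
	struct rte_flow_fp_ops ops = {
		.ctx = priv,				/* mandatory per-port context */
		.async_destroy = mypmd_async_destroy,
		/* Callbacks left NULL keep the ENOSYS dummies set by ethdev. */
	};

	return rte_flow_fp_ops_register(port_id, &ops);
}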