diff mbox series

parray: introduce internal API for dynamic arrays

Message ID 20210614105839.3379790-1-thomas@monjalon.net (mailing list archive)
State Rejected
Delegated to: Thomas Monjalon
Headers show
Series parray: introduce internal API for dynamic arrays | expand

Checks

Context Check Description
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Functional fail Functional Testing issues
ci/iol-testing fail Testing issues
ci/iol-abi-testing success Testing PASS
ci/github-robot success github build: passed
ci/intel-Testing success Testing PASS
ci/Intel-compilation fail Compilation issues
ci/checkpatch warning coding style issues

Commit Message

Thomas Monjalon June 14, 2021, 10:58 a.m. UTC
Performance of access in a fixed-size array is very good
because of cache locality
and because there is a single pointer to dereference.
The only drawback is the lack of flexibility:
the size of such an array cannot be increased at runtime.

An approach to this problem is to allocate the array at runtime,
being as efficient as static arrays, but still limited to a maximum.

That's why the rte_parray API is introduced,
allowing declaration of an array of pointers which can be resized
dynamically and automatically at runtime while keeping good read performance.

After a resize, the previous array is kept until the next resize
to avoid crashes during lock-free reads.

Each element is a pointer to a dynamically allocated memory chunk.
This is not good for cache locality, but it keeps each element
at the same address no matter how the array is resized.
Cache locality could be improved with mempools.
The other drawback is having to dereference one more pointer
to read an element.

There is little locking, so the API is for internal use only.
This API may be used to completely remove some compilation-time maximums.

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
---
 MAINTAINERS                  |   1 +
 app/test/meson.build         |   2 +
 app/test/test_parray.c       | 120 ++++++++++++++++++++++++++
 lib/eal/common/meson.build   |   1 +
 lib/eal/common/rte_parray.c  | 161 +++++++++++++++++++++++++++++++++++
 lib/eal/include/meson.build  |   1 +
 lib/eal/include/rte_parray.h | 138 ++++++++++++++++++++++++++++++
 lib/eal/version.map          |   4 +
 8 files changed, 428 insertions(+)
 create mode 100644 app/test/test_parray.c
 create mode 100644 lib/eal/common/rte_parray.c
 create mode 100644 lib/eal/include/rte_parray.h
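The resize scheme described in the commit message (keep the previous array alive across one resize so a lock-free reader never dereferences freed memory) can be sketched as below. This is an illustrative simplification, not the actual rte_parray implementation from the patch; all names and the single-writer assumption are mine.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the rte_parray object from the patch. */
struct parray {
	void **array;          /* current array, read without any lock */
	void **old_array;      /* previous array, freed on the next resize */
	pthread_mutex_t mutex; /* serializes writers (alloc/free/resize) */
	int32_t size;          /* current capacity */
	int32_t count;         /* number of allocated elements */
};

static int parray_init(struct parray *pa, int32_t size)
{
	pa->array = calloc(size, sizeof(void *));
	if (pa->array == NULL)
		return -1;
	pa->old_array = NULL;
	pthread_mutex_init(&pa->mutex, NULL);
	pa->size = size;
	pa->count = 0;
	return 0;
}

/* Writer side: publish a bigger array while keeping the previous one
 * alive until the *next* resize, so a concurrent lock-free reader that
 * loaded the old pointer still dereferences valid memory. */
static int parray_grow(struct parray *pa, int32_t new_size)
{
	void **new_array;

	pthread_mutex_lock(&pa->mutex);
	new_array = calloc(new_size, sizeof(void *));
	if (new_array == NULL) {
		pthread_mutex_unlock(&pa->mutex);
		return -1;
	}
	memcpy(new_array, pa->array, pa->size * sizeof(void *));
	free(pa->old_array);       /* readers of this one are assumed gone */
	pa->old_array = pa->array; /* keep previous array one more cycle */
	pa->array = new_array;     /* publish the new array */
	pa->size = new_size;
	pthread_mutex_unlock(&pa->mutex);
	return 0;
}
```

Note the weak guarantee this encodes: a reader holding the old pointer is safe across exactly one resize, which is why the thread below discusses whether heavier synchronization (RCU, locks) would be needed in a general MT setting.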

Comments

Morten Brørup June 14, 2021, 12:22 p.m. UTC | #1
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Monday, 14 June 2021 12.59
> 
> Performance of access in a fixed-size array is very good
> because of cache locality
> and because there is a single pointer to dereference.
> The only drawback is the lack of flexibility:
> the size of such an array cannot be increase at runtime.
> 
> An approach to this problem is to allocate the array at runtime,
> being as efficient as static arrays, but still limited to a maximum.
> 
> That's why the API rte_parray is introduced,
> allowing to declare an array of pointer which can be resized
> dynamically
> and automatically at runtime while keeping a good read performance.
> 
> After resize, the previous array is kept until the next resize
> to avoid crashs during a read without any lock.
> 
> Each element is a pointer to a memory chunk dynamically allocated.
> This is not good for cache locality but it allows to keep the same
> memory per element, no matter how the array is resized.
> Cache locality could be improved with mempools.
> The other drawback is having to dereference one more pointer
> to read an element.
> 
> There is not much locks, so the API is for internal use only.
> This API may be used to completely remove some compilation-time
> maximums.

I get the purpose and overall intention of this library.

I probably already mentioned that I prefer "embedded style programming" with fixed-size arrays rather than runtime configurability. It's my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile-time configurability, so there is no way for me to stop this progress, and I do not intend to oppose this library. :-)

This library is likely to become a core library of DPDK, so I think it is important to get it right. Could you please mention a few examples of where you think this internal library should be used, and where it should not be used? Then it is easier to discuss whether the borderline between control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in the fast path.

If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their application specific per-port runtime data, and this library could serve that purpose too.

[snip]

> +
> +/** Main object representing a dynamic array of pointers. */
> +struct rte_parray {
> +	/** Array of pointer to dynamically allocated struct. */
> +	void **array;
> +	/** Old array before resize, freed on next resize. */
> +	void **old_array;
> +	/* Lock for alloc/free operations. */
> +	pthread_mutex_t mutex;
> +	/** Current size of the full array. */
> +	int32_t size;
> +	/** Number of allocated elements. */
> +	int32_t count;
> +	/** Last allocated element. */
> +	int32_t last;
> +};

Why not uint32_t for size, count and last?

Consider if the hot members of the struct should be moved closer together, for increasing the probability that they end up in the same cache line if the structure is not cache line aligned. Probably not important, just wanted to mention it.

-Morten
Bruce Richardson June 14, 2021, 1:15 p.m. UTC | #2
On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > Sent: Monday, 14 June 2021 12.59
> > 
> > Performance of access in a fixed-size array is very good
> > because of cache locality
> > and because there is a single pointer to dereference.
> > The only drawback is the lack of flexibility:
> > the size of such an array cannot be increase at runtime.
> > 
> > An approach to this problem is to allocate the array at runtime,
> > being as efficient as static arrays, but still limited to a maximum.
> > 
> > That's why the API rte_parray is introduced,
> > allowing to declare an array of pointer which can be resized
> > dynamically
> > and automatically at runtime while keeping a good read performance.
> > 
> > After resize, the previous array is kept until the next resize
> > to avoid crashs during a read without any lock.
> > 
> > Each element is a pointer to a memory chunk dynamically allocated.
> > This is not good for cache locality but it allows to keep the same
> > memory per element, no matter how the array is resized.
> > Cache locality could be improved with mempools.
> > The other drawback is having to dereference one more pointer
> > to read an element.
> > 
> > There is not much locks, so the API is for internal use only.
> > This API may be used to completely remove some compilation-time
> > maximums.
> 
> I get the purpose and overall intention of this library.
> 
> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for me to stop this progress, and I do not intend to oppose to this library. :-)
> 
> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in the fast path.
> 
> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their application specific per-port runtime data, and this library could serve that purpose too.
> 

Thanks Thomas for starting this discussion and Morten for follow-up.

My thinking is as follows, and I'm particularly keeping in mind the cases
of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.

While I dislike the hard-coded limits in DPDK, I'm also not convinced that
we should switch away from the flat arrays or that we need fully dynamic
arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
house here, where we keep the ethdevs as an array, but one allocated/sized
at runtime rather than statically. This would allow us to have a
compile-time default value, but, for use cases that need it, allow use of a
flag e.g.  "max-ethdevs" to change the size of the parameter given to the
malloc call for the array.  This max limit could then be provided to apps
too if they want to match any array sizes. [Alternatively those apps could
check the provided size and error out if the size has been increased beyond
what the app is designed to use?]. There would be no extra dereferences per
rx/tx burst call in this scenario so performance should be the same as
before (potentially better if array is in hugepage memory, I suppose).
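The half-way house described above, a flat array sized once at startup from a compile-time default that an init-time flag can raise, could look roughly like this. The names here (DEFAULT_MAX_ETHDEVS, the "--max-ethdevs" flag) are hypothetical, not actual DPDK symbols.

```c
#include <assert.h>
#include <stdlib.h>

/* Compile-time default, playing the role of RTE_MAX_ETHPORTS. */
#define DEFAULT_MAX_ETHDEVS 32

struct ethdev_slot {
	void *dev_private; /* per-device data would live here */
};

static struct ethdev_slot *eth_devs; /* flat array, allocated once */
static unsigned int max_ethdevs;     /* runtime limit, visible to apps */

/* One-time init; 'requested' would come from a hypothetical
 * "--max-ethdevs" flag, with 0 meaning "use the compile-time default".
 * After this call, lookups are plain indexing with no extra
 * indirection compared to a static array. */
static int ethdev_array_init(unsigned int requested)
{
	max_ethdevs = requested != 0 ? requested : DEFAULT_MAX_ETHDEVS;
	eth_devs = calloc(max_ethdevs, sizeof(*eth_devs));
	return eth_devs == NULL ? -1 : 0;
}
```

Apps that bake the limit into their own per-port arrays could compare `max_ethdevs` against their expectation at startup and error out, as suggested above.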

Regards,
/Bruce
Thomas Monjalon June 14, 2021, 1:28 p.m. UTC | #3
14/06/2021 14:22, Morten Brørup:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > Sent: Monday, 14 June 2021 12.59
> > 
> > Performance of access in a fixed-size array is very good
> > because of cache locality
> > and because there is a single pointer to dereference.
> > The only drawback is the lack of flexibility:
> > the size of such an array cannot be increase at runtime.
> > 
> > An approach to this problem is to allocate the array at runtime,
> > being as efficient as static arrays, but still limited to a maximum.
> > 
> > That's why the API rte_parray is introduced,
> > allowing to declare an array of pointer which can be resized
> > dynamically
> > and automatically at runtime while keeping a good read performance.
> > 
> > After resize, the previous array is kept until the next resize
> > to avoid crashs during a read without any lock.
> > 
> > Each element is a pointer to a memory chunk dynamically allocated.
> > This is not good for cache locality but it allows to keep the same
> > memory per element, no matter how the array is resized.
> > Cache locality could be improved with mempools.
> > The other drawback is having to dereference one more pointer
> > to read an element.
> > 
> > There is not much locks, so the API is for internal use only.
> > This API may be used to completely remove some compilation-time
> > maximums.
> 
> I get the purpose and overall intention of this library.
> 
> I probably already mentioned that I prefer
> "embedded style programming" with fixed size arrays,
> rather than runtime configurability.
> It's my personal opinion, and the DPDK Tech Board clearly prefers
> reducing the amount of compile time configurability,
> so there is no way for me to stop this progress,
> and I do not intend to oppose to this library. :-)

Embedded-style is highly customized and limited.
DPDK is more used in standard servers where
deployment must be easy and dynamically configurable.
That's my view on where we go, but I understand
some can have opposite goals. Thus the discussion :)

> This library is likely to become a core library of DPDK,
> so I think it is important getting it right.
> Could you please mention a few examples where
> you think this internal library should be used,

It could be used for device arrays which are managed
(alloc/free) in the main thread as part of init
and hotplug operations.
Other threads should be readers only.

> and where it should not be used.
> Then it is easier to discuss if the border line
> between control path and data plane is correct.
> E.g. this library is not intended to be used for dynamically
> sized packet queues that grow and shrink in the fast path.

Correct.
If fast path threads need to alloc/free, this is not the right API.
That's not a queue, just a growing array where each element has an index.

> If the library becomes a core DPDK library,
> it should probably be public instead of internal.
> E.g. if the library is used to make RTE_MAX_ETHPORTS dynamic
> instead of compile time fixed,
> then some applications might also need dynamically sized arrays
> for their application specific per-port runtime data,
> and this library could serve that purpose too.

It could be convenient but risky if users don't understand
the limitations well. I am not sure what to do.

> [snip]
> 
> > +
> > +/** Main object representing a dynamic array of pointers. */
> > +struct rte_parray {
> > +	/** Array of pointer to dynamically allocated struct. */
> > +	void **array;
> > +	/** Old array before resize, freed on next resize. */
> > +	void **old_array;
> > +	/* Lock for alloc/free operations. */
> > +	pthread_mutex_t mutex;
> > +	/** Current size of the full array. */
> > +	int32_t size;
> > +	/** Number of allocated elements. */
> > +	int32_t count;
> > +	/** Last allocated element. */
> > +	int32_t last;
> > +};
> 
> Why not uint32_t for size, count and last?

2 reasons:
1/ anyway we are limited to int32_t for the index.
2/ having the same type for all avoids compiler complaints
when comparing values.

> Consider if the hot members of the struct should be moved
> closer together, for increasing the probability
> that they end up in the same cache line
> if the structure is not cache line aligned. Probably not important,
> just wanted to mention it.

The only hot member is the array itself.
Depending on mutex implementation, all could be in a single cacheline.
Thomas Monjalon June 14, 2021, 1:32 p.m. UTC | #4
14/06/2021 15:15, Bruce Richardson:
> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > Sent: Monday, 14 June 2021 12.59
> > > 
> > > Performance of access in a fixed-size array is very good
> > > because of cache locality
> > > and because there is a single pointer to dereference.
> > > The only drawback is the lack of flexibility:
> > > the size of such an array cannot be increase at runtime.
> > > 
> > > An approach to this problem is to allocate the array at runtime,
> > > being as efficient as static arrays, but still limited to a maximum.
> > > 
> > > That's why the API rte_parray is introduced,
> > > allowing to declare an array of pointer which can be resized
> > > dynamically
> > > and automatically at runtime while keeping a good read performance.
> > > 
> > > After resize, the previous array is kept until the next resize
> > > to avoid crashs during a read without any lock.
> > > 
> > > Each element is a pointer to a memory chunk dynamically allocated.
> > > This is not good for cache locality but it allows to keep the same
> > > memory per element, no matter how the array is resized.
> > > Cache locality could be improved with mempools.
> > > The other drawback is having to dereference one more pointer
> > > to read an element.
> > > 
> > > There is not much locks, so the API is for internal use only.
> > > This API may be used to completely remove some compilation-time
> > > maximums.
> > 
> > I get the purpose and overall intention of this library.
> > 
> > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for me to stop this progress, and I do not intend to oppose to this library. :-)
> > 
> > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in the fast path.
> > 
> > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their application specific per-port runtime data, and this library could serve that purpose too.
> > 
> 
> Thanks Thomas for starting this discussion and Morten for follow-up.
> 
> My thinking is as follows, and I'm particularly keeping in mind the cases
> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> 
> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> we should switch away from the flat arrays or that we need fully dynamic
> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> house here, where we keep the ethdevs as an array, but one allocated/sized
> at runtime rather than statically. This would allow us to have a
> compile-time default value, but, for use cases that need it, allow use of a
> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> malloc call for the array.  This max limit could then be provided to apps
> too if they want to match any array sizes. [Alternatively those apps could
> check the provided size and error out if the size has been increased beyond
> what the app is designed to use?]. There would be no extra dereferences per
> rx/tx burst call in this scenario so performance should be the same as
> before (potentially better if array is in hugepage memory, I suppose).

I think we need some benchmarks to decide what is the best tradeoff.
I spent time on this implementation, but sorry I won't have time for benchmarks.
Volunteers?
Ananyev, Konstantin June 14, 2021, 2:59 p.m. UTC | #5
> 
> 14/06/2021 15:15, Bruce Richardson:
> > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > Sent: Monday, 14 June 2021 12.59
> > > >
> > > > Performance of access in a fixed-size array is very good
> > > > because of cache locality
> > > > and because there is a single pointer to dereference.
> > > > The only drawback is the lack of flexibility:
> > > > the size of such an array cannot be increase at runtime.
> > > >
> > > > An approach to this problem is to allocate the array at runtime,
> > > > being as efficient as static arrays, but still limited to a maximum.
> > > >
> > > > That's why the API rte_parray is introduced,
> > > > allowing to declare an array of pointer which can be resized
> > > > dynamically
> > > > and automatically at runtime while keeping a good read performance.
> > > >
> > > > After resize, the previous array is kept until the next resize
> > > > to avoid crashs during a read without any lock.
> > > >
> > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > This is not good for cache locality but it allows to keep the same
> > > > memory per element, no matter how the array is resized.
> > > > Cache locality could be improved with mempools.
> > > > The other drawback is having to dereference one more pointer
> > > > to read an element.
> > > >
> > > > There is not much locks, so the API is for internal use only.
> > > > This API may be used to completely remove some compilation-time
> > > > maximums.
> > >
> > > I get the purpose and overall intention of this library.
> > >
> > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's
> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
> me to stop this progress, and I do not intend to oppose to this library. :-)
> > >
> > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples
> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in
> the fast path.
> > >
> > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> application specific per-port runtime data, and this library could serve that purpose too.
> > >
> >
> > Thanks Thomas for starting this discussion and Morten for follow-up.
> >
> > My thinking is as follows, and I'm particularly keeping in mind the cases
> > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >
> > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > we should switch away from the flat arrays or that we need fully dynamic
> > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > house here, where we keep the ethdevs as an array, but one allocated/sized
> > at runtime rather than statically. This would allow us to have a
> > compile-time default value, but, for use cases that need it, allow use of a
> > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > malloc call for the array.  This max limit could then be provided to apps
> > too if they want to match any array sizes. [Alternatively those apps could
> > check the provided size and error out if the size has been increased beyond
> > what the app is designed to use?]. There would be no extra dereferences per
> > rx/tx burst call in this scenario so performance should be the same as
> > before (potentially better if array is in hugepage memory, I suppose).
> 
> I think we need some benchmarks to decide what is the best tradeoff.
> I spent time on this implementation, but sorry I won't have time for benchmarks.
> Volunteers?
 
I had only a quick look at your approach so far.
But from what I can read, in an MT environment your suggestion will require
extra synchronization for each read-write access to such a parray element (lock, rcu, ...).
I think what Bruce suggests will be much lighter, easier to implement and less error prone.
At least for rte_ethdevs[] and friends.
Konstantin
Morten Brørup June 14, 2021, 3:48 p.m. UTC | #6
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Monday, 14 June 2021 15.32
> 
> 14/06/2021 15:15, Bruce Richardson:
> > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> Monjalon
> > > > Sent: Monday, 14 June 2021 12.59
> > > >
> > > > Performance of access in a fixed-size array is very good
> > > > because of cache locality
> > > > and because there is a single pointer to dereference.
> > > > The only drawback is the lack of flexibility:
> > > > the size of such an array cannot be increase at runtime.
> > > >
> > > > An approach to this problem is to allocate the array at runtime,
> > > > being as efficient as static arrays, but still limited to a
> maximum.
> > > >
> > > > That's why the API rte_parray is introduced,
> > > > allowing to declare an array of pointer which can be resized
> > > > dynamically
> > > > and automatically at runtime while keeping a good read
> performance.
> > > >
> > > > After resize, the previous array is kept until the next resize
> > > > to avoid crashs during a read without any lock.
> > > >
> > > > Each element is a pointer to a memory chunk dynamically
> allocated.
> > > > This is not good for cache locality but it allows to keep the
> same
> > > > memory per element, no matter how the array is resized.
> > > > Cache locality could be improved with mempools.
> > > > The other drawback is having to dereference one more pointer
> > > > to read an element.
> > > >
> > > > There is not much locks, so the API is for internal use only.
> > > > This API may be used to completely remove some compilation-time
> > > > maximums.
> > >
> > > I get the purpose and overall intention of this library.
> > >
> > > I probably already mentioned that I prefer "embedded style
> programming" with fixed size arrays, rather than runtime
> configurability. It's my personal opinion, and the DPDK Tech Board
> clearly prefers reducing the amount of compile time configurability, so
> there is no way for me to stop this progress, and I do not intend to
> oppose to this library. :-)
> > >
> > > This library is likely to become a core library of DPDK, so I think
> it is important getting it right. Could you please mention a few
> examples where you think this internal library should be used, and
> where it should not be used. Then it is easier to discuss if the border
> line between control path and data plane is correct. E.g. this library
> is not intended to be used for dynamically sized packet queues that
> grow and shrink in the fast path.
> > >
> > > If the library becomes a core DPDK library, it should probably be
> public instead of internal. E.g. if the library is used to make
> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
> applications might also need dynamically sized arrays for their
> application specific per-port runtime data, and this library could
> serve that purpose too.
> > >
> >
> > Thanks Thomas for starting this discussion and Morten for follow-up.
> >
> > My thinking is as follows, and I'm particularly keeping in mind the
> cases
> > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >
> > While I dislike the hard-coded limits in DPDK, I'm also not convinced
> that
> > we should switch away from the flat arrays or that we need fully
> dynamic
> > arrays that grow/shrink at runtime for ethdevs. I would suggest a
> half-way
> > house here, where we keep the ethdevs as an array, but one
> allocated/sized
> > at runtime rather than statically. This would allow us to have a
> > compile-time default value, but, for use cases that need it, allow
> use of a
> > flag e.g.  "max-ethdevs" to change the size of the parameter given to
> the
> > malloc call for the array.  This max limit could then be provided to
> apps
> > too if they want to match any array sizes. [Alternatively those apps
> could
> > check the provided size and error out if the size has been increased
> beyond
> > what the app is designed to use?]. There would be no extra
> dereferences per
> > rx/tx burst call in this scenario so performance should be the same
> as
> > before (potentially better if array is in hugepage memory, I
> suppose).

If performance can be improved by allocating array memory differently, we can just allocate memory differently - dynamically sized arrays are not required. :-)

> 
> I think we need some benchmarks to decide what is the best tradeoff.

While performance is always important, the DPDK community seems willing to trade in a little bit of performance for obtaining some other great benefit. I agree with this pragmatic approach. However, the word "tradeoff" triggered another line of thinking:

Regarding this library, we must carefully consider if the benefit is worth the added complexity. We shouldn't introduce additional complexity only to save a few MB of memory, and for the pure principle of avoiding compile time configuration parameters.

It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
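The simpler scheme suggested here, an oversized fixed array plus a runtime variable bounding iteration, could be sketched as follows. The names and the 1024 bound are illustrative assumptions, not actual DPDK definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Big enough" compile-time bound (illustrative value). */
#define MAX_ETHPORTS 1024

static void *ports[MAX_ETHPORTS]; /* statically sized port table */
static uint16_t rte_max_ethports; /* number of populated entries */

/* Iteration stops at the populated count instead of the full bound,
 * so the oversized array costs memory but not iteration time. */
static unsigned int count_active_ports(void)
{
	unsigned int n = 0;

	for (uint16_t i = 0; i < rte_max_ethports; i++)
		if (ports[i] != NULL)
			n++;
	return n;
}
```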

Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?

> I spent time on this implementation, but sorry I won't have time for
> benchmarks.
> Volunteers?
Jerin Jacob June 14, 2021, 3:48 p.m. UTC | #7
On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> >
> > 14/06/2021 15:15, Bruce Richardson:
> > > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > > Sent: Monday, 14 June 2021 12.59
> > > > >
> > > > > Performance of access in a fixed-size array is very good
> > > > > because of cache locality
> > > > > and because there is a single pointer to dereference.
> > > > > The only drawback is the lack of flexibility:
> > > > > the size of such an array cannot be increase at runtime.
> > > > >
> > > > > An approach to this problem is to allocate the array at runtime,
> > > > > being as efficient as static arrays, but still limited to a maximum.
> > > > >
> > > > > That's why the API rte_parray is introduced,
> > > > > allowing to declare an array of pointer which can be resized
> > > > > dynamically
> > > > > and automatically at runtime while keeping a good read performance.
> > > > >
> > > > > After resize, the previous array is kept until the next resize
> > > > > to avoid crashs during a read without any lock.
> > > > >
> > > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > > This is not good for cache locality but it allows to keep the same
> > > > > memory per element, no matter how the array is resized.
> > > > > Cache locality could be improved with mempools.
> > > > > The other drawback is having to dereference one more pointer
> > > > > to read an element.
> > > > >
> > > > > There is not much locks, so the API is for internal use only.
> > > > > This API may be used to completely remove some compilation-time
> > > > > maximums.
> > > >
> > > > I get the purpose and overall intention of this library.
> > > >
> > > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's
> > my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
> > me to stop this progress, and I do not intend to oppose to this library. :-)
> > > >
> > > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples
> > where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
> > control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in
> > the fast path.
> > > >
> > > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> > RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> > application specific per-port runtime data, and this library could serve that purpose too.
> > > >
> > >
> > > Thanks Thomas for starting this discussion and Morten for follow-up.
> > >
> > > My thinking is as follows, and I'm particularly keeping in mind the cases
> > > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > >
> > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > we should switch away from the flat arrays or that we need fully dynamic
> > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > at runtime rather than statically. This would allow us to have a
> > > compile-time default value, but, for use cases that need it, allow use of a
> > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > malloc call for the array.  This max limit could then be provided to apps
> > > too if they want to match any array sizes. [Alternatively those apps could
> > > check the provided size and error out if the size has been increased beyond
> > > what the app is designed to use?]. There would be no extra dereferences per
> > > rx/tx burst call in this scenario so performance should be the same as
> > > before (potentially better if array is in hugepage memory, I suppose).
> >
> > I think we need some benchmarks to decide what is the best tradeoff.
> > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > Volunteers?
>
> I had only a quick look at your approach so far.
> But from what I can read, in MT environment your suggestion will require
> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> At least for rte_ethdevs[] and friends.

+1

> Konstantin
>
>
Ananyev, Konstantin June 14, 2021, 3:54 p.m. UTC | #8
> >
> > 14/06/2021 15:15, Bruce Richardson:
> > > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > > Sent: Monday, 14 June 2021 12.59
> > > > >
> > > > > Performance of access in a fixed-size array is very good
> > > > > because of cache locality
> > > > > and because there is a single pointer to dereference.
> > > > > The only drawback is the lack of flexibility:
> > > > > the size of such an array cannot be increase at runtime.
> > > > >
> > > > > An approach to this problem is to allocate the array at runtime,
> > > > > being as efficient as static arrays, but still limited to a maximum.
> > > > >
> > > > > That's why the API rte_parray is introduced,
> > > > > allowing to declare an array of pointer which can be resized
> > > > > dynamically
> > > > > and automatically at runtime while keeping a good read performance.
> > > > >
> > > > > After resize, the previous array is kept until the next resize
> > > > > to avoid crashs during a read without any lock.
> > > > >
> > > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > > This is not good for cache locality but it allows to keep the same
> > > > > memory per element, no matter how the array is resized.
> > > > > Cache locality could be improved with mempools.
> > > > > The other drawback is having to dereference one more pointer
> > > > > to read an element.
> > > > >
> > > > > There is not much locks, so the API is for internal use only.
> > > > > This API may be used to completely remove some compilation-time
> > > > > maximums.
> > > >
> > > > I get the purpose and overall intention of this library.
> > > >
> > > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability.
> It's
> > my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
> > me to stop this progress, and I do not intend to oppose to this library. :-)
> > > >
> > > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
> examples
> > where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
> > control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink
> in
> > the fast path.
> > > >
> > > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> > RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> > application specific per-port runtime data, and this library could serve that purpose too.
> > > >
> > >
> > > Thanks Thomas for starting this discussion and Morten for follow-up.
> > >
> > > My thinking is as follows, and I'm particularly keeping in mind the cases
> > > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > >
> > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > we should switch away from the flat arrays or that we need fully dynamic
> > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > at runtime rather than statically. This would allow us to have a
> > > compile-time default value, but, for use cases that need it, allow use of a
> > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > malloc call for the array.  This max limit could then be provided to apps
> > > too if they want to match any array sizes. [Alternatively those apps could
> > > check the provided size and error out if the size has been increased beyond
> > > what the app is designed to use?]. There would be no extra dereferences per
> > > rx/tx burst call in this scenario so performance should be the same as
> > > before (potentially better if array is in hugepage memory, I suppose).
> >
> > I think we need some benchmarks to decide what is the best tradeoff.
> > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > Volunteers?
> 
> I had only a quick look at your approach so far.
> But from what I can read, in MT environment your suggestion will require
> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> At least for rte_ethdevs[] and friends.
> Konstantin

One more thought here - if we are talking about rte_ethdev[] in particular, I think we can:
1. Move the public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
any regressions.
That could still be a flat array with max_size specified at application startup.
2. Hide the rest of the rte_ethdev struct in the .c file.
That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
(flat array, vector, hash, linked list) without ABI/API breakages.

Yes, it would require all PMDs to change the prototype of the pkt_rx_burst() function
(to accept port_id, queue_id instead of a queue pointer), but the change is a mechanical one.
Probably some macro can be provided to simplify it.

The only significant complication I can foresee with implementing that approach is that
we'll need an array of 'fast' function pointers per queue, not per device as we have now
(to avoid extra indirection for the callback implementation).
Though as a bonus we'll have the ability to use different RX/TX functions per queue.
Thomas Monjalon June 15, 2021, 6:48 a.m. UTC | #9
14/06/2021 17:48, Morten Brørup:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
> 
> Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?

What is big enough?
Is 640KB enough for RAM? ;)

When dealing with microservices switching, the numbers can increase very fast.
Thomas Monjalon June 15, 2021, 6:52 a.m. UTC | #10
14/06/2021 17:48, Jerin Jacob:
> On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > > 14/06/2021 15:15, Bruce Richardson:
> > > > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > > > Sent: Monday, 14 June 2021 12.59
> > > > > >
> > > > > > Performance of access in a fixed-size array is very good
> > > > > > because of cache locality
> > > > > > and because there is a single pointer to dereference.
> > > > > > The only drawback is the lack of flexibility:
> > > > > > the size of such an array cannot be increase at runtime.
> > > > > >
> > > > > > An approach to this problem is to allocate the array at runtime,
> > > > > > being as efficient as static arrays, but still limited to a maximum.
> > > > > >
> > > > > > That's why the API rte_parray is introduced,
> > > > > > allowing to declare an array of pointer which can be resized
> > > > > > dynamically
> > > > > > and automatically at runtime while keeping a good read performance.
> > > > > >
> > > > > > After resize, the previous array is kept until the next resize
> > > > > > to avoid crashs during a read without any lock.
> > > > > >
> > > > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > > > This is not good for cache locality but it allows to keep the same
> > > > > > memory per element, no matter how the array is resized.
> > > > > > Cache locality could be improved with mempools.
> > > > > > The other drawback is having to dereference one more pointer
> > > > > > to read an element.
> > > > > >
> > > > > > There is not much locks, so the API is for internal use only.
> > > > > > This API may be used to completely remove some compilation-time
> > > > > > maximums.
> > > > >
> > > > > I get the purpose and overall intention of this library.
> > > > >
> > > > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's
> > > my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
> > > me to stop this progress, and I do not intend to oppose to this library. :-)
> > > > >
> > > > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples
> > > where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
> > > control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in
> > > the fast path.
> > > > >
> > > > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> > > RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> > > application specific per-port runtime data, and this library could serve that purpose too.
> > > > >
> > > >
> > > > Thanks Thomas for starting this discussion and Morten for follow-up.
> > > >
> > > > My thinking is as follows, and I'm particularly keeping in mind the cases
> > > > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > > >
> > > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > > we should switch away from the flat arrays or that we need fully dynamic
> > > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > > at runtime rather than statically. This would allow us to have a
> > > > compile-time default value, but, for use cases that need it, allow use of a
> > > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > > malloc call for the array.  This max limit could then be provided to apps
> > > > too if they want to match any array sizes. [Alternatively those apps could
> > > > check the provided size and error out if the size has been increased beyond
> > > > what the app is designed to use?]. There would be no extra dereferences per
> > > > rx/tx burst call in this scenario so performance should be the same as
> > > > before (potentially better if array is in hugepage memory, I suppose).
> > >
> > > I think we need some benchmarks to decide what is the best tradeoff.
> > > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > > Volunteers?
> >
> > I had only a quick look at your approach so far.
> > But from what I can read, in MT environment your suggestion will require
> > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > At least for rte_ethdevs[] and friends.
> 
> +1

Please could you have a deeper look and tell me why we need more locks?
The element pointers don't change.
Only the array pointer changes at resize,
but the old one is still usable until the next resize.
I think we don't need more.
Morten Brørup June 15, 2021, 7:53 a.m. UTC | #11
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Tuesday, 15 June 2021 08.48
> 
> 14/06/2021 17:48, Morten Brørup:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> Monjalon
> > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> something big enough to hold a sufficiently large array. And possibly
> add an rte_max_ethports variable to indicate the number of populated
> entries in the array, for use when iterating over the array.
> >
> > Can we come up with another example than RTE_MAX_ETHPORTS where this
> library provides a better benefit?
> 
> What is big enough?
> Is 640KB enough for RAM? ;)

Good point!

I think we agree that:
- The cost of this library is some added complexity, i.e. working with a dynamically sized array through a library instead of just indexing into a compile time fixed size array.
- The main benefit of this library is saving some RAM (and still allowing a potentially very high number of ports.)

My point was: The amount of RAM we are saving is a key parameter for the cost/benefit analysis. And since I don't think the rte_eth_devices[] array uses a significant amount of memory, I was asking for some other array using more memory, where the cost/benefit analysis would come out more advantageous to your proposed parray library.

> 
> When dealing with microservices switching, the numbers can increase
> very fast.

Yes, I strongly supported increasing the port_id type from 8 to 16 bits for this reason, when it was discussed at the DPDK Userspace a few years ago in Dublin. And with large RTE_MAX_QUEUES_PER_PORT values, the rte_eth_dev structure uses quite a lot of space for the rx/tx callback arrays. But the memory usage of rte_eth_devices[] is still relatively insignificant in a system wide context.

If the main purpose is to optimize the rte_eth_devices[] array, I think there are better alternatives than this library. Bruce and Konstantin already threw a few ideas on the table.
Jerin Jacob June 15, 2021, 8 a.m. UTC | #12
On Tue, Jun 15, 2021 at 12:22 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 14/06/2021 17:48, Jerin Jacob:
> > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > > > 14/06/2021 15:15, Bruce Richardson:
> > > > > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > > > > Sent: Monday, 14 June 2021 12.59
> > > > > > >
> > > > > > > Performance of access in a fixed-size array is very good
> > > > > > > because of cache locality
> > > > > > > and because there is a single pointer to dereference.
> > > > > > > The only drawback is the lack of flexibility:
> > > > > > > the size of such an array cannot be increase at runtime.
> > > > > > >
> > > > > > > An approach to this problem is to allocate the array at runtime,
> > > > > > > being as efficient as static arrays, but still limited to a maximum.
> > > > > > >
> > > > > > > That's why the API rte_parray is introduced,
> > > > > > > allowing to declare an array of pointer which can be resized
> > > > > > > dynamically
> > > > > > > and automatically at runtime while keeping a good read performance.
> > > > > > >
> > > > > > > After resize, the previous array is kept until the next resize
> > > > > > > to avoid crashs during a read without any lock.
> > > > > > >
> > > > > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > > > > This is not good for cache locality but it allows to keep the same
> > > > > > > memory per element, no matter how the array is resized.
> > > > > > > Cache locality could be improved with mempools.
> > > > > > > The other drawback is having to dereference one more pointer
> > > > > > > to read an element.
> > > > > > >
> > > > > > > There is not much locks, so the API is for internal use only.
> > > > > > > This API may be used to completely remove some compilation-time
> > > > > > > maximums.
> > > > > >
> > > > > > I get the purpose and overall intention of this library.
> > > > > >
> > > > > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability. It's
> > > > my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
> > > > me to stop this progress, and I do not intend to oppose to this library. :-)
> > > > > >
> > > > > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few examples
> > > > where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
> > > > control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink in
> > > > the fast path.
> > > > > >
> > > > > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> > > > RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> > > > application specific per-port runtime data, and this library could serve that purpose too.
> > > > > >
> > > > >
> > > > > Thanks Thomas for starting this discussion and Morten for follow-up.
> > > > >
> > > > > My thinking is as follows, and I'm particularly keeping in mind the cases
> > > > > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > > > >
> > > > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > > > we should switch away from the flat arrays or that we need fully dynamic
> > > > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > > > at runtime rather than statically. This would allow us to have a
> > > > > compile-time default value, but, for use cases that need it, allow use of a
> > > > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > > > malloc call for the array.  This max limit could then be provided to apps
> > > > > too if they want to match any array sizes. [Alternatively those apps could
> > > > > check the provided size and error out if the size has been increased beyond
> > > > > what the app is designed to use?]. There would be no extra dereferences per
> > > > > rx/tx burst call in this scenario so performance should be the same as
> > > > > before (potentially better if array is in hugepage memory, I suppose).
> > > >
> > > > I think we need some benchmarks to decide what is the best tradeoff.
> > > > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > > > Volunteers?
> > >
> > > I had only a quick look at your approach so far.
> > > But from what I can read, in MT environment your suggestion will require
> > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > At least for rte_ethdevs[] and friends.
> >
> > +1
>
> Please could you have a deeper look and tell me why we need more locks?

We don't need more locks (it is a fat mutex) in the current implementation.

If it needs to be used in the fast path, we need more state-of-the-art
synchronization like RCU.

Also, you can take a look at the VPP dynamic array implementation, which is
used in the fast path.

https://docs.fd.io/vpp/21.10/db/d65/vec_8h.html

So the question is the use case for this API. Is it for slow-path items
like the ethdev[] memory,
or fast-path items like holding an array of mbufs, etc.?


> The element pointers doesn't change.
> Only the array pointer change at resize,
> but the old one is still usable until the next resize.
> I think we don't need more.
>
>
Bruce Richardson June 15, 2021, 8:44 a.m. UTC | #13
On Tue, Jun 15, 2021 at 09:53:33AM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > Sent: Tuesday, 15 June 2021 08.48
> > 
> > 14/06/2021 17:48, Morten Brørup:
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > Monjalon
> > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > something big enough to hold a sufficiently large array. And possibly
> > add an rte_max_ethports variable to indicate the number of populated
> > entries in the array, for use when iterating over the array.
> > >
> > > Can we come up with another example than RTE_MAX_ETHPORTS where this
> > library provides a better benefit?
> > 
> > What is big enough?
> > Is 640KB enough for RAM? ;)
> 
> Good point!
> 
> I think we agree that:
> - The cost of this library is some added complexity, i.e. working with a dynamically sized array through a library instead of just indexing into a compile time fixed size array.
> - The main benefit of this library is saving some RAM (and still allowing a potentially very high number of ports.)
> 
> My point was: The amount of RAM we are saving is a key parameter for the cost/benefit analysis. And since I don't think the rte_eth_devices[] array uses a significant amount of memory, I was asking for some other array using more memory, where the cost/benefit analysis would come out more advantageous to your proposed parray library.
> 
> > 
> > When dealing with microservices switching, the numbers can increase
> > very fast.
> 
> Yes, I strongly supported increasing the port_id type from 8 to 16 bits for this reason, when it was discussed at the DPDK Userspace a few years ago in Dublin. And with large RTE_MAX_QUEUES_PER_PORT values, the rte_eth_dev structure uses quite a lot of space for the rx/tx callback arrays. But the memory usage of rte_eth_devices[] is still relatively insignificant in a system wide context.
> 
> If main purpose is to optimize the rte_eth_devices[] array, I think there are better alternatives than this library. Bruce and Konstantin already threw a few ideas on the table.
>

Yes, though I think we need to be clear on what problems we are trying to
solve here. A generic resizable array may be a useful library for DPDK in
its own right, but for the ethdev (and other devs) arrays I think my
understanding of the problem is that we want:

* scalability of ethdevs list to large numbers of ports, e.g. 2k
* while not paying a large memory footprint penalty for those apps which
  only need a small number of ports, e.g. 2 or 4.

Is that a fair summary?

/Bruce
Thomas Monjalon June 15, 2021, 9:18 a.m. UTC | #14
15/06/2021 10:00, Jerin Jacob:
> On Tue, Jun 15, 2021 at 12:22 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 14/06/2021 17:48, Jerin Jacob:
> > > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com> wrote:
> > > > > 14/06/2021 15:15, Bruce Richardson:
> > > > > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > > > > we should switch away from the flat arrays or that we need fully dynamic
> > > > > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > > > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > > > > at runtime rather than statically. This would allow us to have a
> > > > > > compile-time default value, but, for use cases that need it, allow use of a
> > > > > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > > > > malloc call for the array.  This max limit could then be provided to apps
> > > > > > too if they want to match any array sizes. [Alternatively those apps could
> > > > > > check the provided size and error out if the size has been increased beyond
> > > > > > what the app is designed to use?]. There would be no extra dereferences per
> > > > > > rx/tx burst call in this scenario so performance should be the same as
> > > > > > before (potentially better if array is in hugepage memory, I suppose).
> > > > >
> > > > > I think we need some benchmarks to decide what is the best tradeoff.
> > > > > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > > > > Volunteers?
> > > >
> > > > I had only a quick look at your approach so far.
> > > > But from what I can read, in MT environment your suggestion will require
> > > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > > At least for rte_ethdevs[] and friends.
> > >
> > > +1
> >
> > Please could you have a deeper look and tell me why we need more locks?
> 
> We don't need more locks (It is fat mutex) now in the implementation.
> 
> If it needs to use in fastpath, we need more state of art
> synchronization like RCU.
> 
> Also, you can take look at VPP dynamic array implementation which is
> used in fastpath.
> 
> https://docs.fd.io/vpp/21.10/db/d65/vec_8h.html
> 
> So the question is the use case for this API. Is it for slowpath item
> like ethdev[] memory
> or fastpath items like holding an array of mbuf etc.

As I replied to Morten, it is for reads in the fast path
and alloc/free in the slow path.
I should highlight this in the commit log if there is a v2.
That's why there is a mutex in alloc/free and nothing in the read path.

> > The element pointers doesn't change.
> > Only the array pointer change at resize,
> > but the old one is still usable until the next resize.
> > I think we don't need more.
Thomas Monjalon June 15, 2021, 9:28 a.m. UTC | #15
15/06/2021 10:44, Bruce Richardson:
> On Tue, Jun 15, 2021 at 09:53:33AM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > Sent: Tuesday, 15 June 2021 08.48
> > > 
> > > 14/06/2021 17:48, Morten Brørup:
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > > Monjalon
> > > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > > something big enough to hold a sufficiently large array. And possibly
> > > add an rte_max_ethports variable to indicate the number of populated
> > > entries in the array, for use when iterating over the array.
> > > >
> > > > Can we come up with another example than RTE_MAX_ETHPORTS where this
> > > library provides a better benefit?
> > > 
> > > What is big enough?
> > > Is 640KB enough for RAM? ;)
> > 
> > Good point!
> > 
> > I think we agree that:
> > - The cost of this library is some added complexity, i.e. working with a dynamically sized array through a library instead of just indexing into a compile time fixed size array.
> > - The main benefit of this library is saving some RAM (and still allowing a potentially very high number of ports.)
> > 
> > My point was: The amount of RAM we are saving is a key parameter for the cost/benefit analysis. And since I don't think the rte_eth_devices[] array uses a significant amount of memory, I was asking for some other array using more memory, where the cost/benefit analysis would come out more advantageous to your proposed parray library.
> > 
> > > 
> > > When dealing with microservices switching, the numbers can increase
> > > very fast.
> > 
> > Yes, I strongly supported increasing the port_id type from 8 to 16 bits for this reason, when it was discussed at the DPDK Userspace a few years ago in Dublin. And with large RTE_MAX_QUEUES_PER_PORT values, the rte_eth_dev structure uses quite a lot of space for the rx/tx callback arrays. But the memory usage of rte_eth_devices[] is still relatively insignificant in a system wide context.
> > 
> > If main purpose is to optimize the rte_eth_devices[] array, I think there are better alternatives than this library. Bruce and Konstantin already threw a few ideas on the table.
> >
> 
> Yes, though I think we need to be clear on what problems we are trying to
> solve here. A generic resizable array may be a useful library for DPDK in
> its own right, but for the ethdev (and other devs) arrays I think my
> understanding of the problem is that we want:
> 
> * scalability of ethdevs list to large numbers of ports, e.g. 2k
> * while not paying a large memory footprint penalty for those apps which
>   only need a small number of ports, e.g. 2 or 4.
> 
> Is that a fair summary?

Yes.

We must take into account two related issues:
	- the app and libs could allocate some data per device,
increasing the bill;
	- per-device allocation may be more efficient
if allocated on the NUMA node of the device.
Ananyev, Konstantin June 15, 2021, 9:33 a.m. UTC | #16
> 14/06/2021 17:48, Jerin Jacob:
> > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > > > 14/06/2021 15:15, Bruce Richardson:
> > > > > On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > > > > > > Sent: Monday, 14 June 2021 12.59
> > > > > > >
> > > > > > > Performance of access in a fixed-size array is very good
> > > > > > > because of cache locality
> > > > > > > and because there is a single pointer to dereference.
> > > > > > > The only drawback is the lack of flexibility:
> > > > > > > the size of such an array cannot be increase at runtime.
> > > > > > >
> > > > > > > An approach to this problem is to allocate the array at runtime,
> > > > > > > being as efficient as static arrays, but still limited to a maximum.
> > > > > > >
> > > > > > > That's why the API rte_parray is introduced,
> > > > > > > allowing to declare an array of pointer which can be resized
> > > > > > > dynamically
> > > > > > > and automatically at runtime while keeping a good read performance.
> > > > > > >
> > > > > > > After resize, the previous array is kept until the next resize
> > > > > > > to avoid crashs during a read without any lock.
> > > > > > >
> > > > > > > Each element is a pointer to a memory chunk dynamically allocated.
> > > > > > > This is not good for cache locality but it allows to keep the same
> > > > > > > memory per element, no matter how the array is resized.
> > > > > > > Cache locality could be improved with mempools.
> > > > > > > The other drawback is having to dereference one more pointer
> > > > > > > to read an element.
> > > > > > >
> > > > > > > There is not much locks, so the API is for internal use only.
> > > > > > > This API may be used to completely remove some compilation-time
> > > > > > > maximums.
> > > > > >
> > > > > > I get the purpose and overall intention of this library.
> > > > > >
> > > > > > I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime
> configurability. It's
> > > > my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way
> for
> > > > me to stop this progress, and I do not intend to oppose to this library. :-)
> > > > > >
> > > > > > This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
> examples
> > > > where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line
> between
> > > > control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and
> shrink in
> > > > the fast path.
> > > > > >
> > > > > > If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> > > > RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> > > > application specific per-port runtime data, and this library could serve that purpose too.
> > > > > >
> > > > >
> > > > > Thanks Thomas for starting this discussion and Morten for follow-up.
> > > > >
> > > > > My thinking is as follows, and I'm particularly keeping in mind the cases
> > > > > of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > > > >
> > > > > While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> > > > > we should switch away from the flat arrays or that we need fully dynamic
> > > > > arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> > > > > house here, where we keep the ethdevs as an array, but one allocated/sized
> > > > > at runtime rather than statically. This would allow us to have a
> > > > > compile-time default value, but, for use cases that need it, allow use of a
> > > > > flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> > > > > malloc call for the array.  This max limit could then be provided to apps
> > > > > too if they want to match any array sizes. [Alternatively those apps could
> > > > > check the provided size and error out if the size has been increased beyond
> > > > > what the app is designed to use?]. There would be no extra dereferences per
> > > > > rx/tx burst call in this scenario so performance should be the same as
> > > > > before (potentially better if array is in hugepage memory, I suppose).
> > > >
> > > > I think we need some benchmarks to decide what is the best tradeoff.
> > > > I spent time on this implementation, but sorry I won't have time for benchmarks.
> > > > Volunteers?
> > >
> > > I had only a quick look at your approach so far.
> > > But from what I can read, in MT environment your suggestion will require
> > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > At least for rte_ethdevs[] and friends.
> >
> > +1
> 
> Please could you have a deeper look and tell me why we need more locks?
> The element pointers doesn't change.
> Only the array pointer change at resize,

Yes, the array pointer changes at resize, and a reader has to read that value
to access elements in the parray. Which means that we need some sync
between readers and updaters to avoid a reader using a stale pointer (ref-counter, RCU, etc.).
I.e. the updater can free the old array pointer *only* when it can guarantee that there are no
readers that still use it.

> but the old one is still usable until the next resize.

OK, but what is the guarantee that a reader will *always* finish before the next resize?
As an example of such race condition:

/* global one */
	struct rte_parray pa;

/* thread #1, tries to read elem from the array */ 
 	....
	int **x = pa->array; 

/* thread # 1 get suspended for a while  at that point */

/* meanwhile thread #2 does: */
	....
	/* causes first resize(), x still valid, points to pa->old_array */ 
	rte_parray_alloc(&pa, ...); 
	.....
	/* causes second resize(), x now points to freed memory */
	rte_parray_alloc(&pa, ...);
	...

/* at that point thread #1 resumes: */

	/* contents of x[0] are undefined, 'p' could point anywhere,
	     might cause segfault or silent memory corruption */  
	int *p = x[0];


Yes, the probability of such a situation is quite small,
but it is still possible.
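To make the window concrete, here is a minimal single-threaded model of the keep-one-old-generation scheme (field and function names are hypothetical, not the actual rte_parray API). Old generations are deliberately *not* freed in this sketch, so the pointer comparisons below stay legal C; in the real scheme the second resize would free() the first generation.

```c
#include <stdlib.h>

/* Hypothetical model of the "keep one old generation" resize scheme. */
struct parray {
	int **array;      /* current generation */
	int **old_array;  /* previous generation, kept after a resize */
	int size;
};

static void parray_init(struct parray *pa, int size)
{
	pa->array = calloc((size_t)size, sizeof(*pa->array));
	pa->old_array = NULL;
	pa->size = size;
}

static void parray_resize(struct parray *pa)
{
	int **bigger = calloc((size_t)pa->size * 2, sizeof(*bigger));

	for (int i = 0; i < pa->size; i++)
		bigger[i] = pa->array[i];
	/* real code would do: free(pa->old_array); -- the dangerous step */
	pa->old_array = pa->array;
	pa->array = bigger;
	pa->size *= 2;
}
```

After one resize, a reader's saved pointer still matches pa->old_array; after a second resize it matches neither generation, which is exactly the window described above.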

> I think we don't need more.
Thomas Monjalon June 15, 2021, 9:50 a.m. UTC | #17
15/06/2021 11:33, Ananyev, Konstantin:
> > 14/06/2021 17:48, Jerin Jacob:
> > > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com> wrote:
> > > > I had only a quick look at your approach so far.
> > > > But from what I can read, in MT environment your suggestion will require
> > > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > > At least for rte_ethdevs[] and friends.
> > >
> > > +1
> > 
> > Please could you have a deeper look and tell me why we need more locks?
> > The element pointers doesn't change.
> > Only the array pointer change at resize,
> 
> Yes, array pointer changes at resize, and reader has to read that value
> to access elements in the parray. Which means that we need some sync
> between readers and updaters to avoid reader using stale pointer (ref-counter, rcu, etc.).

No
The old array is still there, so we don't need sync.

> I.E. updater can free old array pointer *only* when it can guarantee that there are no
> readers that still use it.

No
Reading an element is OK because the pointer to the element is not changed.
Getting the pointer to an element from the index is the only thing
which is blocking the freeing of an array,
and I see no reason why dereferencing an index would be longer
than 2 consecutive resizes of the array.

> > but the old one is still usable until the next resize.
> 
> Ok, but what is the guarantee that reader would *always* finish till next resize?
> As an example of such race condition:
> 
> /* global one */
> 	struct rte_parray pa;
> 
> /* thread #1, tries to read elem from the array */ 
>  	....
> 	int **x = pa->array;

We should not save the array pointer.
Each index must be dereferenced with the macro
getting the current array pointer.
So an interruption can only happen during the dereference of a single index.
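For illustration, the access pattern described here could be sketched as follows (the macro name and struct layout are invented, not the actual patch API): the current array pointer is read and indexed inside one expression, so a reader never keeps a long-lived copy of pa->array.

```c
#include <stddef.h>

/* Hypothetical sketch of index dereference through the *current*
 * array pointer, in a single expression. */
struct parray {
	void **array;   /* current generation of element pointers */
	int size;
};

#define PARRAY_P(type, pa, idx) ((type *)(pa)->array[(idx)])
```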

> /* thread # 1 get suspended for a while  at that point */
> 
> /* meanwhile thread #2 does: */
> 	....
> 	/* causes first resize(), x still valid, points to pa->old_array */ 
> 	rte_parray_alloc(&pa, ...); 
> 	.....
> 	/* causes second resize(), x now points to freed memory */
> 	rte_parray_alloc(&pa, ...);
> 	...

2 resizes is a very long time, it is at minimum 33 allocations!

> /* at that point thread #1 resumes: */
> 
> 	/* contents of x[0] are undefined, 'p' could point anywhere,
> 	     might cause segfault or silent memory corruption */  
> 	int *p = x[0];
> 
> 
> Yes probability of such situation is quite small.
> But it is still possible.

In device probing, I don't see how it is realistically possible:
33 device allocations during 1 device index being dereferenced.
I agree it is tricky, but that's the whole point of finding tricks
to keep fast code.

> > I think we don't need more.
Ananyev, Konstantin June 15, 2021, 10:08 a.m. UTC | #18
> 
> 15/06/2021 11:33, Ananyev, Konstantin:
> > > 14/06/2021 17:48, Jerin Jacob:
> > > > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com> wrote:
> > > > > I had only a quick look at your approach so far.
> > > > > But from what I can read, in MT environment your suggestion will require
> > > > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > > > At least for rte_ethdevs[] and friends.
> > > >
> > > > +1
> > >
> > > Please could you have a deeper look and tell me why we need more locks?
> > > The element pointers doesn't change.
> > > Only the array pointer change at resize,
> >
> > Yes, array pointer changes at resize, and reader has to read that value
> > to access elements in the parray. Which means that we need some sync
> > between readers and updaters to avoid reader using stale pointer (ref-counter, rcu, etc.).
> 
> No
> The old array is still there, so we don't need sync.
> 
> > I.E. updater can free old array pointer *only* when it can guarantee that there are no
> > readers that still use it.
> 
> No
> Reading an element is OK because the pointer to the element is not changed.
> Getting the pointer to an element from the index is the only thing
> which is blocking the freeing of an array,
> and I see no reason why dereferencing an index would be longer
> than 2 consecutive resizes of the array.

In general, your thread can be switched off the CPU at any moment.
And you don't know for sure when it will be scheduled back.

> 
> > > but the old one is still usable until the next resize.
> >
> > Ok, but what is the guarantee that reader would *always* finish till next resize?
> > As an example of such race condition:
> >
> > /* global one */
> > 	struct rte_parray pa;
> >
> > /* thread #1, tries to read elem from the array */
> >  	....
> > 	int **x = pa->array;
> 
> We should not save the array pointer.
> Each index must be dereferenced with the macro
> getting the current array pointer.
> So the interrupt is during dereference of a single index.

You still need to read your pa->array somewhere (let's say into a register).
Straight after that your thread can be interrupted.
Then when it is scheduled back to the CPU, that value (in the register) might be a stale one.

> 
> > /* thread # 1 get suspended for a while  at that point */
> >
> > /* meanwhile thread #2 does: */
> > 	....
> > 	/* causes first resize(), x still valid, points to pa->old_array */
> > 	rte_parray_alloc(&pa, ...);
> > 	.....
> > 	/* causes second resize(), x now points to freed memory */
> > 	rte_parray_alloc(&pa, ...);
> > 	...
> 
> 2 resizes is a very long time, it is at minimum 33 allocations!
> 
> > /* at that point thread #1 resumes: */
> >
> > 	/* contents of x[0] are undefined, 'p' could point anywhere,
> > 	     might cause segfault or silent memory corruption */
> > 	int *p = x[0];
> >
> >
> > Yes probability of such situation is quite small.
> > But it is still possible.
> 
> In device probing, I don't see how it is realistically possible:
> 33 device allocations during 1 device index being dereferenced.

Yeah, it would work fine 1M times, but sometimes it will crash,
which will make it even harder to reproduce, debug and fix.
I think that when introducing a new generic library into DPDK,
we should avoid making such assumptions.

> I agree it is tricky, but that's the whole point of finding tricks
> to keep fast code.

It is not tricky, it is buggy 😊
You are introducing a race condition into the new core generic library by design,
and trying to convince people that it is *OK*.
Sorry, but NACK from me until that issue is addressed.


> 
> > > I think we don't need more.
> 
>
Thomas Monjalon June 15, 2021, 2:02 p.m. UTC | #19
15/06/2021 12:08, Ananyev, Konstantin:
> > 15/06/2021 11:33, Ananyev, Konstantin:
> > > > 14/06/2021 17:48, Jerin Jacob:
> > > > > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > > I had only a quick look at your approach so far.
> > > > > > But from what I can read, in MT environment your suggestion will require
> > > > > > extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> > > > > > I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> > > > > > At least for rte_ethdevs[] and friends.
> > > > >
> > > > > +1
> > > >
> > > > Please could you have a deeper look and tell me why we need more locks?
> > > > The element pointers doesn't change.
> > > > Only the array pointer change at resize,
> > >
> > > Yes, array pointer changes at resize, and reader has to read that value
> > > to access elements in the parray. Which means that we need some sync
> > > between readers and updaters to avoid reader using stale pointer (ref-counter, rcu, etc.).
> > 
> > No
> > The old array is still there, so we don't need sync.
> > 
> > > I.E. updater can free old array pointer *only* when it can guarantee that there are no
> > > readers that still use it.
> > 
> > No
> > Reading an element is OK because the pointer to the element is not changed.
> > Getting the pointer to an element from the index is the only thing
> > which is blocking the freeing of an array,
> > and I see no reason why dereferencing an index would be longer
> > than 2 consecutive resizes of the array.
> 
> In general, your thread can be switched off the cpu at any moment.
> And you don't know for sure when it will be scheduled back.
> 
> > 
> > > > but the old one is still usable until the next resize.
> > >
> > > Ok, but what is the guarantee that reader would *always* finish till next resize?
> > > As an example of such race condition:
> > >
> > > /* global one */
> > > 	struct rte_parray pa;
> > >
> > > /* thread #1, tries to read elem from the array */
> > >  	....
> > > 	int **x = pa->array;
> > 
> > We should not save the array pointer.
> > Each index must be dereferenced with the macro
> > getting the current array pointer.
> > So the interrupt is during dereference of a single index.
> 
> You still need to read your pa->array somewhere (let say into a register).
> Straight after that your thread can be interrupted.
> Then when it is scheduled back to the CPU that value (in a register) might be s stale one.
> 
> > 
> > > /* thread # 1 get suspended for a while  at that point */
> > >
> > > /* meanwhile thread #2 does: */
> > > 	....
> > > 	/* causes first resize(), x still valid, points to pa->old_array */
> > > 	rte_parray_alloc(&pa, ...);
> > > 	.....
> > > 	/* causes second resize(), x now points to freed memory */
> > > 	rte_parray_alloc(&pa, ...);
> > > 	...
> > 
> > 2 resizes is a very long time, it is at minimum 33 allocations!
> > 
> > > /* at that point thread #1 resumes: */
> > >
> > > 	/* contents of x[0] are undefined, 'p' could point anywhere,
> > > 	     might cause segfault or silent memory corruption */
> > > 	int *p = x[0];
> > >
> > >
> > > Yes probability of such situation is quite small.
> > > But it is still possible.
> > 
> > In device probing, I don't see how it is realistically possible:
> > 33 device allocations during 1 device index being dereferenced.
> 
> Yeh, it would work fine 1M times, but sometimes will crash.

Sometimes a thread will be interrupted during 33 device allocations?

> Which will make it even harder to reproduce, debug and fix.
> I think that when introducing a new generic library into DPDK,
> we should avoid making such assumptions.

I intend to make it internal-only (I should have named it eal_parray).

> > I agree it is tricky, but that's the whole point of finding tricks
> > to keep fast code.
> 
> It is not tricky, it is buggy 😊
> You introducing a race condition into the new core generic library by design,
> and trying to convince people that it is *OK*.

Yes, because I am convinced myself.

> Sorry, but NACK from me till that issue will be addressed.

It is not an issue, but a design choice.
If you think that a thread can be interrupted during 33 device allocations
then we should find another implementation, but I am quite sure it will be slower.
Honnappa Nagarahalli June 15, 2021, 2:37 p.m. UTC | #20
<snip>

> 
> 15/06/2021 12:08, Ananyev, Konstantin:
> > > 15/06/2021 11:33, Ananyev, Konstantin:
> > > > > 14/06/2021 17:48, Jerin Jacob:
> > > > > > On Mon, Jun 14, 2021 at 8:29 PM Ananyev, Konstantin
> > > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > > > I had only a quick look at your approach so far.
> > > > > > > But from what I can read, in MT environment your suggestion
> > > > > > > will require extra synchronization for each read-write access to
> such parray element (lock, rcu, ...).
> > > > > > > I think what Bruce suggests will be much ligther, easier to
> implement and less error prone.
> > > > > > > At least for rte_ethdevs[] and friends.
> > > > > >
> > > > > > +1
> > > > >
> > > > > Please could you have a deeper look and tell me why we need more
> locks?
> > > > > The element pointers doesn't change.
> > > > > Only the array pointer change at resize,
> > > >
> > > > Yes, array pointer changes at resize, and reader has to read that
> > > > value to access elements in the parray. Which means that we need
> > > > some sync between readers and updaters to avoid reader using stale
> pointer (ref-counter, rcu, etc.).
> > >
> > > No
> > > The old array is still there, so we don't need sync.
> > >
> > > > I.E. updater can free old array pointer *only* when it can
> > > > guarantee that there are no readers that still use it.
> > >
> > > No
> > > Reading an element is OK because the pointer to the element is not
> changed.
> > > Getting the pointer to an element from the index is the only thing
> > > which is blocking the freeing of an array, and I see no reason why
> > > dereferencing an index would be longer than 2 consecutive resizes of
> > > the array.
> >
> > In general, your thread can be switched off the cpu at any moment.
> > And you don't know for sure when it will be scheduled back.
> >
> > >
> > > > > but the old one is still usable until the next resize.
> > > >
> > > > Ok, but what is the guarantee that reader would *always* finish till next
> resize?
> > > > As an example of such race condition:
> > > >
> > > > /* global one */
> > > > 	struct rte_parray pa;
> > > >
> > > > /* thread #1, tries to read elem from the array */
> > > >  	....
> > > > 	int **x = pa->array;
> > >
> > > We should not save the array pointer.
> > > Each index must be dereferenced with the macro getting the current
> > > array pointer.
> > > So the interrupt is during dereference of a single index.
> >
> > You still need to read your pa->array somewhere (let say into a register).
> > Straight after that your thread can be interrupted.
> > Then when it is scheduled back to the CPU that value (in a register) might be
> s stale one.
> >
> > >
> > > > /* thread # 1 get suspended for a while  at that point */
> > > >
> > > > /* meanwhile thread #2 does: */
> > > > 	....
> > > > 	/* causes first resize(), x still valid, points to pa->old_array */
> > > > 	rte_parray_alloc(&pa, ...);
> > > > 	.....
> > > > 	/* causes second resize(), x now points to freed memory */
> > > > 	rte_parray_alloc(&pa, ...);
> > > > 	...
> > >
> > > 2 resizes is a very long time, it is at minimum 33 allocations!
> > >
> > > > /* at that point thread #1 resumes: */
> > > >
> > > > 	/* contents of x[0] are undefined, 'p' could point anywhere,
> > > > 	     might cause segfault or silent memory corruption */
> > > > 	int *p = x[0];
> > > >
> > > >
> > > > Yes probability of such situation is quite small.
> > > > But it is still possible.
> > >
> > > In device probing, I don't see how it is realistically possible:
> > > 33 device allocations during 1 device index being dereferenced.
> >
> > Yeh, it would work fine 1M times, but sometimes will crash.
> 
> Sometimes a thread will be interrupted during 33 device allocations?
> 
> > Which will make it even harder to reproduce, debug and fix.
> > I think that when introducing a new generic library into DPDK, we
> > should avoid making such assumptions.
> 
> I intend to make it internal-only (I should have named it eal_parray).
> 
> > > I agree it is tricky, but that's the whole point of finding tricks
> > > to keep fast code.
> >
> > It is not tricky, it is buggy 😊
> > You introducing a race condition into the new core generic library by
> > design, and trying to convince people that it is *OK*.
> 
> Yes, because I am convinced myself.
> 
> > Sorry, but NACK from me till that issue will be addressed.
Agree here that a synchronization mechanism is required to indicate when it is safe to free the old array; an ACK from the readers is required before freeing it. We cannot use an "enough time has passed" argument.

As others have mentioned, I think the key is the use case. Not all use cases require a dynamically resized array; a dynamically allocated array at init time would be enough.

If a dynamically resized array is required, using RCU (or any other mechanism) is necessary. I do not think these use cases should be characterized by the size of the memory/array in question (it might be a small chunk in a system with abundant memory, but might be a big chunk in a system with small amount of memory). The current RCU library provides good options to hide complexities from the application or allow the application to handle complexities if it wants.
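The reader-ACK handshake can be sketched, independently of the actual rte_rcu_qsbr API, as a per-reader quiescent-state counter (all names here are hypothetical): each reader bumps its counter whenever it holds no array pointer, and the updater frees an old generation only after every reader has advanced past the snapshot taken at publication time.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical quiescent-state sketch of the reader-ACK handshake;
 * DPDK's rte_rcu_qsbr library implements the real mechanism. */
#define MAX_READERS 4

static atomic_ulong reader_cnt[MAX_READERS];

/* Reader: call when holding no pointer into the parray. */
static void reader_quiescent(int id)
{
	atomic_fetch_add(&reader_cnt[id], 1);
}

/* Updater: snapshot all readers right after publishing a new array. */
static void updater_snapshot(unsigned long snap[MAX_READERS])
{
	for (int i = 0; i < MAX_READERS; i++)
		snap[i] = atomic_load(&reader_cnt[i]);
}

/* Updater: the old generation is reclaimable only once every reader
 * has reported a quiescent state since the snapshot. */
static bool updater_can_free(const unsigned long snap[MAX_READERS])
{
	for (int i = 0; i < MAX_READERS; i++)
		if (atomic_load(&reader_cnt[i]) == snap[i])
			return false;
	return true;
}
```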

> 
> It is not an issue, but a design.
> If you think that a thread can be interrupted during 33 device allocations then
> we should find another implementation, but I am quite sure it will be slower.
>
Jerin Jacob June 16, 2021, 9:42 a.m. UTC | #21
On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 14/06/2021 17:48, Morten Brørup:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
> >
> > Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?
>
> What is big enough?
> Is 640KB enough for RAM? ;)

If I understand it correctly, the Linux process allocates 640KB due to
the fact that currently
struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
lives in BSS.

If we make this come from the heap, i.e. use malloc() to allocate this memory,
then in my understanding Linux
really won't allocate the real backing pages for this memory until
someone writes to or reads from it.

i.e. it will be free virtual memory, with help from Linux memory management.
If so, we can keep large values for RTE_MAX_ETHPORTS
without wasting any "real" memory even when the system has only a few ports.

Thoughts?



>
> When dealing with microservices switching, the numbers can increase very fast.
>
>
Anatoly Burakov June 16, 2021, 11:11 a.m. UTC | #22
On 14-Jun-21 11:58 AM, Thomas Monjalon wrote:
> Performance of access in a fixed-size array is very good
> because of cache locality
> and because there is a single pointer to dereference.
> The only drawback is the lack of flexibility:
> the size of such an array cannot be increase at runtime.
> 
> An approach to this problem is to allocate the array at runtime,
> being as efficient as static arrays, but still limited to a maximum.
> 
> That's why the API rte_parray is introduced,
> allowing to declare an array of pointer which can be resized dynamically
> and automatically at runtime while keeping a good read performance.
> 
> After resize, the previous array is kept until the next resize
> to avoid crashs during a read without any lock.
> 
> Each element is a pointer to a memory chunk dynamically allocated.
> This is not good for cache locality but it allows to keep the same
> memory per element, no matter how the array is resized.
> Cache locality could be improved with mempools.
> The other drawback is having to dereference one more pointer
> to read an element.
> 
> There is not much locks, so the API is for internal use only.
> This API may be used to completely remove some compilation-time maximums.
> 
> Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> ---

<snip>

> +int32_t
> +rte_parray_find_next(struct rte_parray *obj, int32_t index)
> +{
> +	if (obj == NULL || index < 0) {
> +		rte_errno = EINVAL;
> +		return -1;
> +	}
> +
> +	pthread_mutex_lock(&obj->mutex);
> +
> +	while (index < obj->size && obj->array[index] == NULL)
> +		index++;
> +	if (index >= obj->size)
> +		index = -1;
> +
> +	pthread_mutex_unlock(&obj->mutex);
> +
> +	rte_errno = 0;
> +	return index;
> +}
> +

Just a general comment about this:

I'm not really sure I like this "kinda-sorta-threadsafe-but-not-really"
approach. IMO something either should be thread-safe, or it should be 
explicitly not thread-safe. There's no point in locking here because any 
user of find_next() will *necessarily* race with other users, because by 
the time we exit the function, the result becomes stale - so why are we 
locking in the first place?

Would it perhaps be better to leave it as non-thread-safe at its core,
but introduce wrappers for atomic-like access to the array? E.g.
something like `rte_parray_find_next_free_and_set()` that would perform
the lock-find-next-set-unlock sequence? Or, alternatively, keep the
mutex there, but provide APIs for explicit locking, and put the burden
on the user to actually do the locking correctly.
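The combined helper suggested here could look something like the following sketch (the struct layout is hypothetical, only mirroring the lock-find-next-set-unlock sequence described above): because the slot is claimed before the mutex is released, the returned index cannot go stale under a concurrent caller.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical parray with an embedded mutex, as in the patch. */
struct parray {
	pthread_mutex_t mutex;
	void **array;
	int size;
};

/* Find the first free slot and claim it, all under the lock.
 * Returns the claimed index, or -1 if the array is full. */
static int parray_find_next_free_and_set(struct parray *obj, void *elem)
{
	int idx = -1;

	pthread_mutex_lock(&obj->mutex);
	for (int i = 0; i < obj->size; i++) {
		if (obj->array[i] == NULL) {
			obj->array[i] = elem;   /* claim before unlocking */
			idx = i;
			break;
		}
	}
	pthread_mutex_unlock(&obj->mutex);

	return idx;
}
```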
Morten Brørup June 16, 2021, 11:27 a.m. UTC | #23
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Wednesday, 16 June 2021 11.42
> 
> On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net>
> wrote:
> >
> > 14/06/2021 17:48, Morten Brørup:
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> Monjalon
> > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> something big enough to hold a sufficiently large array. And possibly
> add an rte_max_ethports variable to indicate the number of populated
> entries in the array, for use when iterating over the array.
> > >
> > > Can we come up with another example than RTE_MAX_ETHPORTS where
> this library provides a better benefit?
> >
> > What is big enough?
> > Is 640KB enough for RAM? ;)
> 
> If I understand it correctly, Linux process allocates 640KB due to
> that fact currently
> struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> is from BSS.

Correct.

> If we make this from heap i.e use malloc() to allocate this memory
> then in my understanding Linux
> really won't allocate the real page for backend memory until unless,
> someone write/read to this memory.

If the array is allocated from the heap, its members will be accessed through a pointer to the array, e.g. in rte_eth_rx/tx_burst(). This might affect performance, which is probably why the array is allocated the way it is.

Although it might be worth investigating how much it actually affects the performance.

So we need to do something else if we want to conserve memory and still allow a large rte_eth_devices[] array.

Looking at struct rte_eth_dev, we could reduce its size as follows:

1. Change the two callback arrays post_rx/pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT] to pointers to callback arrays, which are allocated from the heap.
With the default RTE_MAX_QUEUES_PER_PORT of 1024, these two arrays are the sinners that make the struct rte_eth_dev use so much memory. This modification would save 16 KB (minus 16 bytes for the pointers to the two arrays) per port.
Furthermore, these callback arrays would only need to be allocated if the application is compiled with callbacks enabled (#define RTE_ETHDEV_RXTX_CALLBACKS). And they would only need to be sized to the actual number of queues for the port.

The disadvantage is that this would add another level of indirection, although only for applications compiled with callbacks enabled.

2. Remove reserved_64s[4] and reserved_ptrs[4]. This would save 64 bytes per port. Not much, but worth considering if we are changing the API/ABI anyway.
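Suggestion 1 can be illustrated by comparing an inline callback array against a pointer to a heap-allocated one (the struct names are invented; the real struct rte_eth_dev has many more fields): with the default of 1024 queues, the two inline arrays alone account for 16 KB per port.

```c
#include <stddef.h>

#define MAX_QUEUES 1024

struct cb;  /* opaque per-queue callback record */

/* Today: the two callback arrays live inside the device struct. */
struct dev_inline_cbs {
	struct cb *post_rx_burst_cbs[MAX_QUEUES];
	struct cb *pre_tx_burst_cbs[MAX_QUEUES];
};

/* Proposed: pointers to heap-allocated arrays, sized on demand. */
struct dev_heap_cbs {
	struct cb **post_rx_burst_cbs;
	struct cb **pre_tx_burst_cbs;
};
```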
Jerin Jacob June 16, 2021, noon UTC | #24
On Wed, Jun 16, 2021 at 4:57 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Wednesday, 16 June 2021 11.42
> >
> > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net>
> > wrote:
> > >
> > > 14/06/2021 17:48, Morten Brørup:
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > Monjalon
> > > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > something big enough to hold a sufficiently large array. And possibly
> > add an rte_max_ethports variable to indicate the number of populated
> > entries in the array, for use when iterating over the array.
> > > >
> > > > Can we come up with another example than RTE_MAX_ETHPORTS where
> > this library provides a better benefit?
> > >
> > > What is big enough?
> > > Is 640KB enough for RAM? ;)
> >
> > If I understand it correctly, Linux process allocates 640KB due to
> > that fact currently
> > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> > is from BSS.
>
> Correct.
>
> > If we make this from heap i.e use malloc() to allocate this memory
> > then in my understanding Linux
> > really won't allocate the real page for backend memory until unless,
> > someone write/read to this memory.
>
> If the array is allocated from the heap, its members will be accessed though a pointer to the array, e.g. in rte_eth_rx/tx_burst(). This might affect performance, which is probably why the array is allocated the way it is.
>
> Although it might be worth investigating how much it actually affects the performance.

It should not. From the CPU and compiler PoV it is the same.
If you look at cryptodev, it is using the following:

static struct rte_cryptodev rte_crypto_devices[RTE_CRYPTO_MAX_DEVS];
struct rte_cryptodev *rte_cryptodevs = rte_crypto_devices;

And accessing rte_cryptodevs[].

Also, this structure is not cache-aligned. We probably need to fix that.
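In DPDK the cache-alignment fix would normally use the __rte_cache_aligned marker; a plain C11 sketch of the effect (struct names invented) looks like this:

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Without alignment, instances can straddle cache lines. */
struct crypto_dev_unaligned { char data[100]; };

/* alignas() pads and aligns each instance to a cache-line boundary,
 * the same effect __rte_cache_aligned has on struct members in DPDK. */
struct crypto_dev_aligned { alignas(CACHE_LINE) char data[100]; };
```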


> So we need to do something else if we want to conserve memory and still allow a large rte_eth_devices[] array.
>
> Looking at struct rte_eth_dev, we could reduce its size as follows:
>
> 1. Change the two callback arrays post_rx/pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT] to pointers to callback arrays, which are allocated from the heap.
> With the default RTE_MAX_QUEUES_PER_PORT of 1024, these two arrays are the sinners that make the struct rte_eth_dev use so much memory. This modification would save 16 KB (minus 16 bytes for the pointers to the two arrays) per port.
> Furthermore, these callback arrays would only need to be allocated if the application is compiled with callbacks enabled (#define RTE_ETHDEV_RXTX_CALLBACKS). And they would only need to be sized to the actual number of queues for the port.
>
> The disadvantage is that this would add another level of indirection, although only for applications compiled with callbacks enabled.

I think we don't need one more indirection if it is all allocated from the
heap, as memory is not wasted if it is not touched by the CPU.

>
> 2. Remove reserved_64s[4] and reserved_ptrs[4]. This would save 64 bytes per port. Not much, but worth considering if we are changing the API/ABI anyway.
>
>
Anatoly Burakov June 16, 2021, 12:22 p.m. UTC | #25
On 16-Jun-21 10:42 AM, Jerin Jacob wrote:
> On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>> 14/06/2021 17:48, Morten Brørup:
>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>> It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
>>>
>>> Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?
>>
>> What is big enough?
>> Is 640KB enough for RAM? ;)
> 
> If I understand it correctly, Linux process allocates 640KB due to
> that fact currently
> struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> is from BSS.
> 
> If we make this from heap i.e use malloc() to allocate this memory
> then in my understanding Linux
> really won't allocate the real page for backend memory until unless,
> someone write/read to this memory.
> 
> i.e it will be free virtual memory using Linux memory management help.
> If so, we can keep large values for RTE_MAX_ETHPORTS
> without wasting any "real" memory even though the system has a few ports.
> 
> Thoughts?
> 

mmap works this way with anonymous memory, I'm not so sure about 
malloc()'ed memory. Plus, we can't base these decisions on what Linux 
does because we support other OSes. Do they do this as well?
Jerin Jacob June 16, 2021, 12:59 p.m. UTC | #26
On Wed, Jun 16, 2021 at 5:52 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 16-Jun-21 10:42 AM, Jerin Jacob wrote:
> > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>
> >> 14/06/2021 17:48, Morten Brørup:
> >>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> >>> It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
> >>>
> >>> Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?
> >>
> >> What is big enough?
> >> Is 640KB enough for RAM? ;)
> >
> > If I understand it correctly, Linux process allocates 640KB due to
> > that fact currently
> > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> > is from BSS.
> >
> > If we make this from heap i.e use malloc() to allocate this memory
> > then in my understanding Linux
> > really won't allocate the real page for backend memory until unless,
> > someone write/read to this memory.
> >
> > i.e it will be free virtual memory using Linux memory management help.
> > If so, we can keep large values for RTE_MAX_ETHPORTS
> > without wasting any "real" memory even though the system has a few ports.
> >
> > Thoughts?
> >
>
> mmap works this way with anonymous memory, i'm not so sure about
> malloc()'ed memory.

Looking at online documentation scattered over the internet, sbrk() is
based on demand paging, so I am not sure as well. I am also not sure how
we can write a test case to verify it. A huge allocation through malloc()
not failing may be due to demand paging, the Linux overcommit feature,
or a combination of both.

If mmap works in this way, we could have an EAL abstraction for such
memory allocation, like eal_malloc_demand_page() or so, provided Windows
also supports it.



> Plus, we can't base these decisions on what Linux
> does because we support other OS's. Do they do this as well?

+ Windows OS maintainers

>
> --
> Thanks,
> Anatoly
Bruce Richardson June 16, 2021, 1:02 p.m. UTC | #27
On Wed, Jun 16, 2021 at 01:27:17PM +0200, Morten Brørup wrote:
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Wednesday, 16 June 2021 11.42
> > 
> > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net>
> > wrote:
> > >
> > > 14/06/2021 17:48, Morten Brørup:
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > Monjalon
> > > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > something big enough to hold a sufficiently large array. And possibly
> > add an rte_max_ethports variable to indicate the number of populated
> > entries in the array, for use when iterating over the array.
> > > >
> > > > Can we come up with another example than RTE_MAX_ETHPORTS where
> > this library provides a better benefit?
> > >
> > > What is big enough?
> > > Is 640KB enough for RAM? ;)
> > 
> > If I understand it correctly, Linux process allocates 640KB due to
> > that fact currently
> > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> > is from BSS.
> 
> Correct.
> 
> > If we make this from heap i.e use malloc() to allocate this memory
> > then in my understanding Linux
> > really won't allocate the real page for backend memory until unless,
> > someone write/read to this memory.
> 
> If the array is allocated from the heap, its members will be accessed though a pointer to the array, e.g. in rte_eth_rx/tx_burst(). This might affect performance, which is probably why the array is allocated the way it is.
>

It depends on whether the array contains pointers to malloced elements or
the array itself is just a single malloced array of all the structures.
While I think the parray proposal referred to the former - which would have
an extra level of indirection - the switch we are discussing here is the
latter which should have no performance difference, since the method of
accessing the elements will be the same, only with the base address
pointing to a different area of memory.
 
> Although it might be worth investigating how much it actually affects the performance.
> 
> So we need to do something else if we want to conserve memory and still allow a large rte_eth_devices[] array.
> 
> Looking at struct rte_eth_dev, we could reduce its size as follows:
> 
> 1. Change the two callback arrays post_rx/pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT] to pointers to callback arrays, which are allocated from the heap.
> With the default RTE_MAX_QUEUES_PER_PORT of 1024, these two arrays are the sinners that make the struct rte_eth_dev use so much memory. This modification would save 16 KB (minus 16 bytes for the pointers to the two arrays) per port.
> Furthermore, these callback arrays would only need to be allocated if the application is compiled with callbacks enabled (#define RTE_ETHDEV_RXTX_CALLBACKS). And they would only need to be sized to the actual number of queues for the port.
> 
> The disadvantage is that this would add another level of indirection, although only for applications compiled with callbacks enabled.
> 
This seems reasonable to at least investigate.

> 2. Remove reserved_64s[4] and reserved_ptrs[4]. This would save 64 bytes per port. Not much, but worth considering if we are changing the API/ABI anyway.
> 
I strongly dislike reserved fields, so I would tend to favour removing these.
However, it does possibly reduce future compatibility if we do need to add
something to ethdev.

Another option is to split ethdev into fast-path and non-fastpath parts -
similar to Konstantin's suggestion of just having an array of the ops. We
can have an array of minimal structures with fastpath ops and queue
pointers, for example, with an ethdev-private pointer to the rest of the
struct elsewhere in memory. Since that second struct would be allocated
on-demand, the size of the ethdev array can be scaled with far smaller
footprint.

/Bruce
Morten Brørup June 16, 2021, 3:01 p.m. UTC | #28
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, 16 June 2021 15.03
> 
> On Wed, Jun 16, 2021 at 01:27:17PM +0200, Morten Brørup wrote:
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Wednesday, 16 June 2021 11.42
> > >
> > > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon
> <thomas@monjalon.net>
> > > wrote:
> > > >
> > > > 14/06/2021 17:48, Morten Brørup:
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > > Monjalon
> > > > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > > something big enough to hold a sufficiently large array. And
> possibly
> > > add an rte_max_ethports variable to indicate the number of
> populated
> > > entries in the array, for use when iterating over the array.
> > > > >
> > > > > Can we come up with another example than RTE_MAX_ETHPORTS where
> > > this library provides a better benefit?
> > > >
> > > > What is big enough?
> > > > Is 640KB enough for RAM? ;)
> > >
> > > If I understand it correctly, Linux process allocates 640KB due to
> > > that fact currently
> > > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and
> it
> > > is from BSS.
> >
> > Correct.
> >
> > > If we make this from heap i.e use malloc() to allocate this memory
> > > then in my understanding Linux
> > > really won't allocate the real page for backend memory until
> unless,
> > > someone write/read to this memory.
> >
> > If the array is allocated from the heap, its members will be accessed
> though a pointer to the array, e.g. in rte_eth_rx/tx_burst(). This
> might affect performance, which is probably why the array is allocated
> the way it is.
> >
> 
> It depends on whether the array contains pointers to malloced elements
> or
> the array itself is just a single malloced array of all the structures.
> While I think the parray proposal referred to the former - which would
> have
> an extra level of indirection - the switch we are discussing here is
> the
> latter which should have no performance difference, since the method of
> accessing the elements will be the same, only with the base address
> pointing to a different area of memory.

I was not talking about an array of pointers. And it is not the same:

int arr[27];
int * parr = arr;

// direct access
int dir(int i) { return arr[i]; }

// indirect access
int indir(int i) { return parr[i]; }

The direct access knows the address of arr, so it will compile to:
        movsx   rdi, edi
        mov     eax, DWORD PTR arr[0+rdi*4]
        ret

The indirect access needs to first read the memory location holding the pointer to the array, and then it can read the array member, so it will compile to:
        mov     rax, QWORD PTR parr[rip]
        movsx   rdi, edi
        mov     eax, DWORD PTR [rax+rdi*4]
        ret

> 
> > Although it might be worth investigating how much it actually affects
> the performance.
> >
> > So we need to do something else if we want to conserve memory and
> still allow a large rte_eth_devices[] array.
> >
> > Looking at struct rte_eth_dev, we could reduce its size as follows:
> >
> > 1. Change the two callback arrays
> post_rx/pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT] to pointers to
> callback arrays, which are allocated from the heap.
> > With the default RTE_MAX_QUEUES_PER_PORT of 1024, these two arrays
> are the sinners that make the struct rte_eth_dev use so much memory.
> This modification would save 16 KB (minus 16 bytes for the pointers to
> the two arrays) per port.
> > Furthermore, these callback arrays would only need to be allocated if
> the application is compiled with callbacks enabled (#define
> RTE_ETHDEV_RXTX_CALLBACKS). And they would only need to be sized to the
> actual number of queues for the port.
> >
> > The disadvantage is that this would add another level of indirection,
> although only for applications compiled with callbacks enabled.
> >
> This seems reasonable to at least investigate.
> 
> > 2. Remove reserved_64s[4] and reserved_ptrs[4]. This would save 64
> bytes per port. Not much, but worth considering if we are changing the
> API/ABI anyway.
> >
> I strongly dislike reserved fields to I would tend to favour these.
> However, it does possibly reduce future compatibility if we do need to
> add
> something to ethdev.

There should be an official policy about adding reserved fields for future compatibility. I'm against adding them, unless it can be argued that they are likely to match what is needed in the future; in the real world there is no way to know if they match future requirements.

> 
> Another option is to split ethdev into fast-path and non-fastpath parts
> -
> similar to Konstantin's suggestion of just having an array of the ops.
> We
> can have an array of minimal structures with fastpath ops and queue
> pointers, for example, with an ethdev-private pointer to the rest of
> the
> struct elsewhere in memory. Since that second struct would be allocated
> on-demand, the size of the ethdev array can be scaled with far smaller
> footprint.
> 
> /Bruce

The rte_eth_dev structures are really well organized now. E.g. the rx/tx function pointers and the pointer to the shared memory data of the driver are in the same cache line. We must be very careful if we change them.

Also, rte_ethdev.h and rte_ethdev_core.h are easy to read and understand.

-Morten
Bruce Richardson June 16, 2021, 5:40 p.m. UTC | #29
On Wed, Jun 16, 2021 at 05:01:46PM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Wednesday, 16 June 2021 15.03
> > 
> > On Wed, Jun 16, 2021 at 01:27:17PM +0200, Morten Brørup wrote:
> > > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > > Sent: Wednesday, 16 June 2021 11.42
> > > >
> > > > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon
> > <thomas@monjalon.net>
> > > > wrote:
> > > > >
> > > > > 14/06/2021 17:48, Morten Brørup:
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> > > > Monjalon
> > > > > > It would be much simpler to just increase RTE_MAX_ETHPORTS to
> > > > something big enough to hold a sufficiently large array. And
> > possibly
> > > > add an rte_max_ethports variable to indicate the number of
> > populated
> > > > entries in the array, for use when iterating over the array.
> > > > > >
> > > > > > Can we come up with another example than RTE_MAX_ETHPORTS where
> > > > this library provides a better benefit?
> > > > >
> > > > > What is big enough?
> > > > > Is 640KB enough for RAM? ;)
> > > >
> > > > If I understand it correctly, Linux process allocates 640KB due to
> > > > that fact currently
> > > > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and
> > it
> > > > is from BSS.
> > >
> > > Correct.
> > >
> > > > If we make this from heap i.e use malloc() to allocate this memory
> > > > then in my understanding Linux
> > > > really won't allocate the real page for backend memory until
> > unless,
> > > > someone write/read to this memory.
> > >
> > > If the array is allocated from the heap, its members will be accessed
> > though a pointer to the array, e.g. in rte_eth_rx/tx_burst(). This
> > might affect performance, which is probably why the array is allocated
> > the way it is.
> > >
> > 
> > It depends on whether the array contains pointers to malloced elements
> > or
> > the array itself is just a single malloced array of all the structures.
> > While I think the parray proposal referred to the former - which would
> > have
> > an extra level of indirection - the switch we are discussing here is
> > the
> > latter which should have no performance difference, since the method of
> > accessing the elements will be the same, only with the base address
> > pointing to a different area of memory.
> 
> I was not talking about an array of pointers. And it is not the same:
> 
> int arr[27];
> int * parr = arr;
> 
> // direct access
> int dir(int i) { return arr[i]; }
> 
> // indirect access
> int indir(int i) { return parr[i]; }
> 
> The direct access knows the address of arr, so it will compile to:
>         movsx   rdi, edi
>         mov     eax, DWORD PTR arr[0+rdi*4]
>         ret
> 
> The indirect access needs to first read the memory location holding the pointer to the array, and then it can read the array member, so it will compile to:
>         mov     rax, QWORD PTR parr[rip]
>         movsx   rdi, edi
>         mov     eax, DWORD PTR [rax+rdi*4]
>         ret
> 
Interesting, thanks. Definitely seems like a bit of perf testing will be
needed whatever way we go.
Dmitry Kozlyuk June 16, 2021, 10:58 p.m. UTC | #30
2021-06-16 18:29 (UTC+0530), Jerin Jacob:
> On Wed, Jun 16, 2021 at 5:52 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> >
> > On 16-Jun-21 10:42 AM, Jerin Jacob wrote:  
> > > On Tue, Jun 15, 2021 at 12:18 PM Thomas Monjalon <thomas@monjalon.net> wrote:  
> > >>
> > >> 14/06/2021 17:48, Morten Brørup:  
> > >>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon  
> > >>> It would be much simpler to just increase RTE_MAX_ETHPORTS to something big enough to hold a sufficiently large array. And possibly add an rte_max_ethports variable to indicate the number of populated entries in the array, for use when iterating over the array.
> > >>>
> > >>> Can we come up with another example than RTE_MAX_ETHPORTS where this library provides a better benefit?  
> > >>
> > >> What is big enough?
> > >> Is 640KB enough for RAM? ;)  
> > >
> > > If I understand it correctly, Linux process allocates 640KB due to
> > > that fact currently
> > > struct rte_eth_dev rte_eth_devices[RTE_MAX_ETHPORTS] is global and it
> > > is from BSS.
> > >
> > > If we make this from heap i.e use malloc() to allocate this memory
> > > then in my understanding Linux
> > > really won't allocate the real page for backend memory until unless,
> > > someone write/read to this memory.
> > >
> > > i.e it will be free virtual memory using Linux memory management help.
> > > If so, we can keep large values for RTE_MAX_ETHPORTS
> > > without wasting any "real" memory even though the system has a few ports.
> > >
> > > Thoughts?
> > >  
> >
> > mmap works this way with anonymous memory, i'm not so sure about
> > malloc()'ed memory.  
> 
> Looking at online documentation scatters over the internet, sbrk(), is
> based on demand paging.
> So I am not sure as well. I am also not sure how we can write some
> test case to verify it.
> Allocating a huge memory through malloc() not failing, not sure it is
> due to demand pagging
> or Linux over commit feature or combination of both,
> 
> if mmap works in this way, we could have EAL abstraction for such
> memory alloc like
> eal_malloc_demand_page() or so and if Windows also supports it.
> 
> 
> 
> > Plus, we can't base these decisions on what Linux
> > does because we support other OS's. Do they do this as well?  
> 
> + Windows OS maintainers

Yes, Windows uses demand paging.

Is it true that BSS is eagerly allocated (i.e. RAM consumed)? If not, and it
shouldn't be, malloc() isn't needed unless hugepages are required.
Ferruh Yigit June 17, 2021, 1:08 p.m. UTC | #31
On 6/14/2021 4:54 PM, Ananyev, Konstantin wrote:
> 
> 
>>>
>>> 14/06/2021 15:15, Bruce Richardson:
>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>
>>>>>> Performance of access in a fixed-size array is very good
>>>>>> because of cache locality
>>>>>> and because there is a single pointer to dereference.
>>>>>> The only drawback is the lack of flexibility:
>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>
>>>>>> An approach to this problem is to allocate the array at runtime,
>>>>>> being as efficient as static arrays, but still limited to a maximum.
>>>>>>
>>>>>> That's why the API rte_parray is introduced,
>>>>>> allowing to declare an array of pointer which can be resized
>>>>>> dynamically
>>>>>> and automatically at runtime while keeping a good read performance.
>>>>>>
>>>>>> After resize, the previous array is kept until the next resize
>>>>>> to avoid crashs during a read without any lock.
>>>>>>
>>>>>> Each element is a pointer to a memory chunk dynamically allocated.
>>>>>> This is not good for cache locality but it allows to keep the same
>>>>>> memory per element, no matter how the array is resized.
>>>>>> Cache locality could be improved with mempools.
>>>>>> The other drawback is having to dereference one more pointer
>>>>>> to read an element.
>>>>>>
>>>>>> There is not much locks, so the API is for internal use only.
>>>>>> This API may be used to completely remove some compilation-time
>>>>>> maximums.
>>>>>
>>>>> I get the purpose and overall intention of this library.
>>>>>
>>>>> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability.
>> It's
>>> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way for
>>> me to stop this progress, and I do not intend to oppose to this library. :-)
>>>>>
>>>>> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
>> examples
>>> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line between
>>> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and shrink
>> in
>>> the fast path.
>>>>>
>>>>> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
>>> application specific per-port runtime data, and this library could serve that purpose too.
>>>>>
>>>>
>>>> Thanks Thomas for starting this discussion and Morten for follow-up.
>>>>
>>>> My thinking is as follows, and I'm particularly keeping in mind the cases
>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>
>>>> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
>>>> we should switch away from the flat arrays or that we need fully dynamic
>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
>>>> house here, where we keep the ethdevs as an array, but one allocated/sized
>>>> at runtime rather than statically. This would allow us to have a
>>>> compile-time default value, but, for use cases that need it, allow use of a
>>>> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
>>>> malloc call for the array.  This max limit could then be provided to apps
>>>> too if they want to match any array sizes. [Alternatively those apps could
>>>> check the provided size and error out if the size has been increased beyond
>>>> what the app is designed to use?]. There would be no extra dereferences per
>>>> rx/tx burst call in this scenario so performance should be the same as
>>>> before (potentially better if array is in hugepage memory, I suppose).
>>>
>>> I think we need some benchmarks to decide what is the best tradeoff.
>>> I spent time on this implementation, but sorry I won't have time for benchmarks.
>>> Volunteers?
>>
>> I had only a quick look at your approach so far.
>> But from what I can read, in MT environment your suggestion will require
>> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
>> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
>> At least for rte_ethdevs[] and friends.
>> Konstantin
> 
> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> any regressions.
> That could still be flat array with max_size specified at application startup.
> 2. Hide rest of rte_ethdev struct in .c.
> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> (flat array, vector, hash, linked list) without ABI/API breakages.
> 
> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> Probably some macro can be provided to simplify it.
> 

We are already planning some tasks for ABI stability for v21.11; I think
splitting 'struct rte_eth_dev' can be part of that task, as it enables hiding
more internal data.

> The only significant complication I can foresee with implementing that approach -
> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> (to avoid extra indirection for callback implementation).
> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> 

What do you think about splitting the Rx/Tx callbacks into their own struct too?

Overall 'rte_eth_dev' can be split into three as:
1. rte_eth_dev
2. rte_eth_dev_burst
3. rte_eth_dev_cb

And we can hide 1 from applications even with the inline functions.
Ananyev, Konstantin June 17, 2021, 2:58 p.m. UTC | #32
> >>>
> >>> 14/06/2021 15:15, Bruce Richardson:
> >>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> >>>>>> Sent: Monday, 14 June 2021 12.59
> >>>>>>
> >>>>>> Performance of access in a fixed-size array is very good
> >>>>>> because of cache locality
> >>>>>> and because there is a single pointer to dereference.
> >>>>>> The only drawback is the lack of flexibility:
> >>>>>> the size of such an array cannot be increase at runtime.
> >>>>>>
> >>>>>> An approach to this problem is to allocate the array at runtime,
> >>>>>> being as efficient as static arrays, but still limited to a maximum.
> >>>>>>
> >>>>>> That's why the API rte_parray is introduced,
> >>>>>> allowing to declare an array of pointer which can be resized
> >>>>>> dynamically
> >>>>>> and automatically at runtime while keeping a good read performance.
> >>>>>>
> >>>>>> After resize, the previous array is kept until the next resize
> >>>>>> to avoid crashs during a read without any lock.
> >>>>>>
> >>>>>> Each element is a pointer to a memory chunk dynamically allocated.
> >>>>>> This is not good for cache locality but it allows to keep the same
> >>>>>> memory per element, no matter how the array is resized.
> >>>>>> Cache locality could be improved with mempools.
> >>>>>> The other drawback is having to dereference one more pointer
> >>>>>> to read an element.
> >>>>>>
> >>>>>> There is not much locks, so the API is for internal use only.
> >>>>>> This API may be used to completely remove some compilation-time
> >>>>>> maximums.
> >>>>>
> >>>>> I get the purpose and overall intention of this library.
> >>>>>
> >>>>> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability.
> >> It's
> >>> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way
> for
> >>> me to stop this progress, and I do not intend to oppose to this library. :-)
> >>>>>
> >>>>> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
> >> examples
> >>> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line
> between
> >>> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and
> shrink
> >> in
> >>> the fast path.
> >>>>>
> >>>>> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> >>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
> >>> application specific per-port runtime data, and this library could serve that purpose too.
> >>>>>
> >>>>
> >>>> Thanks Thomas for starting this discussion and Morten for follow-up.
> >>>>
> >>>> My thinking is as follows, and I'm particularly keeping in mind the cases
> >>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >>>>
> >>>> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> >>>> we should switch away from the flat arrays or that we need fully dynamic
> >>>> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> >>>> house here, where we keep the ethdevs as an array, but one allocated/sized
> >>>> at runtime rather than statically. This would allow us to have a
> >>>> compile-time default value, but, for use cases that need it, allow use of a
> >>>> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> >>>> malloc call for the array.  This max limit could then be provided to apps
> >>>> too if they want to match any array sizes. [Alternatively those apps could
> >>>> check the provided size and error out if the size has been increased beyond
> >>>> what the app is designed to use?]. There would be no extra dereferences per
> >>>> rx/tx burst call in this scenario so performance should be the same as
> >>>> before (potentially better if array is in hugepage memory, I suppose).
> >>>
> >>> I think we need some benchmarks to decide what is the best tradeoff.
> >>> I spent time on this implementation, but sorry I won't have time for benchmarks.
> >>> Volunteers?
> >>
> >> I had only a quick look at your approach so far.
> >> But from what I can read, in MT environment your suggestion will require
> >> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> >> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> >> At least for rte_ethdevs[] and friends.
> >> Konstantin
> >
> > One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> > 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> > We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> > any regressions.
> > That could still be flat array with max_size specified at application startup.
> > 2. Hide rest of rte_ethdev struct in .c.
> > That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> > (flat array, vector, hash, linked list) without ABI/API breakages.
> >
> > Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> > (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> > Probably some macro can be provided to simplify it.
> >
> 
> We are already planning some tasks for ABI stability for v21.11, I think
> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
> internal data.

Ok, sounds good.

> 
> > The only significant complication I can foresee with implementing that approach -
> > we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> > (to avoid extra indirection for callback implementation).
> > Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> >
> 
> What do you think split Rx/Tx callback into its own struct too?
> 
> Overall 'rte_eth_dev' can be split into three as:
> 1. rte_eth_dev
> 2. rte_eth_dev_burst
> 3. rte_eth_dev_cb
> 
> And we can hide 1 from applications even with the inline functions.

As discussed off-line, I think it is possible.
My absolute preference would be to have just 1/2 (with CB hidden),
but even with 1/2/3 in place I think it would be a good step forward.
It is probably worth starting with 1/2/3 first and then seeing how
difficult it would be to switch to 1/2.
Do you plan to start working on it?
 
Konstantin
Morten Brørup June 17, 2021, 3:17 p.m. UTC | #33
> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> Sent: Thursday, 17 June 2021 16.59
> 
> > >>>
> > >>> 14/06/2021 15:15, Bruce Richardson:
> > >>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> > >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> Monjalon
> > >>>>>> Sent: Monday, 14 June 2021 12.59
> > >>>>>>
> > >>>>>> Performance of access in a fixed-size array is very good
> > >>>>>> because of cache locality
> > >>>>>> and because there is a single pointer to dereference.
> > >>>>>> The only drawback is the lack of flexibility:
> > >>>>>> the size of such an array cannot be increase at runtime.
> > >>>>>>
> > >>>>>> An approach to this problem is to allocate the array at
> runtime,
> > >>>>>> being as efficient as static arrays, but still limited to a
> maximum.
> > >>>>>>
> > >>>>>> That's why the API rte_parray is introduced,
> > >>>>>> allowing to declare an array of pointer which can be resized
> > >>>>>> dynamically
> > >>>>>> and automatically at runtime while keeping a good read
> performance.
> > >>>>>>
> > >>>>>> After resize, the previous array is kept until the next resize
> > >>>>>> to avoid crashs during a read without any lock.
> > >>>>>>
> > >>>>>> Each element is a pointer to a memory chunk dynamically
> allocated.
> > >>>>>> This is not good for cache locality but it allows to keep the
> same
> > >>>>>> memory per element, no matter how the array is resized.
> > >>>>>> Cache locality could be improved with mempools.
> > >>>>>> The other drawback is having to dereference one more pointer
> > >>>>>> to read an element.
> > >>>>>>
> > >>>>>> There is not much locks, so the API is for internal use only.
> > >>>>>> This API may be used to completely remove some compilation-
> time
> > >>>>>> maximums.
> > >>>>>
> > >>>>> I get the purpose and overall intention of this library.
> > >>>>>
> > >>>>> I probably already mentioned that I prefer "embedded style
> programming" with fixed size arrays, rather than runtime
> configurability.
> > >> It's
> > >>> my personal opinion, and the DPDK Tech Board clearly prefers
> reducing the amount of compile time configurability, so there is no way
> > for
> > >>> me to stop this progress, and I do not intend to oppose to this
> library. :-)
> > >>>>>
> > >>>>> This library is likely to become a core library of DPDK, so I
> think it is important getting it right. Could you please mention a few
> > >> examples
> > >>> where you think this internal library should be used, and where
> it should not be used. Then it is easier to discuss if the border line
> > between
> > >>> control path and data plane is correct. E.g. this library is not
> intended to be used for dynamically sized packet queues that grow and
> > shrink
> > >> in
> > >>> the fast path.
> > >>>>>
> > >>>>> If the library becomes a core DPDK library, it should probably
> be public instead of internal. E.g. if the library is used to make
> > >>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
> applications might also need dynamically sized arrays for their
> > >>> application specific per-port runtime data, and this library
> could serve that purpose too.
> > >>>>>
> > >>>>
> > >>>> Thanks Thomas for starting this discussion and Morten for
> follow-up.
> > >>>>
> > >>>> My thinking is as follows, and I'm particularly keeping in mind
> the cases
> > >>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> > >>>>
> > >>>> While I dislike the hard-coded limits in DPDK, I'm also not
> convinced that
> > >>>> we should switch away from the flat arrays or that we need fully
> dynamic
> > >>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
> a half-way
> > >>>> house here, where we keep the ethdevs as an array, but one
> allocated/sized
> > >>>> at runtime rather than statically. This would allow us to have a
> > >>>> compile-time default value, but, for use cases that need it,
> allow use of a
> > >>>> flag e.g.  "max-ethdevs" to change the size of the parameter
> given to the
> > >>>> malloc call for the array.  This max limit could then be
> provided to apps
> > >>>> too if they want to match any array sizes. [Alternatively those
> apps could
> > >>>> check the provided size and error out if the size has been
> increased beyond
> > >>>> what the app is designed to use?]. There would be no extra
> dereferences per
> > >>>> rx/tx burst call in this scenario so performance should be the
> same as
> > >>>> before (potentially better if array is in hugepage memory, I
> suppose).
> > >>>
> > >>> I think we need some benchmarks to decide what is the best
> tradeoff.
> > >>> I spent time on this implementation, but sorry I won't have time
> for benchmarks.
> > >>> Volunteers?
> > >>
> > >> I had only a quick look at your approach so far.
> > >> But from what I can read, in MT environment your suggestion will
> require
> > >> extra synchronization for each read-write access to such parray
> element (lock, rcu, ...).
> > >> I think what Bruce suggests will be much ligther, easier to
> implement and less error prone.
> > >> At least for rte_ethdevs[] and friends.
> > >> Konstantin
> > >
> > > One more thought here - if we are talking about rte_ethdev[] in
> particular, I think  we can:
> > > 1. move public function pointers (rx_pkt_burst(), etc.) from
> rte_ethdev into a separate flat array.
> > > We can keep it public to still use inline functions for 'fast'
> calls rte_eth_rx_burst(), etc. to avoid
> > > any regressions.
> > > That could still be flat array with max_size specified at
> application startup.
> > > 2. Hide rest of rte_ethdev struct in .c.
> > > That will allow us to change the struct itself and the whole
> rte_ethdev[] table in a way we like
> > > (flat array, vector, hash, linked list) without ABI/API breakages.
> > >
> > > Yes, it would require all PMDs to change prototype for
> pkt_rx_burst() function
> > > (to accept port_id, queue_id instead of queue pointer), but the
> change is mechanical one.
> > > Probably some macro can be provided to simplify it.
> > >
> >
> > We are already planning some tasks for ABI stability for v21.11, I
> think
> > splitting 'struct rte_eth_dev' can be part of that task, it enables
> hiding more
> > internal data.
> 
> Ok, sounds good.
> 
> >
> > > The only significant complication I can foresee with implementing
> that approach -
> > > we'll need a an array of 'fast' function pointers per queue, not
> per device as we have now
> > > (to avoid extra indirection for callback implementation).
> > > Though as a bonus we'll have ability to use different RX/TX
> funcions per queue.
> > >
> >
> > What do you think split Rx/Tx callback into its own struct too?
> >
> > Overall 'rte_eth_dev' can be split into three as:
> > 1. rte_eth_dev
> > 2. rte_eth_dev_burst
> > 3. rte_eth_dev_cb
> >
> > And we can hide 1 from applications even with the inline functions.
> 
> As discussed off-line, I think:
> it is possible.
> My absolute preference would be to have just 1/2 (with CB hidden).
> But even with 1/2/3 in place I think it would be  a good step forward.
> Probably worth to start with 1/2/3 first and then see how difficult it
> would be to switch to 1/2.
> Do you plan to start working on it?
> 
> Konstantin

If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today - especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the modifications.

Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care!

-Morten
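For reference, the lockless-read resize scheme the quoted commit message describes (keep the previous array until the next resize, so a reader that loaded the old pointer can still dereference it) can be sketched roughly like this; this is a simplified, hypothetical illustration, not the proposed rte_parray code:

```c
#include <stdlib.h>
#include <string.h>

struct parray {
	void **cur;   /* readers load this pointer without a lock */
	void **old;   /* previous array, freed only at the next resize */
	size_t size;
};

static int
parray_resize(struct parray *pa, size_t new_size)
{
	void **next = calloc(new_size, sizeof(void *));

	if (next == NULL)
		return -1;
	if (pa->cur != NULL)
		memcpy(next, pa->cur, pa->size * sizeof(void *));
	/* Two resizes back: by now no reader can still hold this pointer
	 * (this is the scheme's assumption, not something it enforces). */
	free(pa->old);
	pa->old = pa->cur;  /* keep previous array for in-flight readers */
	pa->cur = next;     /* a real version would publish this atomically */
	pa->size = new_size;
	return 0;
}
```

Note the trade-off the thread is debating: element pointers survive resizes, but each read pays one extra dereference, and the safety argument relies on readers finishing before two resizes occur.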
Ferruh Yigit June 17, 2021, 3:44 p.m. UTC | #34
On 6/17/2021 3:58 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>
>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>
>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>> because of cache locality
>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>
>>>>>>>> An approach to this problem is to allocate the array at runtime,
>>>>>>>> being as efficient as static arrays, but still limited to a maximum.
>>>>>>>>
>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>> dynamically
>>>>>>>> and automatically at runtime while keeping a good read performance.
>>>>>>>>
>>>>>>>> After resize, the previous array is kept until the next resize
>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>
>>>>>>>> Each element is a pointer to a memory chunk dynamically allocated.
>>>>>>>> This is not good for cache locality but it allows to keep the same
>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>> to read an element.
>>>>>>>>
>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>> This API may be used to completely remove some compilation-time
>>>>>>>> maximums.
>>>>>>>
>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>
>>>>>>> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime configurability.
>>>> It's
>>>>> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no way
>> for
>>>>> me to stop this progress, and I do not intend to oppose to this library. :-)
>>>>>>>
>>>>>>> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
>>>> examples
>>>>> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line
>> between
>>>>> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and
>> shrink
>>>> in
>>>>> the fast path.
>>>>>>>
>>>>>>> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for their
>>>>> application specific per-port runtime data, and this library could serve that purpose too.
>>>>>>>
>>>>>>
>>>>>> Thanks Thomas for starting this discussion and Morten for follow-up.
>>>>>>
>>>>>> My thinking is as follows, and I'm particularly keeping in mind the cases
>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>
>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
>>>>>> we should switch away from the flat arrays or that we need fully dynamic
>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
>>>>>> house here, where we keep the ethdevs as an array, but one allocated/sized
>>>>>> at runtime rather than statically. This would allow us to have a
>>>>>> compile-time default value, but, for use cases that need it, allow use of a
>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
>>>>>> malloc call for the array.  This max limit could then be provided to apps
>>>>>> too if they want to match any array sizes. [Alternatively those apps could
>>>>>> check the provided size and error out if the size has been increased beyond
>>>>>> what the app is designed to use?]. There would be no extra dereferences per
>>>>>> rx/tx burst call in this scenario so performance should be the same as
>>>>>> before (potentially better if array is in hugepage memory, I suppose).
>>>>>
>>>>> I think we need some benchmarks to decide what is the best tradeoff.
>>>>> I spent time on this implementation, but sorry I won't have time for benchmarks.
>>>>> Volunteers?
>>>>
>>>> I had only a quick look at your approach so far.
>>>> But from what I can read, in MT environment your suggestion will require
>>>> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
>>>> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
>>>> At least for rte_ethdevs[] and friends.
>>>> Konstantin
>>>
>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
>>> any regressions.
>>> That could still be flat array with max_size specified at application startup.
>>> 2. Hide rest of rte_ethdev struct in .c.
>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>
>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
>>> Probably some macro can be provided to simplify it.
>>>
>>
>> We are already planning some tasks for ABI stability for v21.11, I think
>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
>> internal data.
> 
> Ok, sounds good.
> 
>>
>>> The only significant complication I can foresee with implementing that approach -
>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
>>> (to avoid extra indirection for callback implementation).
>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
>>>
>>
>> What do you think split Rx/Tx callback into its own struct too?
>>
>> Overall 'rte_eth_dev' can be split into three as:
>> 1. rte_eth_dev
>> 2. rte_eth_dev_burst
>> 3. rte_eth_dev_cb
>>
>> And we can hide 1 from applications even with the inline functions.
> 
> As discussed off-line, I think:
> it is possible.
> My absolute preference would be to have just 1/2 (with CB hidden).

How can we hide the callbacks, since they are used by the inline burst functions?

> But even with 1/2/3 in place I think it would be  a good step forward.
> Probably worth to start with 1/2/3 first and then see how difficult it
> would be to switch to 1/2.

What do you mean by "switch to 1/2"?

If we keep the inline functions and split the struct into the three structs above, we
can only hide 1; 2/3 will still be visible to apps because of the inline
functions. This way we will be able to hide more while keeping the same performance.
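The reason 2/3 must stay visible can be seen in a sketch of the inline wrapper itself: the per-port burst array is indexed directly from application code, so its layout becomes part of the ABI. The names below (rte_eth_burst[], dummy_rx) are hypothetical, for illustration only:

```c
#include <stdint.h>
#include <stddef.h>

struct rte_mbuf;

typedef uint16_t (*eth_rx_burst_t)(uint16_t, uint16_t,
				   struct rte_mbuf **, uint16_t);

/* Struct 2 from the proposed split; public because it is inlined below. */
struct rte_eth_dev_burst {
	eth_rx_burst_t rx_pkt_burst;
};

/* Dummy PMD rx function standing in for a driver's, returning 0 packets. */
static uint16_t
dummy_rx(uint16_t port_id, uint16_t queue_id,
	 struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	(void)port_id; (void)queue_id; (void)rx_pkts; (void)nb_pkts;
	return 0;
}

/* Flat array; in the proposal it would be sized at startup rather than
 * by a compile-time RTE_MAX_ETHPORTS. */
static struct rte_eth_dev_burst rte_eth_burst[4] = {
	{ dummy_rx }, { dummy_rx }, { dummy_rx }, { dummy_rx },
};

/* The inline wrapper is compiled into the application, so the array
 * and struct layout above are baked into the app's ABI. */
static inline uint16_t
rte_eth_rx_burst_sketch(uint16_t port_id, uint16_t queue_id,
			struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	return rte_eth_burst[port_id].rx_pkt_burst(port_id, queue_id,
						   rx_pkts, nb_pkts);
}
```

Anything reachable from this wrapper (struct 2, and the callbacks in struct 3 if the wrapper invokes them) cannot be hidden without giving up inlining.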

> Do you plan to start working on it?
> 

We are gathering the list of tasks for ABI stability; most probably they
will be worked on during v21.11. I can take this one.

> Konstantin
> 
> 
> 
>
Ferruh Yigit June 17, 2021, 4:12 p.m. UTC | #35
On 6/17/2021 4:17 PM, Morten Brørup wrote:
>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>> Sent: Thursday, 17 June 2021 16.59
>>
>>>>>>
>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
>> Monjalon
>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>
>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>> because of cache locality
>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>>
>>>>>>>>> An approach to this problem is to allocate the array at
>> runtime,
>>>>>>>>> being as efficient as static arrays, but still limited to a
>> maximum.
>>>>>>>>>
>>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>>> dynamically
>>>>>>>>> and automatically at runtime while keeping a good read
>> performance.
>>>>>>>>>
>>>>>>>>> After resize, the previous array is kept until the next resize
>>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>>
>>>>>>>>> Each element is a pointer to a memory chunk dynamically
>> allocated.
>>>>>>>>> This is not good for cache locality but it allows to keep the
>> same
>>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>> to read an element.
>>>>>>>>>
>>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>>> This API may be used to completely remove some compilation-
>> time
>>>>>>>>> maximums.
>>>>>>>>
>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>
>>>>>>>> I probably already mentioned that I prefer "embedded style
>> programming" with fixed size arrays, rather than runtime
>> configurability.
>>>>> It's
>>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
>> reducing the amount of compile time configurability, so there is no way
>>> for
>>>>>> me to stop this progress, and I do not intend to oppose to this
>> library. :-)
>>>>>>>>
>>>>>>>> This library is likely to become a core library of DPDK, so I
>> think it is important getting it right. Could you please mention a few
>>>>> examples
>>>>>> where you think this internal library should be used, and where
>> it should not be used. Then it is easier to discuss if the border line
>>> between
>>>>>> control path and data plane is correct. E.g. this library is not
>> intended to be used for dynamically sized packet queues that grow and
>>> shrink
>>>>> in
>>>>>> the fast path.
>>>>>>>>
>>>>>>>> If the library becomes a core DPDK library, it should probably
>> be public instead of internal. E.g. if the library is used to make
>>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
>> applications might also need dynamically sized arrays for their
>>>>>> application specific per-port runtime data, and this library
>> could serve that purpose too.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks Thomas for starting this discussion and Morten for
>> follow-up.
>>>>>>>
>>>>>>> My thinking is as follows, and I'm particularly keeping in mind
>> the cases
>>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>>
>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>> convinced that
>>>>>>> we should switch away from the flat arrays or that we need fully
>> dynamic
>>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
>> a half-way
>>>>>>> house here, where we keep the ethdevs as an array, but one
>> allocated/sized
>>>>>>> at runtime rather than statically. This would allow us to have a
>>>>>>> compile-time default value, but, for use cases that need it,
>> allow use of a
>>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
>> given to the
>>>>>>> malloc call for the array.  This max limit could then be
>> provided to apps
>>>>>>> too if they want to match any array sizes. [Alternatively those
>> apps could
>>>>>>> check the provided size and error out if the size has been
>> increased beyond
>>>>>>> what the app is designed to use?]. There would be no extra
>> dereferences per
>>>>>>> rx/tx burst call in this scenario so performance should be the
>> same as
>>>>>>> before (potentially better if array is in hugepage memory, I
>> suppose).
>>>>>>
>>>>>> I think we need some benchmarks to decide what is the best
>> tradeoff.
>>>>>> I spent time on this implementation, but sorry I won't have time
>> for benchmarks.
>>>>>> Volunteers?
>>>>>
>>>>> I had only a quick look at your approach so far.
>>>>> But from what I can read, in MT environment your suggestion will
>> require
>>>>> extra synchronization for each read-write access to such parray
>> element (lock, rcu, ...).
>>>>> I think what Bruce suggests will be much ligther, easier to
>> implement and less error prone.
>>>>> At least for rte_ethdevs[] and friends.
>>>>> Konstantin
>>>>
>>>> One more thought here - if we are talking about rte_ethdev[] in
>> particular, I think  we can:
>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
>> rte_ethdev into a separate flat array.
>>>> We can keep it public to still use inline functions for 'fast'
>> calls rte_eth_rx_burst(), etc. to avoid
>>>> any regressions.
>>>> That could still be flat array with max_size specified at
>> application startup.
>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>> That will allow us to change the struct itself and the whole
>> rte_ethdev[] table in a way we like
>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>
>>>> Yes, it would require all PMDs to change prototype for
>> pkt_rx_burst() function
>>>> (to accept port_id, queue_id instead of queue pointer), but the
>> change is mechanical one.
>>>> Probably some macro can be provided to simplify it.
>>>>
>>>
>>> We are already planning some tasks for ABI stability for v21.11, I
>> think
>>> splitting 'struct rte_eth_dev' can be part of that task, it enables
>> hiding more
>>> internal data.
>>
>> Ok, sounds good.
>>
>>>
>>>> The only significant complication I can foresee with implementing
>> that approach -
>>>> we'll need a an array of 'fast' function pointers per queue, not
>> per device as we have now
>>>> (to avoid extra indirection for callback implementation).
>>>> Though as a bonus we'll have ability to use different RX/TX
>> funcions per queue.
>>>>
>>>
>>> What do you think split Rx/Tx callback into its own struct too?
>>>
>>> Overall 'rte_eth_dev' can be split into three as:
>>> 1. rte_eth_dev
>>> 2. rte_eth_dev_burst
>>> 3. rte_eth_dev_cb
>>>
>>> And we can hide 1 from applications even with the inline functions.
>>
>> As discussed off-line, I think:
>> it is possible.
>> My absolute preference would be to have just 1/2 (with CB hidden).
>> But even with 1/2/3 in place I think it would be  a good step forward.
>> Probably worth to start with 1/2/3 first and then see how difficult it
>> would be to switch to 1/2.
>> Do you plan to start working on it?
>>
>> Konstantin
> 
> If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today - especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the modifications.
> 
> Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care!
> 

ack

If we split as above, I think the device-specific data, 'struct rte_eth_dev_data',
should be part of 1 (rte_eth_dev), which means the Rx/Tx inline functions would
access an additional cache line.

To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)? We have
enough space for it to fit into a single cache line; currently it is:
struct rte_eth_dev {
        eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
        eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
        eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
        eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
        eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
        eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
        eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
        struct rte_eth_dev_data *  data;                 /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */

'rx_descriptor_done' is deprecated and will be removed;
Morten Brørup June 17, 2021, 4:55 p.m. UTC | #36
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
> Sent: Thursday, 17 June 2021 18.13
> 
> On 6/17/2021 4:17 PM, Morten Brørup wrote:
> >> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> >> Sent: Thursday, 17 June 2021 16.59
> >>
> >>>>>>
> >>>>>> 14/06/2021 15:15, Bruce Richardson:
> >>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> >> Monjalon
> >>>>>>>>> Sent: Monday, 14 June 2021 12.59
> >>>>>>>>>
> >>>>>>>>> Performance of access in a fixed-size array is very good
> >>>>>>>>> because of cache locality
> >>>>>>>>> and because there is a single pointer to dereference.
> >>>>>>>>> The only drawback is the lack of flexibility:
> >>>>>>>>> the size of such an array cannot be increase at runtime.
> >>>>>>>>>
> >>>>>>>>> An approach to this problem is to allocate the array at
> >> runtime,
> >>>>>>>>> being as efficient as static arrays, but still limited to a
> >> maximum.
> >>>>>>>>>
> >>>>>>>>> That's why the API rte_parray is introduced,
> >>>>>>>>> allowing to declare an array of pointer which can be resized
> >>>>>>>>> dynamically
> >>>>>>>>> and automatically at runtime while keeping a good read
> >> performance.
> >>>>>>>>>
> >>>>>>>>> After resize, the previous array is kept until the next
> resize
> >>>>>>>>> to avoid crashs during a read without any lock.
> >>>>>>>>>
> >>>>>>>>> Each element is a pointer to a memory chunk dynamically
> >> allocated.
> >>>>>>>>> This is not good for cache locality but it allows to keep the
> >> same
> >>>>>>>>> memory per element, no matter how the array is resized.
> >>>>>>>>> Cache locality could be improved with mempools.
> >>>>>>>>> The other drawback is having to dereference one more pointer
> >>>>>>>>> to read an element.
> >>>>>>>>>
> >>>>>>>>> There is not much locks, so the API is for internal use only.
> >>>>>>>>> This API may be used to completely remove some compilation-
> >> time
> >>>>>>>>> maximums.
> >>>>>>>>
> >>>>>>>> I get the purpose and overall intention of this library.
> >>>>>>>>
> >>>>>>>> I probably already mentioned that I prefer "embedded style
> >> programming" with fixed size arrays, rather than runtime
> >> configurability.
> >>>>> It's
> >>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
> >> reducing the amount of compile time configurability, so there is no
> way
> >>> for
> >>>>>> me to stop this progress, and I do not intend to oppose to this
> >> library. :-)
> >>>>>>>>
> >>>>>>>> This library is likely to become a core library of DPDK, so I
> >> think it is important getting it right. Could you please mention a
> few
> >>>>> examples
> >>>>>> where you think this internal library should be used, and where
> >> it should not be used. Then it is easier to discuss if the border
> line
> >>> between
> >>>>>> control path and data plane is correct. E.g. this library is not
> >> intended to be used for dynamically sized packet queues that grow
> and
> >>> shrink
> >>>>> in
> >>>>>> the fast path.
> >>>>>>>>
> >>>>>>>> If the library becomes a core DPDK library, it should probably
> >> be public instead of internal. E.g. if the library is used to make
> >>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then
> some
> >> applications might also need dynamically sized arrays for their
> >>>>>> application specific per-port runtime data, and this library
> >> could serve that purpose too.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Thanks Thomas for starting this discussion and Morten for
> >> follow-up.
> >>>>>>>
> >>>>>>> My thinking is as follows, and I'm particularly keeping in mind
> >> the cases
> >>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >>>>>>>
> >>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
> >> convinced that
> >>>>>>> we should switch away from the flat arrays or that we need
> fully
> >> dynamic
> >>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
> >> a half-way
> >>>>>>> house here, where we keep the ethdevs as an array, but one
> >> allocated/sized
> >>>>>>> at runtime rather than statically. This would allow us to have
> a
> >>>>>>> compile-time default value, but, for use cases that need it,
> >> allow use of a
> >>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
> >> given to the
> >>>>>>> malloc call for the array.  This max limit could then be
> >> provided to apps
> >>>>>>> too if they want to match any array sizes. [Alternatively those
> >> apps could
> >>>>>>> check the provided size and error out if the size has been
> >> increased beyond
> >>>>>>> what the app is designed to use?]. There would be no extra
> >> dereferences per
> >>>>>>> rx/tx burst call in this scenario so performance should be the
> >> same as
> >>>>>>> before (potentially better if array is in hugepage memory, I
> >> suppose).
> >>>>>>
> >>>>>> I think we need some benchmarks to decide what is the best
> >> tradeoff.
> >>>>>> I spent time on this implementation, but sorry I won't have time
> >> for benchmarks.
> >>>>>> Volunteers?
> >>>>>
> >>>>> I had only a quick look at your approach so far.
> >>>>> But from what I can read, in MT environment your suggestion will
> >> require
> >>>>> extra synchronization for each read-write access to such parray
> >> element (lock, rcu, ...).
> >>>>> I think what Bruce suggests will be much ligther, easier to
> >> implement and less error prone.
> >>>>> At least for rte_ethdevs[] and friends.
> >>>>> Konstantin
> >>>>
> >>>> One more thought here - if we are talking about rte_ethdev[] in
> >> particular, I think  we can:
> >>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
> >> rte_ethdev into a separate flat array.
> >>>> We can keep it public to still use inline functions for 'fast'
> >> calls rte_eth_rx_burst(), etc. to avoid
> >>>> any regressions.
> >>>> That could still be flat array with max_size specified at
> >> application startup.
> >>>> 2. Hide rest of rte_ethdev struct in .c.
> >>>> That will allow us to change the struct itself and the whole
> >> rte_ethdev[] table in a way we like
> >>>> (flat array, vector, hash, linked list) without ABI/API breakages.
> >>>>
> >>>> Yes, it would require all PMDs to change prototype for
> >> pkt_rx_burst() function
> >>>> (to accept port_id, queue_id instead of queue pointer), but the
> >> change is mechanical one.
> >>>> Probably some macro can be provided to simplify it.
> >>>>
> >>>
> >>> We are already planning some tasks for ABI stability for v21.11, I
> >> think
> >>> splitting 'struct rte_eth_dev' can be part of that task, it enables
> >> hiding more
> >>> internal data.
> >>
> >> Ok, sounds good.
> >>
> >>>
> >>>> The only significant complication I can foresee with implementing
> >> that approach -
> >>>> we'll need a an array of 'fast' function pointers per queue, not
> >> per device as we have now
> >>>> (to avoid extra indirection for callback implementation).
> >>>> Though as a bonus we'll have ability to use different RX/TX
> >> funcions per queue.
> >>>>
> >>>
> >>> What do you think split Rx/Tx callback into its own struct too?
> >>>
> >>> Overall 'rte_eth_dev' can be split into three as:
> >>> 1. rte_eth_dev
> >>> 2. rte_eth_dev_burst
> >>> 3. rte_eth_dev_cb
> >>>
> >>> And we can hide 1 from applications even with the inline functions.
> >>
> >> As discussed off-line, I think:
> >> it is possible.
> >> My absolute preference would be to have just 1/2 (with CB hidden).
> >> But even with 1/2/3 in place I think it would be  a good step
> forward.
> >> Probably worth to start with 1/2/3 first and then see how difficult
> it
> >> would be to switch to 1/2.
> >> Do you plan to start working on it?
> >>
> >> Konstantin
> >
> > If you do proceed with this, be very careful. E.g. the inlined rx/tx
> burst functions should not touch more cache lines than they do today -
> especially if there are many active ports. The inlined rx/tx burst
> functions are very simple, so thorough code review (and possibly also
> of the resulting assembly) is appropriate. Simple performance testing
> might not detect if more cache lines are accessed than before the
> modifications.
> >
> > Don't get me wrong... I do consider this an improvement of the ethdev
> library; I'm only asking you to take extra care!
> >
> 
> ack
> 
> If we split as above, I think device specific data 'struct rte_eth_dev_data'
> should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access
> additional cache line.
> 
> To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)? We have
> enough space for it to fit into single cache line, currently it is:
> struct rte_eth_dev {
>         eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
>         eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
>         eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
>         eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
>         eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
>         eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
>         eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
>         struct rte_eth_dev_data *  data;                 /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
> 
> 'rx_descriptor_done' is deprecated and will be removed;

Makes sense.

Also consider moving 'data' to the top of the new struct, so there is room to add future functions below. (Without growing beyond one cache line, one new function can be added once 'rx_descriptor_done' has been removed.)
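The layout being discussed can be sketched as a plain C struct. This is only a sketch of the idea: the name `rte_eth_dev_burst`, the stand-in typedefs and the exact field set are assumptions for illustration, not the committed design.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in typedefs; the real ones live in DPDK's ethdev headers. */
typedef uint16_t (*eth_rx_burst_t)(void *rxq, void **pkts, uint16_t n);
typedef uint16_t (*eth_tx_burst_t)(void *txq, void **pkts, uint16_t n);
typedef uint16_t (*eth_tx_prep_t)(void *txq, void **pkts, uint16_t n);
typedef uint32_t (*eth_rx_queue_count_t)(void *dev, uint16_t qid);
typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);

struct rte_eth_dev_data; /* opaque here */

/*
 * Hypothetical fast-path struct: 'data' moved to the top, the deprecated
 * rx_descriptor_done dropped, leaving one spare 8-byte slot at the end.
 */
struct rte_eth_dev_burst {
	struct rte_eth_dev_data *data;                   /*  0  8 */
	eth_rx_burst_t rx_pkt_burst;                     /*  8  8 */
	eth_tx_burst_t tx_pkt_burst;                     /* 16  8 */
	eth_tx_prep_t tx_pkt_prepare;                    /* 24  8 */
	eth_rx_queue_count_t rx_queue_count;             /* 32  8 */
	eth_rx_descriptor_status_t rx_descriptor_status; /* 40  8 */
	eth_tx_descriptor_status_t tx_descriptor_status; /* 48  8 */
	/* bytes 56..63: room for one future function pointer */
};

/* On LP64 targets the whole struct stays within one 64-byte cache line. */
_Static_assert(sizeof(struct rte_eth_dev_burst) <= 64,
	       "fast-path struct must fit one cache line");
```

The `_Static_assert` makes the one-cache-line constraint a build-time check rather than a convention to remember.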
Ananyev, Konstantin June 17, 2021, 5:05 p.m. UTC | #37
> On 6/17/2021 4:17 PM, Morten Brørup wrote:
> >> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
> >> Sent: Thursday, 17 June 2021 16.59
> >>
> >>>>>>
> >>>>>> 14/06/2021 15:15, Bruce Richardson:
> >>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
> >> Monjalon
> >>>>>>>>> Sent: Monday, 14 June 2021 12.59
> >>>>>>>>>
> >>>>>>>>> Performance of access in a fixed-size array is very good
> >>>>>>>>> because of cache locality
> >>>>>>>>> and because there is a single pointer to dereference.
> >>>>>>>>> The only drawback is the lack of flexibility:
> >>>>>>>>> the size of such an array cannot be increase at runtime.
> >>>>>>>>>
> >>>>>>>>> An approach to this problem is to allocate the array at
> >> runtime,
> >>>>>>>>> being as efficient as static arrays, but still limited to a
> >> maximum.
> >>>>>>>>>
> >>>>>>>>> That's why the API rte_parray is introduced,
> >>>>>>>>> allowing to declare an array of pointer which can be resized
> >>>>>>>>> dynamically
> >>>>>>>>> and automatically at runtime while keeping a good read
> >> performance.
> >>>>>>>>>
> >>>>>>>>> After resize, the previous array is kept until the next resize
> >>>>>>>>> to avoid crashs during a read without any lock.
> >>>>>>>>>
> >>>>>>>>> Each element is a pointer to a memory chunk dynamically
> >> allocated.
> >>>>>>>>> This is not good for cache locality but it allows to keep the
> >> same
> >>>>>>>>> memory per element, no matter how the array is resized.
> >>>>>>>>> Cache locality could be improved with mempools.
> >>>>>>>>> The other drawback is having to dereference one more pointer
> >>>>>>>>> to read an element.
> >>>>>>>>>
> >>>>>>>>> There is not much locks, so the API is for internal use only.
> >>>>>>>>> This API may be used to completely remove some compilation-
> >> time
> >>>>>>>>> maximums.
> >>>>>>>>
> >>>>>>>> I get the purpose and overall intention of this library.
> >>>>>>>>
> >>>>>>>> I probably already mentioned that I prefer "embedded style
> >> programming" with fixed size arrays, rather than runtime
> >> configurability.
> >>>>> It's
> >>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
> >> reducing the amount of compile time configurability, so there is no way
> >>> for
> >>>>>> me to stop this progress, and I do not intend to oppose to this
> >> library. :-)
> >>>>>>>>
> >>>>>>>> This library is likely to become a core library of DPDK, so I
> >> think it is important getting it right. Could you please mention a few
> >>>>> examples
> >>>>>> where you think this internal library should be used, and where
> >> it should not be used. Then it is easier to discuss if the border line
> >>> between
> >>>>>> control path and data plane is correct. E.g. this library is not
> >> intended to be used for dynamically sized packet queues that grow and
> >>> shrink
> >>>>> in
> >>>>>> the fast path.
> >>>>>>>>
> >>>>>>>> If the library becomes a core DPDK library, it should probably
> >> be public instead of internal. E.g. if the library is used to make
> >>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
> >> applications might also need dynamically sized arrays for their
> >>>>>> application specific per-port runtime data, and this library
> >> could serve that purpose too.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Thanks Thomas for starting this discussion and Morten for
> >> follow-up.
> >>>>>>>
> >>>>>>> My thinking is as follows, and I'm particularly keeping in mind
> >> the cases
> >>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >>>>>>>
> >>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
> >> convinced that
> >>>>>>> we should switch away from the flat arrays or that we need fully
> >> dynamic
> >>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
> >> a half-way
> >>>>>>> house here, where we keep the ethdevs as an array, but one
> >> allocated/sized
> >>>>>>> at runtime rather than statically. This would allow us to have a
> >>>>>>> compile-time default value, but, for use cases that need it,
> >> allow use of a
> >>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
> >> given to the
> >>>>>>> malloc call for the array.  This max limit could then be
> >> provided to apps
> >>>>>>> too if they want to match any array sizes. [Alternatively those
> >> apps could
> >>>>>>> check the provided size and error out if the size has been
> >> increased beyond
> >>>>>>> what the app is designed to use?]. There would be no extra
> >> dereferences per
> >>>>>>> rx/tx burst call in this scenario so performance should be the
> >> same as
> >>>>>>> before (potentially better if array is in hugepage memory, I
> >> suppose).
> >>>>>>
> >>>>>> I think we need some benchmarks to decide what is the best
> >> tradeoff.
> >>>>>> I spent time on this implementation, but sorry I won't have time
> >> for benchmarks.
> >>>>>> Volunteers?
> >>>>>
> >>>>> I had only a quick look at your approach so far.
> >>>>> But from what I can read, in MT environment your suggestion will
> >> require
> >>>>> extra synchronization for each read-write access to such parray
> >> element (lock, rcu, ...).
> >>>>> I think what Bruce suggests will be much ligther, easier to
> >> implement and less error prone.
> >>>>> At least for rte_ethdevs[] and friends.
> >>>>> Konstantin
> >>>>
> >>>> One more thought here - if we are talking about rte_ethdev[] in
> >> particular, I think  we can:
> >>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
> >> rte_ethdev into a separate flat array.
> >>>> We can keep it public to still use inline functions for 'fast'
> >> calls rte_eth_rx_burst(), etc. to avoid
> >>>> any regressions.
> >>>> That could still be flat array with max_size specified at
> >> application startup.
> >>>> 2. Hide rest of rte_ethdev struct in .c.
> >>>> That will allow us to change the struct itself and the whole
> >> rte_ethdev[] table in a way we like
> >>>> (flat array, vector, hash, linked list) without ABI/API breakages.
> >>>>
> >>>> Yes, it would require all PMDs to change prototype for
> >> pkt_rx_burst() function
> >>>> (to accept port_id, queue_id instead of queue pointer), but the
> >> change is mechanical one.
> >>>> Probably some macro can be provided to simplify it.
> >>>>
> >>>
> >>> We are already planning some tasks for ABI stability for v21.11, I
> >> think
> >>> splitting 'struct rte_eth_dev' can be part of that task, it enables
> >> hiding more
> >>> internal data.
> >>
> >> Ok, sounds good.
> >>
> >>>
> >>>> The only significant complication I can foresee with implementing
> >> that approach -
> >>>> we'll need a an array of 'fast' function pointers per queue, not
> >> per device as we have now
> >>>> (to avoid extra indirection for callback implementation).
> >>>> Though as a bonus we'll have ability to use different RX/TX
> >> funcions per queue.
> >>>>
> >>>
> >>> What do you think split Rx/Tx callback into its own struct too?
> >>>
> >>> Overall 'rte_eth_dev' can be split into three as:
> >>> 1. rte_eth_dev
> >>> 2. rte_eth_dev_burst
> >>> 3. rte_eth_dev_cb
> >>>
> >>> And we can hide 1 from applications even with the inline functions.
> >>
> >> As discussed off-line, I think:
> >> it is possible.
> >> My absolute preference would be to have just 1/2 (with CB hidden).
> >> But even with 1/2/3 in place I think it would be  a good step forward.
> >> Probably worth to start with 1/2/3 first and then see how difficult it
> >> would be to switch to 1/2.
> >> Do you plan to start working on it?
> >>
> >> Konstantin
> >
> > If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today -
> especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the
> resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the
> modifications.
> >
> > Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care!
> >
> 
> ack
> 
> If we split as above, I think device specific data 'struct rte_eth_dev_data'
> should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access
> additional cache line.
> 
> To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)? 

I think it would be better to change rx_pkt_burst() to accept port_id and queue_id
instead of void *, i.e.:
typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);

Then the actual de-referencing of the private rxq data can happen inside the rx function itself.
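The proposed signature change can be sketched with a toy driver. Everything here apart from the `eth_rx_burst_t` shape itself (`toy_rxq`, `toy_queues`, `toy_rx_burst`) is made up for illustration and is not the real PMD layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rte_mbuf; /* opaque; only pointers are passed around */

/* Proposed shape: burst callback keyed by ids instead of a queue pointer. */
typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id,
				   struct rte_mbuf **rx_pkts, uint16_t nb_pkts);

/* Toy driver-private state: a per-port, per-queue table (made-up layout). */
struct toy_rxq { uint16_t available; };
static struct toy_rxq toy_queues[2][4] = { { { 3 } } };

/* The queue lookup moves inside the driver's rx function. */
static uint16_t
toy_rx_burst(uint16_t port_id, uint16_t queue_id,
	     struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	struct toy_rxq *rxq = &toy_queues[port_id][queue_id];
	uint16_t n = rxq->available < nb_pkts ? rxq->available : nb_pkts;

	(void)rx_pkts; /* a real driver would fill rx_pkts[0..n-1] */
	rxq->available -= n;
	return n;
}

/* The driver function matches the proposed typedef exactly. */
static const eth_rx_burst_t burst = toy_rx_burst;
```

The PMD conversion would be mechanical, as noted earlier in the thread: each driver replaces its `void *rxq` parameter with the two ids and performs the table lookup itself.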

> We have
> enough space for it to fit into single cache line, currently it is:
> struct rte_eth_dev {
>         eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
>         eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
>         eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
>         eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
>         eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
>         eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
>         eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
>         struct rte_eth_dev_data *  data;                 /*    56     8 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
> 
> 'rx_descriptor_done' is deprecated and will be removed;
Morten Brørup June 18, 2021, 9:14 a.m. UTC | #38
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Thursday, 17 June 2021 19.06
> 
> I think it would be better to change rx_pkt_burst() to accept port_id
> and queue_id, instead of void *.

Current:

typedef uint16_t (*eth_rx_burst_t)(void *rxq,
				   struct rte_mbuf **rx_pkts,
				   uint16_t nb_pkts);

> I.E:
> typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id,
>     uint16_t queue_id,
>     struct rte_mbuf **rx_pkts,
>     uint16_t nb_pkts);
> 
> And we can do actual de-referencing of private rxq data inside the
> actual rx function.

Good idea, if it can be done without a performance cost.

The x64 calling conventions pass the first few parameters in registers (4 on Windows, 6 integer parameters in the System V ABI), so the added parameter should not be a problem.


Another thing:

I just noticed that struct rte_eth_dev_data has "void **rx_queues;" (and similarly for tx_queues).

That should be "void *rx_queues[RTE_MAX_QUEUES_PER_PORT];", like in all the other ethdev structures.

The same structure even has "uint8_t rx_queue_state[RTE_MAX_QUEUES_PER_PORT];", so it's following two different conventions.
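The two conventions being contrasted, reduced to a toy layout. `TOY_MAX_QUEUES` and both struct names are made up; the real structure is `struct rte_eth_dev_data` with `RTE_MAX_QUEUES_PER_PORT`:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_QUEUES 8 /* stand-in for RTE_MAX_QUEUES_PER_PORT */

/* Current convention: the queue-pointer table is a separate allocation. */
struct data_indirect {
	void **rx_queues;                       /* one pointer + heap block */
	uint8_t rx_queue_state[TOY_MAX_QUEUES]; /* embedded fixed array */
};

/* The suggested convention: embed the pointer array like the state array. */
struct data_embedded {
	void *rx_queues[TOY_MAX_QUEUES];
	uint8_t rx_queue_state[TOY_MAX_QUEUES];
};
```

The embedded form costs `TOY_MAX_QUEUES * sizeof(void *)` bytes per port up front, but removes one allocation and one level of indirection, and makes the structure self-consistent with `rx_queue_state`.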
Ferruh Yigit June 18, 2021, 10:21 a.m. UTC | #39
On 6/17/2021 5:55 PM, Morten Brørup wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
>> Sent: Thursday, 17 June 2021 18.13
>>
>> On 6/17/2021 4:17 PM, Morten Brørup wrote:
>>>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>>>> Sent: Thursday, 17 June 2021 16.59
>>>>
>>>>>>>>
>>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
>>>> Monjalon
>>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>>
>>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>>> because of cache locality
>>>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>>>>
>>>>>>>>>>> An approach to this problem is to allocate the array at
>>>> runtime,
>>>>>>>>>>> being as efficient as static arrays, but still limited to a
>>>> maximum.
>>>>>>>>>>>
>>>>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>>>>> dynamically
>>>>>>>>>>> and automatically at runtime while keeping a good read
>>>> performance.
>>>>>>>>>>>
>>>>>>>>>>> After resize, the previous array is kept until the next
>> resize
>>>>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>>>>
>>>>>>>>>>> Each element is a pointer to a memory chunk dynamically
>>>> allocated.
>>>>>>>>>>> This is not good for cache locality but it allows to keep the
>>>> same
>>>>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>>> to read an element.
>>>>>>>>>>>
>>>>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>>>>> This API may be used to completely remove some compilation-
>>>> time
>>>>>>>>>>> maximums.
>>>>>>>>>>
>>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>>
>>>>>>>>>> I probably already mentioned that I prefer "embedded style
>>>> programming" with fixed size arrays, rather than runtime
>>>> configurability.
>>>>>>> It's
>>>>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
>>>> reducing the amount of compile time configurability, so there is no
>> way
>>>>> for
>>>>>>>> me to stop this progress, and I do not intend to oppose to this
>>>> library. :-)
>>>>>>>>>>
>>>>>>>>>> This library is likely to become a core library of DPDK, so I
>>>> think it is important getting it right. Could you please mention a
>> few
>>>>>>> examples
>>>>>>>> where you think this internal library should be used, and where
>>>> it should not be used. Then it is easier to discuss if the border
>> line
>>>>> between
>>>>>>>> control path and data plane is correct. E.g. this library is not
>>>> intended to be used for dynamically sized packet queues that grow
>> and
>>>>> shrink
>>>>>>> in
>>>>>>>> the fast path.
>>>>>>>>>>
>>>>>>>>>> If the library becomes a core DPDK library, it should probably
>>>> be public instead of internal. E.g. if the library is used to make
>>>>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then
>> some
>>>> applications might also need dynamically sized arrays for their
>>>>>>>> application specific per-port runtime data, and this library
>>>> could serve that purpose too.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks Thomas for starting this discussion and Morten for
>>>> follow-up.
>>>>>>>>>
>>>>>>>>> My thinking is as follows, and I'm particularly keeping in mind
>>>> the cases
>>>>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>>>>
>>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>>>> convinced that
>>>>>>>>> we should switch away from the flat arrays or that we need
>> fully
>>>> dynamic
>>>>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
>>>> a half-way
>>>>>>>>> house here, where we keep the ethdevs as an array, but one
>>>> allocated/sized
>>>>>>>>> at runtime rather than statically. This would allow us to have
>> a
>>>>>>>>> compile-time default value, but, for use cases that need it,
>>>> allow use of a
>>>>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
>>>> given to the
>>>>>>>>> malloc call for the array.  This max limit could then be
>>>> provided to apps
>>>>>>>>> too if they want to match any array sizes. [Alternatively those
>>>> apps could
>>>>>>>>> check the provided size and error out if the size has been
>>>> increased beyond
>>>>>>>>> what the app is designed to use?]. There would be no extra
>>>> dereferences per
>>>>>>>>> rx/tx burst call in this scenario so performance should be the
>>>> same as
>>>>>>>>> before (potentially better if array is in hugepage memory, I
>>>> suppose).
>>>>>>>>
>>>>>>>> I think we need some benchmarks to decide what is the best
>>>> tradeoff.
>>>>>>>> I spent time on this implementation, but sorry I won't have time
>>>> for benchmarks.
>>>>>>>> Volunteers?
>>>>>>>
>>>>>>> I had only a quick look at your approach so far.
>>>>>>> But from what I can read, in MT environment your suggestion will
>>>> require
>>>>>>> extra synchronization for each read-write access to such parray
>>>> element (lock, rcu, ...).
>>>>>>> I think what Bruce suggests will be much ligther, easier to
>>>> implement and less error prone.
>>>>>>> At least for rte_ethdevs[] and friends.
>>>>>>> Konstantin
>>>>>>
>>>>>> One more thought here - if we are talking about rte_ethdev[] in
>>>> particular, I think  we can:
>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
>>>> rte_ethdev into a separate flat array.
>>>>>> We can keep it public to still use inline functions for 'fast'
>>>> calls rte_eth_rx_burst(), etc. to avoid
>>>>>> any regressions.
>>>>>> That could still be flat array with max_size specified at
>>>> application startup.
>>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>>> That will allow us to change the struct itself and the whole
>>>> rte_ethdev[] table in a way we like
>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>>
>>>>>> Yes, it would require all PMDs to change prototype for
>>>> pkt_rx_burst() function
>>>>>> (to accept port_id, queue_id instead of queue pointer), but the
>>>> change is mechanical one.
>>>>>> Probably some macro can be provided to simplify it.
>>>>>>
>>>>>
>>>>> We are already planning some tasks for ABI stability for v21.11, I
>>>> think
>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables
>>>> hiding more
>>>>> internal data.
>>>>
>>>> Ok, sounds good.
>>>>
>>>>>
>>>>>> The only significant complication I can foresee with implementing
>>>> that approach -
>>>>>> we'll need a an array of 'fast' function pointers per queue, not
>>>> per device as we have now
>>>>>> (to avoid extra indirection for callback implementation).
>>>>>> Though as a bonus we'll have ability to use different RX/TX
>>>> funcions per queue.
>>>>>>
>>>>>
>>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>>
>>>>> Overall 'rte_eth_dev' can be split into three as:
>>>>> 1. rte_eth_dev
>>>>> 2. rte_eth_dev_burst
>>>>> 3. rte_eth_dev_cb
>>>>>
>>>>> And we can hide 1 from applications even with the inline functions.
>>>>
>>>> As discussed off-line, I think:
>>>> it is possible.
>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>> But even with 1/2/3 in place I think it would be  a good step
>> forward.
>>>> Probably worth to start with 1/2/3 first and then see how difficult
>> it
>>>> would be to switch to 1/2.
>>>> Do you plan to start working on it?
>>>>
>>>> Konstantin
>>>
>>> If you do proceed with this, be very careful. E.g. the inlined rx/tx
>> burst functions should not touch more cache lines than they do today -
>> especially if there are many active ports. The inlined rx/tx burst
>> functions are very simple, so thorough code review (and possibly also
>> of the resulting assembly) is appropriate. Simple performance testing
>> might not detect if more cache lines are accessed than before the
>> modifications.
>>>
>>> Don't get me wrong... I do consider this an improvement of the ethdev
>> library; I'm only asking you to take extra care!
>>>
>>
>> ack
>>
>> If we split as above, I think device specific data 'struct rte_eth_dev_data'
>> should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access
>> additional cache line.
>>
>> To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)? We have
>> enough space for it to fit into single cache line, currently it is:
>> struct rte_eth_dev {
>>         eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
>>         eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
>>         eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
>>         eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
>>         eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
>>         eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
>>         eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
>>         struct rte_eth_dev_data *  data;                 /*    56     8 */
>>         /* --- cacheline 1 boundary (64 bytes) --- */
>>
>> 'rx_descriptor_done' is deprecated and will be removed;
> 
> Makes sense.
> 
> Also consider moving 'data' to the top of the new struct, so there is room to add future functions below. (Without growing to more than the one cache line size, one new function can be added when 'rx_descriptor_done' has been removed.)
> 

+1
Ferruh Yigit June 18, 2021, 10:28 a.m. UTC | #40
On 6/17/2021 6:05 PM, Ananyev, Konstantin wrote:
> 
> 
>> On 6/17/2021 4:17 PM, Morten Brørup wrote:
>>>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>>>> Sent: Thursday, 17 June 2021 16.59
>>>>
>>>>>>>>
>>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas
>>>> Monjalon
>>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>>
>>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>>> because of cache locality
>>>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>>>>
>>>>>>>>>>> An approach to this problem is to allocate the array at
>>>> runtime,
>>>>>>>>>>> being as efficient as static arrays, but still limited to a
>>>> maximum.
>>>>>>>>>>>
>>>>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>>>>> dynamically
>>>>>>>>>>> and automatically at runtime while keeping a good read
>>>> performance.
>>>>>>>>>>>
>>>>>>>>>>> After resize, the previous array is kept until the next resize
>>>>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>>>>
>>>>>>>>>>> Each element is a pointer to a memory chunk dynamically
>>>> allocated.
>>>>>>>>>>> This is not good for cache locality but it allows to keep the
>>>> same
>>>>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>>> to read an element.
>>>>>>>>>>>
>>>>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>>>>> This API may be used to completely remove some compilation-
>>>> time
>>>>>>>>>>> maximums.
>>>>>>>>>>
>>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>>
>>>>>>>>>> I probably already mentioned that I prefer "embedded style
>>>> programming" with fixed size arrays, rather than runtime
>>>> configurability.
>>>>>>> It's
>>>>>>>> my personal opinion, and the DPDK Tech Board clearly prefers
>>>> reducing the amount of compile time configurability, so there is no way
>>>>> for
>>>>>>>> me to stop this progress, and I do not intend to oppose to this
>>>> library. :-)
>>>>>>>>>>
>>>>>>>>>> This library is likely to become a core library of DPDK, so I
>>>> think it is important getting it right. Could you please mention a few
>>>>>>> examples
>>>>>>>> where you think this internal library should be used, and where
>>>> it should not be used. Then it is easier to discuss if the border line
>>>>> between
>>>>>>>> control path and data plane is correct. E.g. this library is not
>>>> intended to be used for dynamically sized packet queues that grow and
>>>>> shrink
>>>>>>> in
>>>>>>>> the fast path.
>>>>>>>>>>
>>>>>>>>>> If the library becomes a core DPDK library, it should probably
>>>> be public instead of internal. E.g. if the library is used to make
>>>>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some
>>>> applications might also need dynamically sized arrays for their
>>>>>>>> application specific per-port runtime data, and this library
>>>> could serve that purpose too.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks Thomas for starting this discussion and Morten for
>>>> follow-up.
>>>>>>>>>
>>>>>>>>> My thinking is as follows, and I'm particularly keeping in mind
>>>> the cases
>>>>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>>>>
>>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>>>> convinced that
>>>>>>>>> we should switch away from the flat arrays or that we need fully
>>>> dynamic
>>>>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest
>>>> a half-way
>>>>>>>>> house here, where we keep the ethdevs as an array, but one
>>>> allocated/sized
>>>>>>>>> at runtime rather than statically. This would allow us to have a
>>>>>>>>> compile-time default value, but, for use cases that need it,
>>>> allow use of a
>>>>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter
>>>> given to the
>>>>>>>>> malloc call for the array.  This max limit could then be
>>>> provided to apps
>>>>>>>>> too if they want to match any array sizes. [Alternatively those
>>>> apps could
>>>>>>>>> check the provided size and error out if the size has been
>>>> increased beyond
>>>>>>>>> what the app is designed to use?]. There would be no extra
>>>> dereferences per
>>>>>>>>> rx/tx burst call in this scenario so performance should be the
>>>> same as
>>>>>>>>> before (potentially better if array is in hugepage memory, I
>>>> suppose).
>>>>>>>>
>>>>>>>> I think we need some benchmarks to decide what is the best
>>>> tradeoff.
>>>>>>>> I spent time on this implementation, but sorry I won't have time
>>>> for benchmarks.
>>>>>>>> Volunteers?
>>>>>>>
>>>>>>> I had only a quick look at your approach so far.
>>>>>>> But from what I can read, in MT environment your suggestion will
>>>> require
>>>>>>> extra synchronization for each read-write access to such parray
>>>> element (lock, rcu, ...).
>>>>>>> I think what Bruce suggests will be much ligther, easier to
>>>> implement and less error prone.
>>>>>>> At least for rte_ethdevs[] and friends.
>>>>>>> Konstantin
>>>>>>
>>>>>> One more thought here - if we are talking about rte_ethdev[] in
>>>> particular, I think  we can:
>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from
>>>> rte_ethdev into a separate flat array.
>>>>>> We can keep it public to still use inline functions for 'fast'
>>>> calls rte_eth_rx_burst(), etc. to avoid
>>>>>> any regressions.
>>>>>> That could still be flat array with max_size specified at
>>>> application startup.
>>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>>> That will allow us to change the struct itself and the whole
>>>> rte_ethdev[] table in a way we like
>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>>
>>>>>> Yes, it would require all PMDs to change prototype for
>>>> pkt_rx_burst() function
>>>>>> (to accept port_id, queue_id instead of queue pointer), but the
>>>> change is mechanical one.
>>>>>> Probably some macro can be provided to simplify it.
>>>>>>
>>>>>
>>>>> We are already planning some tasks for ABI stability for v21.11, I
>>>> think
>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables
>>>> hiding more
>>>>> internal data.
>>>>
>>>> Ok, sounds good.
>>>>
>>>>>
>>>>>> The only significant complication I can foresee with implementing
>>>> that approach -
>>>>>> we'll need a an array of 'fast' function pointers per queue, not
>>>> per device as we have now
>>>>>> (to avoid extra indirection for callback implementation).
>>>>>> Though as a bonus we'll have ability to use different RX/TX
>>>> funcions per queue.
>>>>>>
>>>>>
>>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>>
>>>>> Overall 'rte_eth_dev' can be split into three as:
>>>>> 1. rte_eth_dev
>>>>> 2. rte_eth_dev_burst
>>>>> 3. rte_eth_dev_cb
>>>>>
>>>>> And we can hide 1 from applications even with the inline functions.
>>>>
>>>> As discussed off-line, I think:
>>>> it is possible.
>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>> But even with 1/2/3 in place I think it would be  a good step forward.
>>>> Probably worth to start with 1/2/3 first and then see how difficult it
>>>> would be to switch to 1/2.
>>>> Do you plan to start working on it?
>>>>
>>>> Konstantin
>>>
>>> If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today -
>> especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the
>> resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the
>> modifications.
>>>
>>> Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care!
>>>
>>
>> ack
>>
>> If we split as above, I think device specific data 'struct rte_eth_dev_data'
>> should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access
>> additional cache line.
>>
>> To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)?
> 
> I think it would be better to change rx_pkt_burst() to accept port_id and queue_id,
> instead of void *.
> I.E:
> typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts,  uint16_t nb_pkts);
> 

We may not need to add 'port_id', since in the callback you are already in the
driver scope and all required device-specific variables are already accessible
via the queue struct.

> And we can do actual de-referencing of private rxq data inside the actual rx function.
> 

Yes, we can replace the queue struct with 'queue_id' and do the dereferencing in
the Rx function instead of the burst API, but what is the benefit of that?
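The point that the queue struct can already reach device scope can be sketched like this; all `toy_*` names are hypothetical and do not reflect any real PMD's layout:

```c
#include <assert.h>
#include <stdint.h>

/* Made-up device data; a real PMD would use struct rte_eth_dev_data. */
struct toy_dev_data {
	uint16_t port_id;
	uint64_t rx_packets;
};

/*
 * Driver-private queue struct carrying a back-pointer to the device data,
 * so a burst function keyed only by the queue can still update any
 * device-scope state without a separate port_id parameter.
 */
struct toy_rxq {
	struct toy_dev_data *dev_data;
	uint16_t queue_id;
	uint16_t available;
};

static struct toy_dev_data toy_data = { .port_id = 0 };
static struct toy_rxq toy_queue = {
	.dev_data = &toy_data, .queue_id = 0, .available = 5,
};

static uint16_t
toy_rx_burst(struct toy_rxq *rxq, uint16_t nb_pkts)
{
	uint16_t n = rxq->available < nb_pkts ? rxq->available : nb_pkts;

	rxq->available -= n;
	rxq->dev_data->rx_packets += n; /* device scope via the back-pointer */
	return n;
}
```

With this back-pointer in place, passing ids instead of the queue pointer only adds a table lookup on the driver side without making anything newly reachable.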

>> We have
>> enough space for it to fit into single cache line, currently it is:
>> struct rte_eth_dev {
>>         eth_rx_burst_t             rx_pkt_burst;         /*     0     8 */
>>         eth_tx_burst_t             tx_pkt_burst;         /*     8     8 */
>>         eth_tx_prep_t              tx_pkt_prepare;       /*    16     8 */
>>         eth_rx_queue_count_t       rx_queue_count;       /*    24     8 */
>>         eth_rx_descriptor_done_t   rx_descriptor_done;   /*    32     8 */
>>         eth_rx_descriptor_status_t rx_descriptor_status; /*    40     8 */
>>         eth_tx_descriptor_status_t tx_descriptor_status; /*    48     8 */
>>         struct rte_eth_dev_data *  data;                 /*    56     8 */
>>         /* --- cacheline 1 boundary (64 bytes) --- */
>>
>> 'rx_descriptor_done' is deprecated and will be removed;
Ananyev, Konstantin June 18, 2021, 10:41 a.m. UTC | #41
> >>>>>
> >>>>> 14/06/2021 15:15, Bruce Richardson:
> >>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
> >>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> >>>>>>>> Sent: Monday, 14 June 2021 12.59
> >>>>>>>>
> >>>>>>>> Performance of access in a fixed-size array is very good
> >>>>>>>> because of cache locality
> >>>>>>>> and because there is a single pointer to dereference.
> >>>>>>>> The only drawback is the lack of flexibility:
> >>>>>>>> the size of such an array cannot be increase at runtime.
> >>>>>>>>
> >>>>>>>> An approach to this problem is to allocate the array at runtime,
> >>>>>>>> being as efficient as static arrays, but still limited to a maximum.
> >>>>>>>>
> >>>>>>>> That's why the API rte_parray is introduced,
> >>>>>>>> allowing to declare an array of pointer which can be resized
> >>>>>>>> dynamically
> >>>>>>>> and automatically at runtime while keeping a good read performance.
> >>>>>>>>
> >>>>>>>> After resize, the previous array is kept until the next resize
> >>>>>>>> to avoid crashs during a read without any lock.
> >>>>>>>>
> >>>>>>>> Each element is a pointer to a memory chunk dynamically allocated.
> >>>>>>>> This is not good for cache locality but it allows to keep the same
> >>>>>>>> memory per element, no matter how the array is resized.
> >>>>>>>> Cache locality could be improved with mempools.
> >>>>>>>> The other drawback is having to dereference one more pointer
> >>>>>>>> to read an element.
> >>>>>>>>
> >>>>>>>> There is not much locks, so the API is for internal use only.
> >>>>>>>> This API may be used to completely remove some compilation-time
> >>>>>>>> maximums.
> >>>>>>>
> >>>>>>> I get the purpose and overall intention of this library.
> >>>>>>>
> >>>>>>> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime
> configurability.
> >>>> It's
> >>>>> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no
> way
> >> for
> >>>>> me to stop this progress, and I do not intend to oppose to this library. :-)
> >>>>>>>
> >>>>>>> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
> >>>> examples
> >>>>> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line
> >> between
> >>>>> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and
> >> shrink
> >>>> in
> >>>>> the fast path.
> >>>>>>>
> >>>>>>> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
> >>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for
> their
> >>>>> application specific per-port runtime data, and this library could serve that purpose too.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks Thomas for starting this discussion and Morten for follow-up.
> >>>>>>
> >>>>>> My thinking is as follows, and I'm particularly keeping in mind the cases
> >>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
> >>>>>>
> >>>>>> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
> >>>>>> we should switch away from the flat arrays or that we need fully dynamic
> >>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
> >>>>>> house here, where we keep the ethdevs as an array, but one allocated/sized
> >>>>>> at runtime rather than statically. This would allow us to have a
> >>>>>> compile-time default value, but, for use cases that need it, allow use of a
> >>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
> >>>>>> malloc call for the array.  This max limit could then be provided to apps
> >>>>>> too if they want to match any array sizes. [Alternatively those apps could
> >>>>>> check the provided size and error out if the size has been increased beyond
> >>>>>> what the app is designed to use?]. There would be no extra dereferences per
> >>>>>> rx/tx burst call in this scenario so performance should be the same as
> >>>>>> before (potentially better if array is in hugepage memory, I suppose).
> >>>>>
> >>>>> I think we need some benchmarks to decide what is the best tradeoff.
> >>>>> I spent time on this implementation, but sorry I won't have time for benchmarks.
> >>>>> Volunteers?
> >>>>
> >>>> I had only a quick look at your approach so far.
> >>>> But from what I can read, in MT environment your suggestion will require
> >>>> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
> >>>> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
> >>>> At least for rte_ethdevs[] and friends.
> >>>> Konstantin
> >>>
> >>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> >>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> >>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> >>> any regressions.
> >>> That could still be flat array with max_size specified at application startup.
> >>> 2. Hide rest of rte_ethdev struct in .c.
> >>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> >>> (flat array, vector, hash, linked list) without ABI/API breakages.
> >>>
> >>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> >>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> >>> Probably some macro can be provided to simplify it.
> >>>
> >>
> >> We are already planning some tasks for ABI stability for v21.11, I think
> >> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
> >> internal data.
> >
> > Ok, sounds good.
> >
> >>
> >>> The only significant complication I can foresee with implementing that approach -
> >>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> >>> (to avoid extra indirection for callback implementation).
> >>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> >>>
> >>
> >> What do you think split Rx/Tx callback into its own struct too?
> >>
> >> Overall 'rte_eth_dev' can be split into three as:
> >> 1. rte_eth_dev
> >> 2. rte_eth_dev_burst
> >> 3. rte_eth_dev_cb
> >>
> >> And we can hide 1 from applications even with the inline functions.
> >
> > As discussed off-line, I think:
> > it is possible.
> > My absolute preference would be to have just 1/2 (with CB hidden).
> 
> How can we hide the callbacks since they are used by inline burst functions.

I probably owe a better explanation of what I meant in the first mail;
otherwise it sounds confusing.
I'll try to write a more detailed one in the next few days.

> > But even with 1/2/3 in place I think it would be  a good step forward.
> > Probably worth to start with 1/2/3 first and then see how difficult it
> > would be to switch to 1/2.
> 
> What do you mean by switch to 1/2?

When we'll have just:
1. rte_eth_dev (hidden in .c)
2. rte_eth_dev_burst (visible)

And no specific public struct/array for callbacks - they will be hidden in rte_eth_dev.

> 
> If we keep having inline functions, and split struct as above three structs, we
> can only hide 1, and 2/3 will be still visible to apps because of inline
> functions. This way we will be able to hide more still having same performance.

I understand that, and as I said above - I think it is a good step forward.
Though even better would be to hide rte_eth_dev_cb too. 

> 
> > Do you plan to start working on it?
> >
> 
> We are gathering the list of the tasks for the ABI stability, most probably they
> will be worked on during v21.11. I can take this one.

Cool, please keep me in the loop.
I'll try to free some cycles for 21.11 to get involved and help (if needed, of course).
Konstantin
Ferruh Yigit June 18, 2021, 10:47 a.m. UTC | #42
On 6/18/2021 10:14 AM, Morten Brørup wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
>> Konstantin
>> Sent: Thursday, 17 June 2021 19.06
>>
>> I think it would be better to change rx_pkt_burst() to accept port_id
>> and queue_id, instead of void *.
> 
> Current:
> 
> typedef uint16_t (*eth_rx_burst_t)(void *rxq,
> 				   struct rte_mbuf **rx_pkts,
> 				   uint16_t nb_pkts);
> 
>> I.E:
>> typedef uint16_t (*eth_rx_burst_t)(uint16_t port_id,
>>     uint16_t queue_id,
>>     struct rte_mbuf **rx_pkts,
>>     uint16_t nb_pkts);
>>
>> And we can do actual de-referencing of private rxq data inside the
>> actual rx function.
> 
> Good idea, if it can be done without a performance cost.
> 
> The X64 calling convention allows up to 4 parameters passed as registers, so the added parameter should not be a problem.
> 
> 
> Another thing:
> 
> I just noticed that struct rte_eth_dev_data has "void **rx_queues;" (and similarly for tx_queues).
> 
> That should be "void *rx_queues[RTE_MAX_QUEUES_PER_PORT];", like in all the other ethdev structures.
> 

Why have a fixed size rx_queues array? It is already allocated dynamically in
'rte_eth_dev_configure()' based on actual Rx/Tx queue number.

We are already trying to get rid of compile-time fixed array sizes, so I think
it is better to keep it as it is.

Also, this would increase the struct size.

> The same structure even has "uint8_t rx_queue_state[RTE_MAX_QUEUES_PER_PORT];", so it's following two different conventions.
> 

I wonder if we should switch these to dynamic allocation too.
Ferruh Yigit June 18, 2021, 10:49 a.m. UTC | #43
On 6/18/2021 11:41 AM, Ananyev, Konstantin wrote:
> 
>>>>>>>
>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>
>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>> because of cache locality
>>>>>>>>>> and because there is a single pointer to dereference.
>>>>>>>>>> The only drawback is the lack of flexibility:
>>>>>>>>>> the size of such an array cannot be increase at runtime.
>>>>>>>>>>
>>>>>>>>>> An approach to this problem is to allocate the array at runtime,
>>>>>>>>>> being as efficient as static arrays, but still limited to a maximum.
>>>>>>>>>>
>>>>>>>>>> That's why the API rte_parray is introduced,
>>>>>>>>>> allowing to declare an array of pointer which can be resized
>>>>>>>>>> dynamically
>>>>>>>>>> and automatically at runtime while keeping a good read performance.
>>>>>>>>>>
>>>>>>>>>> After resize, the previous array is kept until the next resize
>>>>>>>>>> to avoid crashs during a read without any lock.
>>>>>>>>>>
>>>>>>>>>> Each element is a pointer to a memory chunk dynamically allocated.
>>>>>>>>>> This is not good for cache locality but it allows to keep the same
>>>>>>>>>> memory per element, no matter how the array is resized.
>>>>>>>>>> Cache locality could be improved with mempools.
>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>> to read an element.
>>>>>>>>>>
>>>>>>>>>> There is not much locks, so the API is for internal use only.
>>>>>>>>>> This API may be used to completely remove some compilation-time
>>>>>>>>>> maximums.
>>>>>>>>>
>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>
>>>>>>>>> I probably already mentioned that I prefer "embedded style programming" with fixed size arrays, rather than runtime
>> configurability.
>>>>>> It's
>>>>>>> my personal opinion, and the DPDK Tech Board clearly prefers reducing the amount of compile time configurability, so there is no
>> way
>>>> for
>>>>>>> me to stop this progress, and I do not intend to oppose to this library. :-)
>>>>>>>>>
>>>>>>>>> This library is likely to become a core library of DPDK, so I think it is important getting it right. Could you please mention a few
>>>>>> examples
>>>>>>> where you think this internal library should be used, and where it should not be used. Then it is easier to discuss if the border line
>>>> between
>>>>>>> control path and data plane is correct. E.g. this library is not intended to be used for dynamically sized packet queues that grow and
>>>> shrink
>>>>>> in
>>>>>>> the fast path.
>>>>>>>>>
>>>>>>>>> If the library becomes a core DPDK library, it should probably be public instead of internal. E.g. if the library is used to make
>>>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some applications might also need dynamically sized arrays for
>> their
>>>>>>> application specific per-port runtime data, and this library could serve that purpose too.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks Thomas for starting this discussion and Morten for follow-up.
>>>>>>>>
>>>>>>>> My thinking is as follows, and I'm particularly keeping in mind the cases
>>>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here.
>>>>>>>>
>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not convinced that
>>>>>>>> we should switch away from the flat arrays or that we need fully dynamic
>>>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest a half-way
>>>>>>>> house here, where we keep the ethdevs as an array, but one allocated/sized
>>>>>>>> at runtime rather than statically. This would allow us to have a
>>>>>>>> compile-time default value, but, for use cases that need it, allow use of a
>>>>>>>> flag e.g.  "max-ethdevs" to change the size of the parameter given to the
>>>>>>>> malloc call for the array.  This max limit could then be provided to apps
>>>>>>>> too if they want to match any array sizes. [Alternatively those apps could
>>>>>>>> check the provided size and error out if the size has been increased beyond
>>>>>>>> what the app is designed to use?]. There would be no extra dereferences per
>>>>>>>> rx/tx burst call in this scenario so performance should be the same as
>>>>>>>> before (potentially better if array is in hugepage memory, I suppose).
>>>>>>>
>>>>>>> I think we need some benchmarks to decide what is the best tradeoff.
>>>>>>> I spent time on this implementation, but sorry I won't have time for benchmarks.
>>>>>>> Volunteers?
>>>>>>
>>>>>> I had only a quick look at your approach so far.
>>>>>> But from what I can read, in MT environment your suggestion will require
>>>>>> extra synchronization for each read-write access to such parray element (lock, rcu, ...).
>>>>>> I think what Bruce suggests will be much ligther, easier to implement and less error prone.
>>>>>> At least for rte_ethdevs[] and friends.
>>>>>> Konstantin
>>>>>
>>>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
>>>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
>>>>> any regressions.
>>>>> That could still be flat array with max_size specified at application startup.
>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>
>>>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
>>>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
>>>>> Probably some macro can be provided to simplify it.
>>>>>
>>>>
>>>> We are already planning some tasks for ABI stability for v21.11, I think
>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
>>>> internal data.
>>>
>>> Ok, sounds good.
>>>
>>>>
>>>>> The only significant complication I can foresee with implementing that approach -
>>>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
>>>>> (to avoid extra indirection for callback implementation).
>>>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
>>>>>
>>>>
>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>
>>>> Overall 'rte_eth_dev' can be split into three as:
>>>> 1. rte_eth_dev
>>>> 2. rte_eth_dev_burst
>>>> 3. rte_eth_dev_cb
>>>>
>>>> And we can hide 1 from applications even with the inline functions.
>>>
>>> As discussed off-line, I think:
>>> it is possible.
>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>
>> How can we hide the callbacks since they are used by inline burst functions.
> 
> I probably I owe a better explanation to what I meant in first mail.
> Otherwise it sounds confusing.
> I'll try to write a more detailed one in next few days.
> 
>>> But even with 1/2/3 in place I think it would be  a good step forward.
>>> Probably worth to start with 1/2/3 first and then see how difficult it
>>> would be to switch to 1/2.
>>
>> What do you mean by switch to 1/2?
> 
> When we'll have just:
> 1. rte_eth_dev (hidden in .c)
> 2. rte_eth_dev_burst (visible)
> 
> And no specific public struct/array for callbacks - they will be hidden in rte_eth_dev.
> 

If we can hide them, agree this is better.

>>
>> If we keep having inline functions, and split struct as above three structs, we
>> can only hide 1, and 2/3 will be still visible to apps because of inline
>> functions. This way we will be able to hide more still having same performance.
> 
> I understand that, and as I said above - I think it is a good step forward.
> Though even better would be to hide rte_eth_dev_cb too.
> 
>>
>>> Do you plan to start working on it?
>>>
>>
>> We are gathering the list of the tasks for the ABI stability, most probably they
>> will be worked on during v21.11. I can take this one.
> 
> Cool, please keep me in a loop.
> I'll try to free some cycles for 21.11 to get involved and help (if needed off-course).

That would be great, thanks.
Morten Brørup June 18, 2021, 11:16 a.m. UTC | #44
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
> Sent: Friday, 18 June 2021 12.47
> 
> On 6/18/2021 10:14 AM, Morten Brørup wrote:
> > Another thing:
> >
> > I just noticed that struct rte_eth_dev_data has "void **rx_queues;"
> (and similarly for tx_queues).
> >
> > That should be "void *rx_queues[RTE_MAX_QUEUES_PER_PORT];", like in
> all the other ethdev structures.
> >
> 
> Why have a fixed size rx_queues array? It is already allocated
> dynamically in
> 'rte_eth_dev_configure()' based on actual Rx/Tx queue number.

For performance reasons, I guess.

> 
> We are already trying to get rid of compile time fixed array sizes, so
> I think
> better to keep it as it is.
> 
> Also this will increase the strcut size.
> 
> > The same structure even has "uint8_t
> rx_queue_state[RTE_MAX_QUEUES_PER_PORT];", so it's following two
> different conventions.
> >
> 
> I wonder if we should should switch these to dynamic allocation too.
> 

I agree that we should generally stick with one or the other, either fixed or dynamically allocated arrays, for consistency reasons. However, sometimes performance trumps consistency. :-)

If we change more per-queue arrays to dynamic allocation, perhaps it would be beneficial to gather these fields into one per-queue struct, so struct rte_eth_dev_data holds one array instead of multiple parallel arrays. It depends on how they are being used. (Also, maybe there should be one array in each direction, so it is two arrays, not just one.)
Ananyev, Konstantin June 21, 2021, 11:06 a.m. UTC | #45
Hi everyone,
 
> > >>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> > >>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> > >>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> > >>> any regressions.
> > >>> That could still be flat array with max_size specified at application startup.
> > >>> 2. Hide rest of rte_ethdev struct in .c.
> > >>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> > >>> (flat array, vector, hash, linked list) without ABI/API breakages.
> > >>>
> > >>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> > >>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> > >>> Probably some macro can be provided to simplify it.
> > >>>
> > >>
> > >> We are already planning some tasks for ABI stability for v21.11, I think
> > >> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
> > >> internal data.
> > >
> > > Ok, sounds good.
> > >
> > >>
> > >>> The only significant complication I can foresee with implementing that approach -
> > >>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> > >>> (to avoid extra indirection for callback implementation).
> > >>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> > >>>
> > >>
> > >> What do you think split Rx/Tx callback into its own struct too?
> > >>
> > >> Overall 'rte_eth_dev' can be split into three as:
> > >> 1. rte_eth_dev
> > >> 2. rte_eth_dev_burst
> > >> 3. rte_eth_dev_cb
> > >>
> > >> And we can hide 1 from applications even with the inline functions.
> > >
> > > As discussed off-line, I think:
> > > it is possible.
> > > My absolute preference would be to have just 1/2 (with CB hidden).
> >
> > How can we hide the callbacks since they are used by inline burst functions.
> 
> I probably I owe a better explanation to what I meant in first mail.
> Otherwise it sounds confusing.
> I'll try to write a more detailed one in next few days.

Actually I gave it another thought over the weekend, and maybe we can
hide rte_eth_dev_cb in an even simpler way. I'd use eth_rx_burst() as
an example, but the same principle applies to the other 'fast' functions.

 1. Needed changes for PMDs rx_pkt_burst():
    a) change function prototype to accept 'uint16_t port_id' and 'uint16_t queue_id',
         instead of current 'void *'.
    b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog() function at return.
         This  inline function will do all CB calls for that queue.

To be more specific, let's say we have some PMD xyz with an Rx function:

uint16_t
xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
     struct xyz_rx_queue *rxq = rx_queue;
     uint16_t nb_rx = 0;

     /* do actual stuff here */
    ....
    return nb_rx; 
}

It will be transformed to:

uint16_t
xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
        struct xyz_rx_queue *rxq;
        uint16_t nb_rx;

        rxq = _rte_eth_rx_prolog(port_id, queue_id);
        if (rxq == NULL)
            return 0;
        nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
        return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts, nb_rx, nb_pkts);
}

And somewhere in ethdev_private.h:

static inline void *
_rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id)
{
   struct rte_eth_dev *dev = &rte_eth_devices[port_id];

#ifdef RTE_ETHDEV_DEBUG_RX
        RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
        RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);

        if (queue_id >= dev->data->nb_rx_queues) {
                RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
                return NULL;
        }
#endif
   return dev->data->rx_queues[queue_id];
}

static inline uint16_t
_rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts,
        uint16_t nb_rx, const uint16_t nb_pkts)
{
    struct rte_eth_dev *dev = &rte_eth_devices[port_id];

#ifdef RTE_ETHDEV_RXTX_CALLBACKS
        struct rte_eth_rxtx_callback *cb;

        /* __ATOMIC_RELEASE memory order was used when the
         * callback was inserted into the list.
         * Since there is a clear dependency between loading
         * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
         * not required.
         */
        cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
                                __ATOMIC_RELAXED);

        if (unlikely(cb != NULL)) {
                do {
                        nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
                                                nb_pkts, cb->param);
                        cb = cb->next;
                } while (cb != NULL);
        }
#endif

        rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
        return nb_rx;
}

Now, as you said above, in rte_ethdev.h we will keep only a flat array
with pointers to 'fast' functions:
struct {
        eth_rx_burst_t rx_pkt_burst;
        eth_tx_burst_t tx_pkt_burst;
        eth_tx_prep_t  tx_pkt_prepare;
        .....
} rte_eth_dev_burst[];

And rte_eth_rx_burst() will look like:

static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
                 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
    if (port_id >= RTE_MAX_ETHPORTS)
        return 0;
    return rte_eth_dev_burst[port_id].rx_pkt_burst(port_id, queue_id,
                                                   rx_pkts, nb_pkts);
}

Yes, it will require changes in *all* PMDs, but as I said before the changes will be mechanical ones.
Morten Brørup June 21, 2021, 12:10 p.m. UTC | #46
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> 
> > > How can we hide the callbacks since they are used by inline burst
> functions.
> >
> > I probably I owe a better explanation to what I meant in first mail.
> > Otherwise it sounds confusing.
> > I'll try to write a more detailed one in next few days.
> 
> Actually I gave it another thought over weekend, and might be we can
> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> an example, but the same principle applies to other 'fast' functions.
> 
>  1. Needed changes for PMDs rx_pkt_burst():
>     a) change function prototype to accept 'uint16_t port_id' and
> 'uint16_t queue_id',
>          instead of current 'void *'.
>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
> function at return.
>          This  inline function will do all CB calls for that queue.
> 
> To be more specific, let say we have some PMD: xyz with RX function:
> 
> uint16_t
> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> nb_pkts)
> {
>      struct xyz_rx_queue *rxq = rx_queue;
>      uint16_t nb_rx = 0;
> 
>      /* do actual stuff here */
>     ....
>     return nb_rx;
> }
> 
> It will be transformed to:
> 
> uint16_t
> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> **rx_pkts, uint16_t nb_pkts)
> {
>          struct xyz_rx_queue *rxq;
>          uint16_t nb_rx;
> 
>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
>          if (rxq == NULL)
>              return 0;
>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> nb_pkts);
> }
> 
> And somewhere in ethdev_private.h:
> 
> static inline void *
> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> {
>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> 
> #ifdef RTE_ETHDEV_DEBUG_RX
>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> 
>         if (queue_id >= dev->data->nb_rx_queues) {
>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> queue_id);
>                 return NULL;
>         }
> #endif
>   return dev->data->rx_queues[queue_id];
> }
> 
> static inline uint16_t
> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> **rx_pkts, const uint16_t nb_pkts);
> {
>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> 
> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>         struct rte_eth_rxtx_callback *cb;
> 
>         /* __ATOMIC_RELEASE memory order was used when the
>          * call back was inserted into the list.
>          * Since there is a clear dependency between loading
>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
>          * not required.
>          */
>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
>                                 __ATOMIC_RELAXED);
> 
>         if (unlikely(cb != NULL)) {
>                 do {
>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
> nb_rx,
>                                                 nb_pkts, cb->param);
>                         cb = cb->next;
>                 } while (cb != NULL);
>         }
> #endif
> 
>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
> nb_rx);
>         return nb_rx;
>  }

That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.

> 
> Now, as you said above, in rte_ethdev.h we will keep only a flat array
> with pointers to 'fast' functions:
> struct {
>      eth_rx_burst_t             rx_pkt_burst
>       eth_tx_burst_t             tx_pkt_burst;
>       eth_tx_prep_t              tx_pkt_prepare;
>      .....
> } rte_eth_dev_burst[];
> 
> And rte_eth_rx_burst() will look like:
> 
> static inline uint16_t
> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> {
>     if (port_id >= RTE_MAX_ETHPORTS)
>         return 0;
>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
> nb_pkts);
> }
> 
> Yes, it will require changes in *all* PMDs, but as I said before the
> changes will be a mechanic ones.
Ananyev, Konstantin June 21, 2021, 12:30 p.m. UTC | #47
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > Konstantin
> >
> > > > How can we hide the callbacks since they are used by inline burst
> > functions.
> > >
> > > I probably I owe a better explanation to what I meant in first mail.
> > > Otherwise it sounds confusing.
> > > I'll try to write a more detailed one in next few days.
> >
> > Actually I gave it another thought over weekend, and might be we can
> > hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> > an example, but the same principle applies to other 'fast' functions.
> >
> >  1. Needed changes for PMDs rx_pkt_burst():
> >     a) change function prototype to accept 'uint16_t port_id' and
> > 'uint16_t queue_id',
> >          instead of current 'void *'.
> >     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
> > function at return.
> >          This  inline function will do all CB calls for that queue.
> >
> > To be more specific, let say we have some PMD: xyz with RX function:
> >
> > uint16_t
> > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> > nb_pkts)
> > {
> >      struct xyz_rx_queue *rxq = rx_queue;
> >      uint16_t nb_rx = 0;
> >
> >      /* do actual stuff here */
> >     ....
> >     return nb_rx;
> > }
> >
> > It will be transformed to:
> >
> > uint16_t
> > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> > **rx_pkts, uint16_t nb_pkts)
> > {
> >          struct xyz_rx_queue *rxq;
> >          uint16_t nb_rx;
> >
> >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> >          if (rxq == NULL)
> >              return 0;
> >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> > nb_pkts);
> > }
> >
> > And somewhere in ethdev_private.h:
> >
> > static inline void *
> > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > {
> >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >
> > #ifdef RTE_ETHDEV_DEBUG_RX
> >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> >
> >         if (queue_id >= dev->data->nb_rx_queues) {
> >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> > queue_id);
> >                 return NULL;
> >         }
> > #endif
> >   return dev->data->rx_queues[queue_id];
> > }
> >
> > static inline uint16_t
> > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> > **rx_pkts, const uint16_t nb_pkts);
> > {
> >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >
> > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >         struct rte_eth_rxtx_callback *cb;
> >
> >         /* __ATOMIC_RELEASE memory order was used when the
> >          * call back was inserted into the list.
> >          * Since there is a clear dependency between loading
> >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> >          * not required.
> >          */
> >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> >                                 __ATOMIC_RELAXED);
> >
> >         if (unlikely(cb != NULL)) {
> >                 do {
> >                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
> > nb_rx,
> >                                                 nb_pkts, cb->param);
> >                         cb = cb->next;
> >                 } while (cb != NULL);
> >         }
> > #endif
> >
> >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
> > nb_rx);
> >         return nb_rx;
> >  }
> 
> That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But
> RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.

I believe it is for both - the user app and the DPDK drivers.
AFAIK, they both have to use the same rte_config.h, otherwise things will be broken.
If, let's say, RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then the
user wouldn't be able to add a callback in the first place.
BTW, such a change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
internal to the ethdev/PMD layer, which is a good thing from my perspective.

> 
> >
> > Now, as you said above, in rte_ethdev.h we will keep only a flat array
> > with pointers to 'fast' functions:
> > struct {
> >      eth_rx_burst_t             rx_pkt_burst
> >       eth_tx_burst_t             tx_pkt_burst;
> >       eth_tx_prep_t              tx_pkt_prepare;
> >      .....
> > } rte_eth_dev_burst[];
> >
> > And rte_eth_rx_burst() will look like:
> >
> > static inline uint16_t
> > rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
> >                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> > {
> >     if (port_id >= RTE_MAX_ETHPORTS)
> >         return 0;
> >    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
> > nb_pkts);
> > }
> >
> > Yes, it will require changes in *all* PMDs, but as I said before the
> > changes will be a mechanic ones.
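The flat per-port function-pointer table proposed above can be reduced to a self-contained, compilable sketch. All names below (`dev_burst`, `demo_rx_burst`, `dummy_recv_pkts`, `MAX_PORTS`) are hypothetical stand-ins for illustration, not DPDK API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 4

struct rte_mbuf; /* opaque here; only pointers to it are passed around */

typedef uint16_t (*rx_burst_t)(uint16_t port_id, uint16_t queue_id,
			       struct rte_mbuf **rx_pkts, uint16_t nb_pkts);

/* Public flat array of fast-path pointers, indexed by port_id.
 * Everything else about the device can stay hidden in a .c file. */
static struct {
	rx_burst_t rx_pkt_burst;
} dev_burst[MAX_PORTS];

/* Dummy PMD receive function: pretends every requested packet arrived. */
static uint16_t
dummy_recv_pkts(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	(void)port_id; (void)queue_id; (void)rx_pkts;
	return nb_pkts;
}

/* The inline wrapper needs only the flat array, not struct rte_eth_dev. */
static inline uint16_t
demo_rx_burst(uint16_t port_id, uint16_t queue_id,
	      struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
{
	if (port_id >= MAX_PORTS || dev_burst[port_id].rx_pkt_burst == NULL)
		return 0;
	return dev_burst[port_id].rx_pkt_burst(port_id, queue_id,
					       rx_pkts, nb_pkts);
}

/* Register a driver on port 0 only; port 1 stays empty. */
static uint16_t
demo(void)
{
	dev_burst[0].rx_pkt_burst = dummy_recv_pkts;
	return (uint16_t)(demo_rx_burst(0, 0, NULL, 32) +
			  demo_rx_burst(1, 0, NULL, 32)); /* 32 + 0 */
}
```

Because the table holds only function pointers, the rest of the device struct can change layout freely without breaking the inline wrapper's ABI, which is the point of the proposal.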
Morten Brørup June 21, 2021, 1:28 p.m. UTC | #48
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> 
> >
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > Konstantin
> > >
> > > > > How can we hide the callbacks since they are used by inline
> burst
> > > functions.
> > > >
> > > > I probably I owe a better explanation to what I meant in first
> mail.
> > > > Otherwise it sounds confusing.
> > > > I'll try to write a more detailed one in next few days.
> > >
> > > Actually I gave it another thought over weekend, and might be we
> can
> > > hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst()
> as
> > > an example, but the same principle applies to other 'fast'
> functions.
> > >
> > >  1. Needed changes for PMDs rx_pkt_burst():
> > >     a) change function prototype to accept 'uint16_t port_id' and
> > > 'uint16_t queue_id',
> > >          instead of current 'void *'.
> > >     b) Each PMD rx_pkt_burst() will have to call
> rte_eth_rx_epilog()
> > > function at return.
> > >          This  inline function will do all CB calls for that queue.
> > >
> > > To be more specific, let say we have some PMD: xyz with RX
> function:
> > >
> > > uint16_t
> > > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> > > nb_pkts)
> > > {
> > >      struct xyz_rx_queue *rxq = rx_queue;
> > >      uint16_t nb_rx = 0;
> > >
> > >      /* do actual stuff here */
> > >     ....
> > >     return nb_rx;
> > > }
> > >
> > > It will be transformed to:
> > >
> > > uint16_t
> > > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> > > **rx_pkts, uint16_t nb_pkts)
> > > {
> > >          struct xyz_rx_queue *rxq;
> > >          uint16_t nb_rx;
> > >
> > >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> > >          if (rxq == NULL)
> > >              return 0;
> > >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> > >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> > > nb_pkts);
> > > }
> > >
> > > And somewhere in ethdev_private.h:
> > >
> > > static inline void *
> > > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > > {
> > >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > >
> > > #ifdef RTE_ETHDEV_DEBUG_RX
> > >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> > >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> > >
> > >         if (queue_id >= dev->data->nb_rx_queues) {
> > >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> > > queue_id);
> > >                 return NULL;
> > >         }
> > > #endif
> > >   return dev->data->rx_queues[queue_id];
> > > }
> > >
> > > static inline uint16_t
> > > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct
> rte_mbuf
> > > **rx_pkts, const uint16_t nb_pkts);
> > > {
> > >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > >
> > > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> > >         struct rte_eth_rxtx_callback *cb;
> > >
> > >         /* __ATOMIC_RELEASE memory order was used when the
> > >          * call back was inserted into the list.
> > >          * Since there is a clear dependency between loading
> > >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> > >          * not required.
> > >          */
> > >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> > >                                 __ATOMIC_RELAXED);
> > >
> > >         if (unlikely(cb != NULL)) {
> > >                 do {
> > >                         nb_rx = cb->fn.rx(port_id, queue_id,
> rx_pkts,
> > > nb_rx,
> > >                                                 nb_pkts, cb-
> >param);
> > >                         cb = cb->next;
> > >                 } while (cb != NULL);
> > >         }
> > > #endif
> > >
> > >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void
> **)rx_pkts,
> > > nb_rx);
> > >         return nb_rx;
> > >  }
> >
> > That would make the compiler inline _rte_eth_rx_epilog() into the
> driver when compiling the DPDK library. But
> > RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application
> developer to use when compiling the DPDK application.
> 
> I believe it is for both - user app and DPDK drivers.
> AFAIK, they both have to use the same rte_config.h, otherwise things
> will be broken.
> If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> user wouldn't be able to add a callback at first place.

In the case of RTE_ETHDEV_RXTX_CALLBACKS, it is independent:

If it is not compiled with the DPDK library, attempts to install callbacks from the application will fail with ENOTSUP.

If it is not compiled with the DPDK application, no time will be spent trying to determine if there are any callbacks to call.

> BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> internal for ethdev/PMD layer, which is a good thing from my
> perspective.

If it can be done without degrading performance for applications not using callbacks.
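The two independent effects of the flag described above can be modeled in a small sketch. The boolean parameters here are hypothetical stand-ins for the library-side and application-side RTE_ETHDEV_RXTX_CALLBACKS #ifdefs, made runtime flags only so both branches can be exercised in one program:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Library side: if ethdev was built without callback support, the
 * install API fails with -ENOTSUP; in real DPDK this is an #ifdef. */
static int
add_rx_callback(bool lib_compiled_with_cbs)
{
	if (!lib_compiled_with_cbs)
		return -ENOTSUP; /* callback cannot be installed at all */
	return 0;
}

/* Application side: if the app was built without the flag, its inline
 * rx_burst never even loads the callback list pointer. */
static int
app_checks_callbacks(bool app_compiled_with_cbs)
{
	return app_compiled_with_cbs; /* 0 => zero cycles spent on CBs */
}
```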
Ferruh Yigit June 21, 2021, 2:05 p.m. UTC | #49
On 6/21/2021 12:06 PM, Ananyev, Konstantin wrote:
> 
> Hi everyone,
> 
>>>>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
>>>>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
>>>>>> any regressions.
>>>>>> That could still be flat array with max_size specified at application startup.
>>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>>
>>>>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
>>>>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
>>>>>> Probably some macro can be provided to simplify it.
>>>>>>
>>>>>
>>>>> We are already planning some tasks for ABI stability for v21.11, I think
>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
>>>>> internal data.
>>>>
>>>> Ok, sounds good.
>>>>
>>>>>
>>>>>> The only significant complication I can foresee with implementing that approach -
>>>>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
>>>>>> (to avoid extra indirection for callback implementation).
>>>>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
>>>>>>
>>>>>
>>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>>
>>>>> Overall 'rte_eth_dev' can be split into three as:
>>>>> 1. rte_eth_dev
>>>>> 2. rte_eth_dev_burst
>>>>> 3. rte_eth_dev_cb
>>>>>
>>>>> And we can hide 1 from applications even with the inline functions.
>>>>
>>>> As discussed off-line, I think:
>>>> it is possible.
>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>
>>> How can we hide the callbacks since they are used by inline burst functions.
>>
>> I probably I owe a better explanation to what I meant in first mail.
>> Otherwise it sounds confusing.
>> I'll try to write a more detailed one in next few days.
> 
> Actually I gave it another thought over weekend, and might be we can
> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> an example, but the same principle applies to other 'fast' functions.
> 
>  1. Needed changes for PMDs rx_pkt_burst():
>     a) change function prototype to accept 'uint16_t port_id' and 'uint16_t queue_id',
>          instead of current 'void *'.
>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog() function at return.
>          This  inline function will do all CB calls for that queue.
> 
> To be more specific, let say we have some PMD: xyz with RX function:
> 
> uint16_t
> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> {
>      struct xyz_rx_queue *rxq = rx_queue;
>      uint16_t nb_rx = 0;
> 
>      /* do actual stuff here */
>     ....
>     return nb_rx;
> }
> 
> It will be transformed to:
> 
> uint16_t
> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> {
>          struct xyz_rx_queue *rxq;
>          uint16_t nb_rx;
> 
>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
>          if (rxq == NULL)
>              return 0;
>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts, nb_pkts);
> }
> 
> And somewhere in ethdev_private.h:
> 
> static inline void *
> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> {
>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> 
> #ifdef RTE_ETHDEV_DEBUG_RX
>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> 
>         if (queue_id >= dev->data->nb_rx_queues) {
>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
>                 return NULL;
>         }
> #endif
>   return dev->data->rx_queues[queue_id];
> }
> 
> static inline uint16_t
> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
> {
>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> 
> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>         struct rte_eth_rxtx_callback *cb;
> 
>         /* __ATOMIC_RELEASE memory order was used when the
>          * call back was inserted into the list.
>          * Since there is a clear dependency between loading
>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
>          * not required.
>          */
>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
>                                 __ATOMIC_RELAXED);
> 
>         if (unlikely(cb != NULL)) {
>                 do {
>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
>                                                 nb_pkts, cb->param);
>                         cb = cb->next;
>                 } while (cb != NULL);
>         }
> #endif
> 
>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
>         return nb_rx;
>  }
> 
> Now, as you said above, in rte_ethdev.h we will keep only a flat array
> with pointers to 'fast' functions:
> struct {
>      eth_rx_burst_t             rx_pkt_burst
>       eth_tx_burst_t             tx_pkt_burst;
>       eth_tx_prep_t              tx_pkt_prepare;
>      .....
> } rte_eth_dev_burst[];
> 
> And rte_eth_rx_burst() will look like:
> 
> static inline uint16_t
> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> {
>     if (port_id >= RTE_MAX_ETHPORTS)
>         return 0;
>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts, nb_pkts);
> }
> 
> Yes, it will require changes in *all* PMDs, but as I said before the changes will be a mechanic ones.
> 

I did not like the idea of pushing the responsibility for calling the Rx/Tx
callbacks to the drivers; I think it should be in the ethdev layer.

What about making 'rte_eth_rx_epilog' an API and calling it from
'rte_eth_rx_burst()'? That adds another function call for the Rx/Tx
callbacks, but shouldn't affect the Rx/Tx burst itself.
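That suggestion - keep the callback walk in the ethdev layer, behind a non-inline function that rte_eth_rx_burst() calls only when a callback is actually installed - might be sketched like this (all names hypothetical, not DPDK API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rte_mbuf;

typedef uint16_t (*rx_cb_t)(uint16_t port_id, uint16_t queue_id,
			    struct rte_mbuf **pkts, uint16_t nb_rx);

/* One callback slot per queue in this toy model; NULL means none. */
static rx_cb_t rx_cbs[4];

/* Out-of-line ethdev function that runs the callbacks. The extra
 * function call is paid only on bursts where a callback exists. */
static uint16_t
call_rx_callbacks(uint16_t port_id, uint16_t queue_id,
		  struct rte_mbuf **pkts, uint16_t nb_rx)
{
	return rx_cbs[queue_id](port_id, queue_id, pkts, nb_rx);
}

static inline uint16_t
demo_rx_burst(uint16_t port_id, uint16_t queue_id,
	      struct rte_mbuf **pkts, uint16_t nb_rx_from_pmd)
{
	uint16_t nb_rx = nb_rx_from_pmd; /* stands in for the PMD burst */

	if (rx_cbs[queue_id] != NULL) /* common path: one pointer test */
		nb_rx = call_rx_callbacks(port_id, queue_id, pkts, nb_rx);
	return nb_rx;
}

/* Example callback: drops the last packet of each burst. */
static uint16_t
drop_last_cb(uint16_t port_id, uint16_t queue_id,
	     struct rte_mbuf **pkts, uint16_t nb_rx)
{
	(void)port_id; (void)queue_id; (void)pkts;
	return nb_rx > 0 ? (uint16_t)(nb_rx - 1) : 0;
}

static uint16_t
demo(void)
{
	uint16_t before = demo_rx_burst(0, 0, NULL, 8); /* no CB: 8 */

	rx_cbs[0] = drop_last_cb;
	return (uint16_t)(before * 10 +
			  demo_rx_burst(0, 0, NULL, 8)); /* CB drops one: 7 */
}
```

With this shape, applications that install no callbacks pay a single predictable branch, and the callback list itself never has to appear in the inline fast path.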
Ferruh Yigit June 21, 2021, 2:10 p.m. UTC | #50
On 6/21/2021 1:30 PM, Ananyev, Konstantin wrote:
> 
>>
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
>>> Konstantin
>>>
>>>>> How can we hide the callbacks since they are used by inline burst
>>> functions.
>>>>
>>>> I probably I owe a better explanation to what I meant in first mail.
>>>> Otherwise it sounds confusing.
>>>> I'll try to write a more detailed one in next few days.
>>>
>>> Actually I gave it another thought over weekend, and might be we can
>>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
>>> an example, but the same principle applies to other 'fast' functions.
>>>
>>>  1. Needed changes for PMDs rx_pkt_burst():
>>>     a) change function prototype to accept 'uint16_t port_id' and
>>> 'uint16_t queue_id',
>>>          instead of current 'void *'.
>>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
>>> function at return.
>>>          This  inline function will do all CB calls for that queue.
>>>
>>> To be more specific, let say we have some PMD: xyz with RX function:
>>>
>>> uint16_t
>>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
>>> nb_pkts)
>>> {
>>>      struct xyz_rx_queue *rxq = rx_queue;
>>>      uint16_t nb_rx = 0;
>>>
>>>      /* do actual stuff here */
>>>     ....
>>>     return nb_rx;
>>> }
>>>
>>> It will be transformed to:
>>>
>>> uint16_t
>>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
>>> **rx_pkts, uint16_t nb_pkts)
>>> {
>>>          struct xyz_rx_queue *rxq;
>>>          uint16_t nb_rx;
>>>
>>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
>>>          if (rxq == NULL)
>>>              return 0;
>>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
>>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
>>> nb_pkts);
>>> }
>>>
>>> And somewhere in ethdev_private.h:
>>>
>>> static inline void *
>>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
>>> {
>>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>
>>> #ifdef RTE_ETHDEV_DEBUG_RX
>>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
>>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
>>>
>>>         if (queue_id >= dev->data->nb_rx_queues) {
>>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
>>> queue_id);
>>>                 return NULL;
>>>         }
>>> #endif
>>>   return dev->data->rx_queues[queue_id];
>>> }
>>>
>>> static inline uint16_t
>>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
>>> **rx_pkts, const uint16_t nb_pkts);
>>> {
>>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>
>>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>>>         struct rte_eth_rxtx_callback *cb;
>>>
>>>         /* __ATOMIC_RELEASE memory order was used when the
>>>          * call back was inserted into the list.
>>>          * Since there is a clear dependency between loading
>>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
>>>          * not required.
>>>          */
>>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
>>>                                 __ATOMIC_RELAXED);
>>>
>>>         if (unlikely(cb != NULL)) {
>>>                 do {
>>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
>>> nb_rx,
>>>                                                 nb_pkts, cb->param);
>>>                         cb = cb->next;
>>>                 } while (cb != NULL);
>>>         }
>>> #endif
>>>
>>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
>>> nb_rx);
>>>         return nb_rx;
>>>  }
>>
>> That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But
>> RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.
> 
> I believe it is for both - user app and DPDK drivers.
> AFAIK, they both have to use the same rte_config.h, otherwise things will be broken.
> If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> user wouldn't be able to add a callback at first place.
> BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> internal for ethdev/PMD layer, which is a good thing from my perspective.
> 

It is possible to use binary drivers (.so) as plugins. Currently the application can
decide whether or not to use Rx/Tx callbacks even with binary drivers, but this
change adds complexity to that use case.

>>
>>>
>>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
>>> with pointers to 'fast' functions:
>>> struct {
>>>      eth_rx_burst_t             rx_pkt_burst
>>>       eth_tx_burst_t             tx_pkt_burst;
>>>       eth_tx_prep_t              tx_pkt_prepare;
>>>      .....
>>> } rte_eth_dev_burst[];
>>>
>>> And rte_eth_rx_burst() will look like:
>>>
>>> static inline uint16_t
>>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
>>> {
>>>     if (port_id >= RTE_MAX_ETHPORTS)
>>>         return 0;
>>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
>>> nb_pkts);
>>> }
>>>
>>> Yes, it will require changes in *all* PMDs, but as I said before the
>>> changes will be a mechanic ones.
Ananyev, Konstantin June 21, 2021, 2:38 p.m. UTC | #51
> 
> On 6/21/2021 1:30 PM, Ananyev, Konstantin wrote:
> >
> >>
> >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> >>> Konstantin
> >>>
> >>>>> How can we hide the callbacks since they are used by inline burst
> >>> functions.
> >>>>
> >>>> I probably I owe a better explanation to what I meant in first mail.
> >>>> Otherwise it sounds confusing.
> >>>> I'll try to write a more detailed one in next few days.
> >>>
> >>> Actually I gave it another thought over weekend, and might be we can
> >>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> >>> an example, but the same principle applies to other 'fast' functions.
> >>>
> >>>  1. Needed changes for PMDs rx_pkt_burst():
> >>>     a) change function prototype to accept 'uint16_t port_id' and
> >>> 'uint16_t queue_id',
> >>>          instead of current 'void *'.
> >>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
> >>> function at return.
> >>>          This  inline function will do all CB calls for that queue.
> >>>
> >>> To be more specific, let say we have some PMD: xyz with RX function:
> >>>
> >>> uint16_t
> >>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> >>> nb_pkts)
> >>> {
> >>>      struct xyz_rx_queue *rxq = rx_queue;
> >>>      uint16_t nb_rx = 0;
> >>>
> >>>      /* do actual stuff here */
> >>>     ....
> >>>     return nb_rx;
> >>> }
> >>>
> >>> It will be transformed to:
> >>>
> >>> uint16_t
> >>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> >>> **rx_pkts, uint16_t nb_pkts)
> >>> {
> >>>          struct xyz_rx_queue *rxq;
> >>>          uint16_t nb_rx;
> >>>
> >>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> >>>          if (rxq == NULL)
> >>>              return 0;
> >>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> >>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> >>> nb_pkts);
> >>> }
> >>>
> >>> And somewhere in ethdev_private.h:
> >>>
> >>> static inline void *
> >>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> >>> {
> >>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>
> >>> #ifdef RTE_ETHDEV_DEBUG_RX
> >>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> >>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> >>>
> >>>         if (queue_id >= dev->data->nb_rx_queues) {
> >>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> >>> queue_id);
> >>>                 return NULL;
> >>>         }
> >>> #endif
> >>>   return dev->data->rx_queues[queue_id];
> >>> }
> >>>
> >>> static inline uint16_t
> >>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> >>> **rx_pkts, const uint16_t nb_pkts);
> >>> {
> >>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>
> >>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >>>         struct rte_eth_rxtx_callback *cb;
> >>>
> >>>         /* __ATOMIC_RELEASE memory order was used when the
> >>>          * call back was inserted into the list.
> >>>          * Since there is a clear dependency between loading
> >>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> >>>          * not required.
> >>>          */
> >>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> >>>                                 __ATOMIC_RELAXED);
> >>>
> >>>         if (unlikely(cb != NULL)) {
> >>>                 do {
> >>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
> >>> nb_rx,
> >>>                                                 nb_pkts, cb->param);
> >>>                         cb = cb->next;
> >>>                 } while (cb != NULL);
> >>>         }
> >>> #endif
> >>>
> >>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
> >>> nb_rx);
> >>>         return nb_rx;
> >>>  }
> >>
> >> That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But
> >> RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.
> >
> > I believe it is for both - user app and DPDK drivers.
> > AFAIK, they both have to use the same rte_config.h, otherwise things will be broken.
> > If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> > user wouldn't be able to add a callback at first place.
> > BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> > internal for ethdev/PMD layer, which is a good thing from my perspective.
> >
> 
> It is possible to use binary drivers (.so) as plugin. Currently application can
> decide to use or not use Rx/Tx callbacks even with binary drivers, but this
> change adds a complexity to this usecase.

Not sure I understand you here...
Can you explain a bit more what you mean?

> 
> >>
> >>>
> >>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
> >>> with pointers to 'fast' functions:
> >>> struct {
> >>>      eth_rx_burst_t             rx_pkt_burst
> >>>       eth_tx_burst_t             tx_pkt_burst;
> >>>       eth_tx_prep_t              tx_pkt_prepare;
> >>>      .....
> >>> } rte_eth_dev_burst[];
> >>>
> >>> And rte_eth_rx_burst() will look like:
> >>>
> >>> static inline uint16_t
> >>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
> >>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> >>> {
> >>>     if (port_id >= RTE_MAX_ETHPORTS)
> >>>         return 0;
> >>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
> >>> nb_pkts);
> >>> }
> >>>
> >>> Yes, it will require changes in *all* PMDs, but as I said before the
> >>> changes will be a mechanic ones.
Ananyev, Konstantin June 21, 2021, 2:42 p.m. UTC | #52
> >>>>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> >>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> >>>>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> >>>>>> any regressions.
> >>>>>> That could still be flat array with max_size specified at application startup.
> >>>>>> 2. Hide rest of rte_ethdev struct in .c.
> >>>>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> >>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
> >>>>>>
> >>>>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> >>>>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> >>>>>> Probably some macro can be provided to simplify it.
> >>>>>>
> >>>>>
> >>>>> We are already planning some tasks for ABI stability for v21.11, I think
> >>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
> >>>>> internal data.
> >>>>
> >>>> Ok, sounds good.
> >>>>
> >>>>>
> >>>>>> The only significant complication I can foresee with implementing that approach -
> >>>>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> >>>>>> (to avoid extra indirection for callback implementation).
> >>>>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> >>>>>>
> >>>>>
> >>>>> What do you think split Rx/Tx callback into its own struct too?
> >>>>>
> >>>>> Overall 'rte_eth_dev' can be split into three as:
> >>>>> 1. rte_eth_dev
> >>>>> 2. rte_eth_dev_burst
> >>>>> 3. rte_eth_dev_cb
> >>>>>
> >>>>> And we can hide 1 from applications even with the inline functions.
> >>>>
> >>>> As discussed off-line, I think:
> >>>> it is possible.
> >>>> My absolute preference would be to have just 1/2 (with CB hidden).
> >>>
> >>> How can we hide the callbacks since they are used by inline burst functions.
> >>
> >> I probably I owe a better explanation to what I meant in first mail.
> >> Otherwise it sounds confusing.
> >> I'll try to write a more detailed one in next few days.
> >
> > Actually I gave it another thought over weekend, and might be we can
> > hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> > an example, but the same principle applies to other 'fast' functions.
> >
> >  1. Needed changes for PMDs rx_pkt_burst():
> >     a) change function prototype to accept 'uint16_t port_id' and 'uint16_t queue_id',
> >          instead of current 'void *'.
> >     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog() function at return.
> >          This  inline function will do all CB calls for that queue.
> >
> > To be more specific, let say we have some PMD: xyz with RX function:
> >
> > uint16_t
> > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> > {
> >      struct xyz_rx_queue *rxq = rx_queue;
> >      uint16_t nb_rx = 0;
> >
> >      /* do actual stuff here */
> >     ....
> >     return nb_rx;
> > }
> >
> > It will be transformed to:
> >
> > uint16_t
> > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> > {
> >          struct xyz_rx_queue *rxq;
> >          uint16_t nb_rx;
> >
> >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> >          if (rxq == NULL)
> >              return 0;
> >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts, nb_pkts);
> > }
> >
> > And somewhere in ethdev_private.h:
> >
> > static inline void *
> > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > {
> >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >
> > #ifdef RTE_ETHDEV_DEBUG_RX
> >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> >
> >         if (queue_id >= dev->data->nb_rx_queues) {
> >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
> >                 return NULL;
> >         }
> > #endif
> >   return dev->data->rx_queues[queue_id];
> > }
> >
> > static inline uint16_t
> > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
> > {
> >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >
> > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >         struct rte_eth_rxtx_callback *cb;
> >
> >         /* __ATOMIC_RELEASE memory order was used when the
> >          * call back was inserted into the list.
> >          * Since there is a clear dependency between loading
> >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> >          * not required.
> >          */
> >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> >                                 __ATOMIC_RELAXED);
> >
> >         if (unlikely(cb != NULL)) {
> >                 do {
> >                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
> >                                                 nb_pkts, cb->param);
> >                         cb = cb->next;
> >                 } while (cb != NULL);
> >         }
> > #endif
> >
> >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> >         return nb_rx;
> >  }
> >
> > Now, as you said above, in rte_ethdev.h we will keep only a flat array
> > with pointers to 'fast' functions:
> > struct {
> >      eth_rx_burst_t             rx_pkt_burst
> >       eth_tx_burst_t             tx_pkt_burst;
> >       eth_tx_prep_t              tx_pkt_prepare;
> >      .....
> > } rte_eth_dev_burst[];
> >
> > And rte_eth_rx_burst() will look like:
> >
> > static inline uint16_t
> > rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
> >                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> > {
> >     if (port_id >= RTE_MAX_ETHPORTS)
> >         return 0;
> >    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts, nb_pkts);
> > }
> >
> > Yes, it will require changes in *all* PMDs, but as I said before the changes will be a mechanic ones.
> >
> 
> > I did not like the idea of pushing the responsibility of calling the Rx/Tx callbacks to the
> > drivers; I think it should be in the ethdev layer.

Well, I'd say it is an ethdev layer function that has to be called by PMD 😊

> 
> What about making 'rte_eth_rx_epilog' an API and call from 'rte_eth_rx_burst()',
> which will add another function call for Rx/Tx callback but shouldn't affect the
> Rx/Tx burst.

But then we either need to expose call-back information to the user or pay the penalty
of an extra function call, correct?
Ferruh Yigit June 21, 2021, 3:32 p.m. UTC | #53
On 6/21/2021 3:42 PM, Ananyev, Konstantin wrote:
> 
>>>>>>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
>>>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
>>>>>>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
>>>>>>>> any regressions.
>>>>>>>> That could still be flat array with max_size specified at application startup.
>>>>>>>> 2. Hide rest of rte_ethdev struct in .c.
>>>>>>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
>>>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
>>>>>>>>
>>>>>>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
>>>>>>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
>>>>>>>> Probably some macro can be provided to simplify it.
>>>>>>>>
>>>>>>>
>>>>>>> We are already planning some tasks for ABI stability for v21.11, I think
>>>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
>>>>>>> internal data.
>>>>>>
>>>>>> Ok, sounds good.
>>>>>>
>>>>>>>
>>>>>>>> The only significant complication I can foresee with implementing that approach -
>>>>>>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
>>>>>>>> (to avoid extra indirection for callback implementation).
>>>>>>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
>>>>>>>>
>>>>>>>
>>>>>>> What do you think split Rx/Tx callback into its own struct too?
>>>>>>>
>>>>>>> Overall 'rte_eth_dev' can be split into three as:
>>>>>>> 1. rte_eth_dev
>>>>>>> 2. rte_eth_dev_burst
>>>>>>> 3. rte_eth_dev_cb
>>>>>>>
>>>>>>> And we can hide 1 from applications even with the inline functions.
>>>>>>
>>>>>> As discussed off-line, I think:
>>>>>> it is possible.
>>>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>>>
>>>>> How can we hide the callbacks since they are used by inline burst functions.
>>>>
>>>> I probably I owe a better explanation to what I meant in first mail.
>>>> Otherwise it sounds confusing.
>>>> I'll try to write a more detailed one in next few days.
>>>
>>> Actually I gave it another thought over weekend, and might be we can
>>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
>>> an example, but the same principle applies to other 'fast' functions.
>>>
>>>  1. Needed changes for PMDs rx_pkt_burst():
>>>     a) change function prototype to accept 'uint16_t port_id' and 'uint16_t queue_id',
>>>          instead of current 'void *'.
>>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog() function at return.
>>>          This  inline function will do all CB calls for that queue.
>>>
>>> To be more specific, let say we have some PMD: xyz with RX function:
>>>
>>> uint16_t
>>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>>> {
>>>      struct xyz_rx_queue *rxq = rx_queue;
>>>      uint16_t nb_rx = 0;
>>>
>>>      /* do actual stuff here */
>>>     ....
>>>     return nb_rx;
>>> }
>>>
>>> It will be transformed to:
>>>
>>> uint16_t
>>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>>> {
>>>          struct xyz_rx_queue *rxq;
>>>          uint16_t nb_rx;
>>>
>>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
>>>          if (rxq == NULL)
>>>              return 0;
>>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
>>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts, nb_pkts);
>>> }
>>>
>>> And somewhere in ethdev_private.h:
>>>
>>> static inline void *
>>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
>>> {
>>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>
>>> #ifdef RTE_ETHDEV_DEBUG_RX
>>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
>>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
>>>
>>>         if (queue_id >= dev->data->nb_rx_queues) {
>>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
>>>                 return NULL;
>>>         }
>>> #endif
>>>   return dev->data->rx_queues[queue_id];
>>> }
>>>
>>> static inline uint16_t
>>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
>>> {
>>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>
>>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>>>         struct rte_eth_rxtx_callback *cb;
>>>
>>>         /* __ATOMIC_RELEASE memory order was used when the
>>>          * call back was inserted into the list.
>>>          * Since there is a clear dependency between loading
>>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
>>>          * not required.
>>>          */
>>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
>>>                                 __ATOMIC_RELAXED);
>>>
>>>         if (unlikely(cb != NULL)) {
>>>                 do {
>>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
>>>                                                 nb_pkts, cb->param);
>>>                         cb = cb->next;
>>>                 } while (cb != NULL);
>>>         }
>>> #endif
>>>
>>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
>>>         return nb_rx;
>>>  }
>>>
>>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
>>> with pointers to 'fast' functions:
>>> struct {
>>>      eth_rx_burst_t             rx_pkt_burst
>>>       eth_tx_burst_t             tx_pkt_burst;
>>>       eth_tx_prep_t              tx_pkt_prepare;
>>>      .....
>>> } rte_eth_dev_burst[];
>>>
>>> And rte_eth_rx_burst() will look like:
>>>
>>> static inline uint16_t
>>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
>>> {
>>>     if (port_id >= RTE_MAX_ETHPORTS)
>>>         return 0;
>>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts, nb_pkts);
>>> }
>>>
>>> Yes, it will require changes in *all* PMDs, but as I said before the changes will be a mechanic ones.
>>>
>>
>> I did not like the idea of pushing the responsibility of calling the Rx/Tx callbacks to the
>> drivers; I think it should be in the ethdev layer.
> 
> Well, I'd say it is an ethdev layer function that has to be called by PMD 😊
> 
>>
>> What about making 'rte_eth_rx_epilog' an API and call from 'rte_eth_rx_burst()',
>> which will add another function call for Rx/Tx callback but shouldn't affect the
>> Rx/Tx burst.
> 
> But then we either need to expose call-back information to the user or pay the penalty
> of an extra function call, correct?
> 

Right. As a middle ground, we can keep the Rx/Tx burst functions inline, but have
the Rx/Tx callback part of them as a function, so we get the hit only for callbacks.
Ananyev, Konstantin June 21, 2021, 3:37 p.m. UTC | #54
> 
> On 6/21/2021 3:42 PM, Ananyev, Konstantin wrote:
> >
> >>>>>>>> One more thought here - if we are talking about rte_ethdev[] in particular, I think  we can:
> >>>>>>>> 1. move public function pointers (rx_pkt_burst(), etc.) from rte_ethdev into a separate flat array.
> >>>>>>>> We can keep it public to still use inline functions for 'fast' calls rte_eth_rx_burst(), etc. to avoid
> >>>>>>>> any regressions.
> >>>>>>>> That could still be flat array with max_size specified at application startup.
> >>>>>>>> 2. Hide rest of rte_ethdev struct in .c.
> >>>>>>>> That will allow us to change the struct itself and the whole rte_ethdev[] table in a way we like
> >>>>>>>> (flat array, vector, hash, linked list) without ABI/API breakages.
> >>>>>>>>
> >>>>>>>> Yes, it would require all PMDs to change prototype for pkt_rx_burst() function
> >>>>>>>> (to accept port_id, queue_id instead of queue pointer), but the change is mechanical one.
> >>>>>>>> Probably some macro can be provided to simplify it.
> >>>>>>>>
> >>>>>>>
> >>>>>>> We are already planning some tasks for ABI stability for v21.11, I think
> >>>>>>> splitting 'struct rte_eth_dev' can be part of that task, it enables hiding more
> >>>>>>> internal data.
> >>>>>>
> >>>>>> Ok, sounds good.
> >>>>>>
> >>>>>>>
> >>>>>>>> The only significant complication I can foresee with implementing that approach -
> >>>>>>>> we'll need a an array of 'fast' function pointers per queue, not per device as we have now
> >>>>>>>> (to avoid extra indirection for callback implementation).
> >>>>>>>> Though as a bonus we'll have ability to use different RX/TX funcions per queue.
> >>>>>>>>
> >>>>>>>
> >>>>>>> What do you think split Rx/Tx callback into its own struct too?
> >>>>>>>
> >>>>>>> Overall 'rte_eth_dev' can be split into three as:
> >>>>>>> 1. rte_eth_dev
> >>>>>>> 2. rte_eth_dev_burst
> >>>>>>> 3. rte_eth_dev_cb
> >>>>>>>
> >>>>>>> And we can hide 1 from applications even with the inline functions.
> >>>>>>
> >>>>>> As discussed off-line, I think:
> >>>>>> it is possible.
> >>>>>> My absolute preference would be to have just 1/2 (with CB hidden).
> >>>>>
> >>>>> How can we hide the callbacks since they are used by inline burst functions.
> >>>>
> >>>> I probably I owe a better explanation to what I meant in first mail.
> >>>> Otherwise it sounds confusing.
> >>>> I'll try to write a more detailed one in next few days.
> >>>
> >>> Actually I gave it another thought over weekend, and might be we can
> >>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> >>> an example, but the same principle applies to other 'fast' functions.
> >>>
> >>>  1. Needed changes for PMDs rx_pkt_burst():
> >>>     a) change function prototype to accept 'uint16_t port_id' and 'uint16_t queue_id',
> >>>          instead of current 'void *'.
> >>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog() function at return.
> >>>          This  inline function will do all CB calls for that queue.
> >>>
> >>> To be more specific, let say we have some PMD: xyz with RX function:
> >>>
> >>> uint16_t
> >>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> >>> {
> >>>      struct xyz_rx_queue *rxq = rx_queue;
> >>>      uint16_t nb_rx = 0;
> >>>
> >>>      /* do actual stuff here */
> >>>     ....
> >>>     return nb_rx;
> >>> }
> >>>
> >>> It will be transformed to:
> >>>
> >>> uint16_t
> >>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> >>> {
> >>>          struct xyz_rx_queue *rxq;
> >>>          uint16_t nb_rx;
> >>>
> >>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> >>>          if (rxq == NULL)
> >>>              return 0;
> >>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> >>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts, nb_pkts);
> >>> }
> >>>
> >>> And somewhere in ethdev_private.h:
> >>>
> >>> static inline void *
> >>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> >>> {
> >>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>
> >>> #ifdef RTE_ETHDEV_DEBUG_RX
> >>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> >>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> >>>
> >>>         if (queue_id >= dev->data->nb_rx_queues) {
> >>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
> >>>                 return NULL;
> >>>         }
> >>> #endif
> >>>   return dev->data->rx_queues[queue_id];
> >>> }
> >>>
> >>> static inline uint16_t
> >>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
> >>> {
> >>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>
> >>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >>>         struct rte_eth_rxtx_callback *cb;
> >>>
> >>>         /* __ATOMIC_RELEASE memory order was used when the
> >>>          * call back was inserted into the list.
> >>>          * Since there is a clear dependency between loading
> >>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> >>>          * not required.
> >>>          */
> >>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> >>>                                 __ATOMIC_RELAXED);
> >>>
> >>>         if (unlikely(cb != NULL)) {
> >>>                 do {
> >>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
> >>>                                                 nb_pkts, cb->param);
> >>>                         cb = cb->next;
> >>>                 } while (cb != NULL);
> >>>         }
> >>> #endif
> >>>
> >>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> >>>         return nb_rx;
> >>>  }
> >>>
> >>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
> >>> with pointers to 'fast' functions:
> >>> struct {
> >>>      eth_rx_burst_t             rx_pkt_burst
> >>>       eth_tx_burst_t             tx_pkt_burst;
> >>>       eth_tx_prep_t              tx_pkt_prepare;
> >>>      .....
> >>> } rte_eth_dev_burst[];
> >>>
> >>> And rte_eth_rx_burst() will look like:
> >>>
> >>> static inline uint16_t
> >>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
> >>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> >>> {
> >>>     if (port_id >= RTE_MAX_ETHPORTS)
> >>>         return 0;
> >>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts, nb_pkts);
> >>> }
> >>>
> >>> Yes, it will require changes in *all* PMDs, but as I said before the changes will be a mechanic ones.
> >>>
> >>
> >> I did not like the idea of pushing the responsibility of calling the Rx/Tx callbacks to the
> >> drivers; I think it should be in the ethdev layer.
> >
> > Well, I'd say it is an ethdev layer function that has to be called by PMD 😊
> >
> >>
> >> What about making 'rte_eth_rx_epilog' an API and call from 'rte_eth_rx_burst()',
> >> which will add another function call for Rx/Tx callback but shouldn't affect the
> >> Rx/Tx burst.
> >
> > But then we either need to expose call-back information to the user or pay the penalty
> > for extra function call, correct?
> >
> 
> Right. As a middle ground, we can keep the Rx/Tx burst functions inline, but have
> the Rx/Tx callback part of them as a function, so we get the hit only for callbacks.

To avoid the hit we need to expose CB data to the user,
at least the number of callbacks currently installed for each queue.
Ferruh Yigit June 21, 2021, 3:56 p.m. UTC | #55
On 6/21/2021 3:38 PM, Ananyev, Konstantin wrote:
> 
>>
>> On 6/21/2021 1:30 PM, Ananyev, Konstantin wrote:
>>>
>>>>
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
>>>>> Konstantin
>>>>>
>>>>>>> How can we hide the callbacks since they are used by inline burst
>>>>> functions.
>>>>>>
>>>>>> I probably I owe a better explanation to what I meant in first mail.
>>>>>> Otherwise it sounds confusing.
>>>>>> I'll try to write a more detailed one in next few days.
>>>>>
>>>>> Actually I gave it another thought over weekend, and might be we can
>>>>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
>>>>> an example, but the same principle applies to other 'fast' functions.
>>>>>
>>>>>  1. Needed changes for PMDs rx_pkt_burst():
>>>>>     a) change function prototype to accept 'uint16_t port_id' and
>>>>> 'uint16_t queue_id',
>>>>>          instead of current 'void *'.
>>>>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
>>>>> function at return.
>>>>>          This  inline function will do all CB calls for that queue.
>>>>>
>>>>> To be more specific, let say we have some PMD: xyz with RX function:
>>>>>
>>>>> uint16_t
>>>>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
>>>>> nb_pkts)
>>>>> {
>>>>>      struct xyz_rx_queue *rxq = rx_queue;
>>>>>      uint16_t nb_rx = 0;
>>>>>
>>>>>      /* do actual stuff here */
>>>>>     ....
>>>>>     return nb_rx;
>>>>> }
>>>>>
>>>>> It will be transformed to:
>>>>>
>>>>> uint16_t
>>>>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
>>>>> **rx_pkts, uint16_t nb_pkts)
>>>>> {
>>>>>          struct xyz_rx_queue *rxq;
>>>>>          uint16_t nb_rx;
>>>>>
>>>>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
>>>>>          if (rxq == NULL)
>>>>>              return 0;
>>>>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
>>>>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
>>>>> nb_pkts);
>>>>> }
>>>>>
>>>>> And somewhere in ethdev_private.h:
>>>>>
>>>>> static inline void *
>>>>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
>>>>> {
>>>>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>>>
>>>>> #ifdef RTE_ETHDEV_DEBUG_RX
>>>>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
>>>>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
>>>>>
>>>>>         if (queue_id >= dev->data->nb_rx_queues) {
>>>>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
>>>>> queue_id);
>>>>>                 return NULL;
>>>>>         }
>>>>> #endif
>>>>>   return dev->data->rx_queues[queue_id];
>>>>> }
>>>>>
>>>>> static inline uint16_t
>>>>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
>>>>> **rx_pkts, const uint16_t nb_pkts);
>>>>> {
>>>>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>>>>
>>>>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>>>>>         struct rte_eth_rxtx_callback *cb;
>>>>>
>>>>>         /* __ATOMIC_RELEASE memory order was used when the
>>>>>          * call back was inserted into the list.
>>>>>          * Since there is a clear dependency between loading
>>>>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
>>>>>          * not required.
>>>>>          */
>>>>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
>>>>>                                 __ATOMIC_RELAXED);
>>>>>
>>>>>         if (unlikely(cb != NULL)) {
>>>>>                 do {
>>>>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
>>>>> nb_rx,
>>>>>                                                 nb_pkts, cb->param);
>>>>>                         cb = cb->next;
>>>>>                 } while (cb != NULL);
>>>>>         }
>>>>> #endif
>>>>>
>>>>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
>>>>> nb_rx);
>>>>>         return nb_rx;
>>>>>  }
>>>>
>>>> That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But
>>>> RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.
>>>
>>> I believe it is for both - user app and DPDK drivers.
>>> AFAIK, they both have to use the same rte_config.h, otherwise things will be broken.
>>> If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
>>> user wouldn't be able to add a callback at first place.
>>> BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
>>> internal for ethdev/PMD layer, which is a good thing from my perspective.
>>>
>>
>> It is possible to use binary drivers (.so) as plugin. Currently application can
>> decide to use or not use Rx/Tx callbacks even with binary drivers, but this
>> change adds a complexity to this usecase.
> 
> Not sure I understand you here...
> Can you explain a bit more what do you mean?
> 

Right now if I have a .so driver, I can decide whether or not to use the Rx/Tx
callbacks by compiling the application with the relevant config, and the .so will
work for both without change.

With the proposed change, if the .so was not built with Rx/Tx callbacks enabled,
the application won't be able to use them.

Application and driver config should be compatible, and adding more compile-time
config to drivers that is also used in libraries adds more points to keep in sync,
hence adding, I believe, more complexity to the binary drivers use case.

>>
>>>>
>>>>>
>>>>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
>>>>> with pointers to 'fast' functions:
>>>>> struct {
>>>>>      eth_rx_burst_t             rx_pkt_burst
>>>>>       eth_tx_burst_t             tx_pkt_burst;
>>>>>       eth_tx_prep_t              tx_pkt_prepare;
>>>>>      .....
>>>>> } rte_eth_dev_burst[];
>>>>>
>>>>> And rte_eth_rx_burst() will look like:
>>>>>
>>>>> static inline uint16_t
>>>>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>>>>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
>>>>> {
>>>>>     if (port_id >= RTE_MAX_ETHPORTS)
>>>>>         return 0;
>>>>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
>>>>> nb_pkts);
>>>>> }
>>>>>
>>>>> Yes, it will require changes in *all* PMDs, but as I said before the
>>>>> changes will be a mechanic ones.
>
Ananyev, Konstantin June 21, 2021, 6:17 p.m. UTC | #56
> 
> On 6/21/2021 3:38 PM, Ananyev, Konstantin wrote:
> >
> >>
> >> On 6/21/2021 1:30 PM, Ananyev, Konstantin wrote:
> >>>
> >>>>
> >>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> >>>>> Konstantin
> >>>>>
> >>>>>>> How can we hide the callbacks since they are used by inline burst
> >>>>> functions.
> >>>>>>
> >>>>>> I probably I owe a better explanation to what I meant in first mail.
> >>>>>> Otherwise it sounds confusing.
> >>>>>> I'll try to write a more detailed one in next few days.
> >>>>>
> >>>>> Actually I gave it another thought over weekend, and might be we can
> >>>>> hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst() as
> >>>>> an example, but the same principle applies to other 'fast' functions.
> >>>>>
> >>>>>  1. Needed changes for PMDs rx_pkt_burst():
> >>>>>     a) change function prototype to accept 'uint16_t port_id' and
> >>>>> 'uint16_t queue_id',
> >>>>>          instead of current 'void *'.
> >>>>>     b) Each PMD rx_pkt_burst() will have to call rte_eth_rx_epilog()
> >>>>> function at return.
> >>>>>          This  inline function will do all CB calls for that queue.
> >>>>>
> >>>>> To be more specific, let say we have some PMD: xyz with RX function:
> >>>>>
> >>>>> uint16_t
> >>>>> xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> >>>>> nb_pkts)
> >>>>> {
> >>>>>      struct xyz_rx_queue *rxq = rx_queue;
> >>>>>      uint16_t nb_rx = 0;
> >>>>>
> >>>>>      /* do actual stuff here */
> >>>>>     ....
> >>>>>     return nb_rx;
> >>>>> }
> >>>>>
> >>>>> It will be transformed to:
> >>>>>
> >>>>> uint16_t
> >>>>> xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> >>>>> **rx_pkts, uint16_t nb_pkts)
> >>>>> {
> >>>>>          struct xyz_rx_queue *rxq;
> >>>>>          uint16_t nb_rx;
> >>>>>
> >>>>>          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> >>>>>          if (rxq == NULL)
> >>>>>              return 0;
> >>>>>          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> >>>>>          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> >>>>> nb_pkts);
> >>>>> }
> >>>>>
> >>>>> And somewhere in ethdev_private.h:
> >>>>>
> >>>>> static inline void *
> >>>>> _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> >>>>> {
> >>>>>    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>>>
> >>>>> #ifdef RTE_ETHDEV_DEBUG_RX
> >>>>>         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> >>>>>         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> >>>>>
> >>>>>         if (queue_id >= dev->data->nb_rx_queues) {
> >>>>>                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> >>>>> queue_id);
> >>>>>                 return NULL;
> >>>>>         }
> >>>>> #endif
> >>>>>   return dev->data->rx_queues[queue_id];
> >>>>> }
> >>>>>
> >>>>> static inline uint16_t
> >>>>> _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> >>>>> **rx_pkts, const uint16_t nb_pkts);
> >>>>> {
> >>>>>     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>>>>
> >>>>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> >>>>>         struct rte_eth_rxtx_callback *cb;
> >>>>>
> >>>>>         /* __ATOMIC_RELEASE memory order was used when the
> >>>>>          * call back was inserted into the list.
> >>>>>          * Since there is a clear dependency between loading
> >>>>>          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> >>>>>          * not required.
> >>>>>          */
> >>>>>         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> >>>>>                                 __ATOMIC_RELAXED);
> >>>>>
> >>>>>         if (unlikely(cb != NULL)) {
> >>>>>                 do {
> >>>>>                         nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts,
> >>>>> nb_rx,
> >>>>>                                                 nb_pkts, cb->param);
> >>>>>                         cb = cb->next;
> >>>>>                 } while (cb != NULL);
> >>>>>         }
> >>>>> #endif
> >>>>>
> >>>>>         rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts,
> >>>>> nb_rx);
> >>>>>         return nb_rx;
> >>>>>  }
> >>>>
> >>>> That would make the compiler inline _rte_eth_rx_epilog() into the driver when compiling the DPDK library. But
> >>>> RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application developer to use when compiling the DPDK application.
> >>>
> >>> I believe it is for both - user app and DPDK drivers.
> >>> AFAIK, they both have to use the same rte_config.h, otherwise things will be broken.
> >>> If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> >>> user wouldn't be able to add a callback at first place.
> >>> BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> >>> internal for ethdev/PMD layer, which is a good thing from my perspective.
> >>>
> >>
> >> It is possible to use binary drivers (.so) as plugin. Currently application can
> >> decide to use or not use Rx/Tx callbacks even with binary drivers, but this
> >> change adds a complexity to this usecase.
> >
> > Not sure I understand you here...
> > Can you explain a bit more what do you mean?
> >
> 
> Right now if I have a .so driver, I can decide whether or not to use the Rx/Tx
> callbacks by compiling the application with the relevant config, and the .so will
> work for both without change.

True.

> With the proposed change, if the .so was not built with Rx/Tx callbacks enabled,
> the application won't be able to use them.
> 
> Application and driver config should be compatible, and adding more compile-time
> config to drivers that is also used in libraries adds more points to keep in sync,
> hence adding, I believe, more complexity to the binary drivers use case.

I agree - right now PMDs don't use RTE_ETHDEV_RXTX_CALLBACKS,
and with that proposal we add an extra config dependency to the PMD.
It also makes the PMD even more tightly coupled with rte_ethdev.
Though, is that really an obstacle?
From my understanding, DPDK libs and PMDs have to be built with the same config anyway.
Not following this rule can cause all sorts of trouble.

> >>
> >>>>
> >>>>>
> >>>>> Now, as you said above, in rte_ethdev.h we will keep only a flat array
> >>>>> with pointers to 'fast' functions:
> >>>>> struct {
> >>>>>      eth_rx_burst_t             rx_pkt_burst
> >>>>>       eth_tx_burst_t             tx_pkt_burst;
> >>>>>       eth_tx_prep_t              tx_pkt_prepare;
> >>>>>      .....
> >>>>> } rte_eth_dev_burst[];
> >>>>>
> >>>>> And rte_eth_rx_burst() will look like:
> >>>>>
> >>>>> static inline uint16_t
> >>>>> rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
> >>>>>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
> >>>>> {
> >>>>>     if (port_id >= RTE_MAX_ETHPORTS)
> >>>>>         return 0;
> >>>>>    return rte_eth_dev_burst[port_id](port_id, queue_id, rx_pkts,
> >>>>> nb_pkts);
> >>>>> }
> >>>>>
> >>>>> Yes, it will require changes in *all* PMDs, but as I said before the
> >>>>> changes will be a mechanic ones.
> >
Ananyev, Konstantin June 22, 2021, 8:33 a.m. UTC | #57
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > Konstantin
> >
> > >
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > > Konstantin
> > > >
> > > > > > How can we hide the callbacks since they are used by inline
> > burst
> > > > functions.
> > > > >
> > > > > I probably I owe a better explanation to what I meant in first
> > mail.
> > > > > Otherwise it sounds confusing.
> > > > > I'll try to write a more detailed one in next few days.
> > > >
> > > > Actually I gave it another thought over weekend, and might be we
> > can
> > > > hide rte_eth_dev_cb even in a simpler way. I'd use eth_rx_burst()
> > as
> > > > an example, but the same principle applies to other 'fast'
> > functions.
> > > >
> > > >  1. Needed changes for PMDs rx_pkt_burst():
> > > >     a) change function prototype to accept 'uint16_t port_id' and
> > > > 'uint16_t queue_id',
> > > >          instead of current 'void *'.
> > > >     b) Each PMD rx_pkt_burst() will have to call
> > rte_eth_rx_epilog()
> > > > function at return.
> > > >          This  inline function will do all CB calls for that queue.
> > > >
> > > > To be more specific, let say we have some PMD: xyz with RX
> > function:
> > > >
> > > > uint16_t
> > > > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t
> > > > nb_pkts)
> > > > {
> > > >      struct xyz_rx_queue *rxq = rx_queue;
> > > >      uint16_t nb_rx = 0;
> > > >
> > > >      /* do actual stuff here */
> > > >     ....
> > > >     return nb_rx;
> > > > }
> > > >
> > > > It will be transformed to:
> > > >
> > > > uint16_t
> > > > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct rte_mbuf
> > > > **rx_pkts, uint16_t nb_pkts)
> > > > {
> > > >          struct xyz_rx_queue *rxq;
> > > >          uint16_t nb_rx;
> > > >
> > > >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> > > >          if (rxq == NULL)
> > > >              return 0;
> > > >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> > > >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> > > > nb_pkts);
> > > > }
> > > >
> > > > And somewhere in ethdev_private.h:
> > > >
> > > > static inline void *
> > > > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > > > {
> > > >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > >
> > > > #ifdef RTE_ETHDEV_DEBUG_RX
> > > >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> > > >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> > > >
> > > >         if (queue_id >= dev->data->nb_rx_queues) {
> > > >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> > > > queue_id);
> > > >                 return NULL;
> > > >         }
> > > > #endif
> > > >   return dev->data->rx_queues[queue_id];
> > > > }
> > > >
> > > > static inline uint16_t
> > > > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct
> > rte_mbuf
> > > > **rx_pkts, const uint16_t nb_pkts);
> > > > {
> > > >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > >
> > > > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> > > >         struct rte_eth_rxtx_callback *cb;
> > > >
> > > >         /* __ATOMIC_RELEASE memory order was used when the
> > > >          * call back was inserted into the list.
> > > >          * Since there is a clear dependency between loading
> > > >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory order is
> > > >          * not required.
> > > >          */
> > > >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> > > >                                 __ATOMIC_RELAXED);
> > > >
> > > >         if (unlikely(cb != NULL)) {
> > > >                 do {
> > > >                         nb_rx = cb->fn.rx(port_id, queue_id,
> > rx_pkts,
> > > > nb_rx,
> > > >                                                 nb_pkts, cb-
> > >param);
> > > >                         cb = cb->next;
> > > >                 } while (cb != NULL);
> > > >         }
> > > > #endif
> > > >
> > > >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void
> > **)rx_pkts,
> > > > nb_rx);
> > > >         return nb_rx;
> > > >  }
> > >
> > > That would make the compiler inline _rte_eth_rx_epilog() into the
> > driver when compiling the DPDK library. But
> > > RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application
> > developer to use when compiling the DPDK application.
> >
> > I believe it is for both - user app and DPDK drivers.
> > AFAIK, they both have to use the same rte_config.h, otherwise things
> > will be broken.
> > If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> > user wouldn't be able to add a callback at first place.
> 
> In the case of RTE_ETHDEV_RXTX_CALLBACKS, it is independent:

Not really.
There are a few libraries within DPDK that do rely on rx/tx callbacks:
bpf, latencystat, pdump, power.
With the approach above, their functionality will be broken -
setup functions will return success, but the actual callbacks will not be invoked.
On the other hand, some libraries invoke rx/tx burst on their own: ip-pipeline, graph.
For them, callback invocation will continue to work even when
RTE_ETHDEV_RXTX_CALLBACKS is disabled in the app.
In general, building the DPDK libs and the user app with different rte_config.h is a really bad idea.
It might work in some cases, but I believe it is not supported, and users should not rely on it.
If a user needs to disable RTE_ETHDEV_RXTX_CALLBACKS in his app, then the proper way would be:
- update rte_config.h
- rebuild both DPDK and the app with the new config

> 
> If it is not compiled with the DPDK library, attempts to install callbacks from the application will fail with ENOTSUP.
> 
> If it is not compiled with the DPDK application, no time will be spent trying to determine if any there are any callbacks to call.
> 
> > BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> > internal for ethdev/PMD layer, which is a good thing from my
> > perspective.
> 
> If it can be done without degrading performance for applications not using callbacks.
Morten Brørup June 22, 2021, 10:01 a.m. UTC | #58
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> 
> >
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > Konstantin
> > >
> > > >
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > > > Konstantin
> > > > >
> > > > > > > How can we hide the callbacks since they are used by inline
> > > burst
> > > > > functions.
> > > > > >
> > > > > > I probably I owe a better explanation to what I meant in
> first
> > > mail.
> > > > > > Otherwise it sounds confusing.
> > > > > > I'll try to write a more detailed one in next few days.
> > > > >
> > > > > Actually I gave it another thought over weekend, and might be
> we
> > > can
> > > > > hide rte_eth_dev_cb even in a simpler way. I'd use
> eth_rx_burst()
> > > as
> > > > > an example, but the same principle applies to other 'fast'
> > > functions.
> > > > >
> > > > >  1. Needed changes for PMDs rx_pkt_burst():
> > > > >     a) change function prototype to accept 'uint16_t port_id'
> and
> > > > > 'uint16_t queue_id',
> > > > >          instead of current 'void *'.
> > > > >     b) Each PMD rx_pkt_burst() will have to call
> > > rte_eth_rx_epilog()
> > > > > function at return.
> > > > >          This  inline function will do all CB calls for that
> queue.
> > > > >
> > > > > To be more specific, let say we have some PMD: xyz with RX
> > > function:
> > > > >
> > > > > uint16_t
> > > > > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> uint16_t
> > > > > nb_pkts)
> > > > > {
> > > > >      struct xyz_rx_queue *rxq = rx_queue;
> > > > >      uint16_t nb_rx = 0;
> > > > >
> > > > >      /* do actual stuff here */
> > > > >     ....
> > > > >     return nb_rx;
> > > > > }
> > > > >
> > > > > It will be transformed to:
> > > > >
> > > > > uint16_t
> > > > > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct
> rte_mbuf
> > > > > **rx_pkts, uint16_t nb_pkts)
> > > > > {
> > > > >          struct xyz_rx_queue *rxq;
> > > > >          uint16_t nb_rx;
> > > > >
> > > > >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> > > > >          if (rxq == NULL)
> > > > >              return 0;
> > > > >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> > > > >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> > > > > nb_pkts);
> > > > > }
> > > > >
> > > > > And somewhere in ethdev_private.h:
> > > > >
> > > > > static inline void *
> > > > > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > > > > {
> > > > >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > >
> > > > > #ifdef RTE_ETHDEV_DEBUG_RX
> > > > >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> > > > >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> > > > >
> > > > >         if (queue_id >= dev->data->nb_rx_queues) {
> > > > >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> > > > > queue_id);
> > > > >                 return NULL;
> > > > >         }
> > > > > #endif
> > > > >   return dev->data->rx_queues[queue_id];
> > > > > }
> > > > >
> > > > > static inline uint16_t
> > > > > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct
> > > rte_mbuf
> > > > > **rx_pkts, const uint16_t nb_pkts);
> > > > > {
> > > > >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > >
> > > > > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> > > > >         struct rte_eth_rxtx_callback *cb;
> > > > >
> > > > >         /* __ATOMIC_RELEASE memory order was used when the
> > > > >          * call back was inserted into the list.
> > > > >          * Since there is a clear dependency between loading
> > > > >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory
> order is
> > > > >          * not required.
> > > > >          */
> > > > >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> > > > >                                 __ATOMIC_RELAXED);
> > > > >
> > > > >         if (unlikely(cb != NULL)) {
> > > > >                 do {
> > > > >                         nb_rx = cb->fn.rx(port_id, queue_id,
> > > rx_pkts,
> > > > > nb_rx,
> > > > >                                                 nb_pkts, cb-
> > > >param);
> > > > >                         cb = cb->next;
> > > > >                 } while (cb != NULL);
> > > > >         }
> > > > > #endif
> > > > >
> > > > >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void
> > > **)rx_pkts,
> > > > > nb_rx);
> > > > >         return nb_rx;
> > > > >  }
> > > >
> > > > That would make the compiler inline _rte_eth_rx_epilog() into the
> > > driver when compiling the DPDK library. But
> > > > RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application
> > > developer to use when compiling the DPDK application.
> > >
> > > I believe it is for both - user app and DPDK drivers.
> > > AFAIK, they both have to use the same rte_config.h, otherwise
> things
> > > will be broken.
> > > If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> > > user wouldn't be able to add a callback at first place.
> >
> > In the case of RTE_ETHDEV_RXTX_CALLBACKS, it is independent:
> 
> Not really.
> There are few libraries within DPDK that do rely on rx/tx callbacks:
> bpf, latencystat, pdump, power.

I do not consider these to be core libraries in DPDK. If these libraries are used in an application, that application must be compiled with RTE_ETHDEV_RXTX_CALLBACKS.

> With the approach above their functionality will be broken -
> setup functions will return success, but actual callbacks will not be
> invoked.

I just took a look at bpf and latencystat. Bpf correctly checks for the return code, and returns an error if ethdev has been compiled without RTE_ETHDEV_RXTX_CALLBACKS. Latencystat checks for the return code, but only logs the error and continues as if everything is good anyway. That is a bug in the latencystat library.

> From other side, some libraries do invoke rx/tx burst on their own: ip-
> pipeline, graph.
> For them callback invocation will continue to work, even when
> RTE_ETHDEV_RXTX_CALLBACKS is disabled in the app.
> In general, building DPDK libs and user app with different rte_config.h
> is really a bad idea.
> It might work in some cases, but I believe it is not supported and user
> should not rely on it.
> If user needs to disable RTE_ETHDEV_RXTX_CALLBACKS in his app, then the
> proper way would be:
> - update rte_config.h
> - rebuild both DPDK and the app with new config

In principle, I completely agree with your reasoning from a high level perspective.

However, accepting it would probably lead to the RTE_ETHDEV_RXTX_CALLBACKS compile-time configuration option being completely removed, and ethdev callbacks always being supported. And I don't think such a performance degradation of a core DPDK library should be accepted.

<rant on>
I was opposed to the "callback hooks" concept from the beginning, and still am. The path that packets take through various functions and pipeline stages should be determined and implemented by the application, not by the DPDK libraries.

If we want to provide a standardized advanced IP pipeline for DPDK, we could offer it as a middle-layer library using the underlying DPDK libraries to implement various callbacks, IP fragmentation reassembly, etc. Don't tweak the core libraries (costing memory and/or performance) to support an increasing number of supplemental libraries, which may not be used by all applications.

We don't want DPDK to become like the Linux IP stack, with callback hooks and runtime installable protocol handling everywhere. All this cruft with their small performance degradations adds up!
<rant off>

> 
> >
> > If it is not compiled with the DPDK library, attempts to install
> callbacks from the application will fail with ENOTSUP.
> >
> > If it is not compiled with the DPDK application, no time will be
> spent trying to determine if any there are any callbacks to call.
> >
> > > BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> > > internal for ethdev/PMD layer, which is a good thing from my
> > > perspective.
> >
> > If it can be done without degrading performance for applications not
> using callbacks.
Ananyev, Konstantin June 22, 2021, 12:13 p.m. UTC | #59
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > Konstantin
> >
> > >
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > > Konstantin
> > > >
> > > > >
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > > > > Konstantin
> > > > > >
> > > > > > > > How can we hide the callbacks since they are used by inline
> > > > burst
> > > > > > functions.
> > > > > > >
> > > > > > > I probably I owe a better explanation to what I meant in
> > first
> > > > mail.
> > > > > > > Otherwise it sounds confusing.
> > > > > > > I'll try to write a more detailed one in next few days.
> > > > > >
> > > > > > Actually I gave it another thought over weekend, and might be
> > we
> > > > can
> > > > > > hide rte_eth_dev_cb even in a simpler way. I'd use
> > eth_rx_burst()
> > > > as
> > > > > > an example, but the same principle applies to other 'fast'
> > > > functions.
> > > > > >
> > > > > >  1. Needed changes for PMDs rx_pkt_burst():
> > > > > >     a) change function prototype to accept 'uint16_t port_id'
> > and
> > > > > > 'uint16_t queue_id',
> > > > > >          instead of current 'void *'.
> > > > > >     b) Each PMD rx_pkt_burst() will have to call
> > > > rte_eth_rx_epilog()
> > > > > > function at return.
> > > > > >          This  inline function will do all CB calls for that
> > queue.
> > > > > >
> > > > > > To be more specific, let say we have some PMD: xyz with RX
> > > > function:
> > > > > >
> > > > > > uint16_t
> > > > > > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> > uint16_t
> > > > > > nb_pkts)
> > > > > > {
> > > > > >      struct xyz_rx_queue *rxq = rx_queue;
> > > > > >      uint16_t nb_rx = 0;
> > > > > >
> > > > > >      /* do actual stuff here */
> > > > > >     ....
> > > > > >     return nb_rx;
> > > > > > }
> > > > > >
> > > > > > It will be transformed to:
> > > > > >
> > > > > > uint16_t
> > > > > > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct
> > rte_mbuf
> > > > > > **rx_pkts, uint16_t nb_pkts)
> > > > > > {
> > > > > >          struct xyz_rx_queue *rxq;
> > > > > >          uint16_t nb_rx;
> > > > > >
> > > > > >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> > > > > >          if (rxq == NULL)
> > > > > >              return 0;
> > > > > >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts, nb_pkts);
> > > > > >          return _rte_eth_rx_epilog(port_id, queue_id, rx_pkts,
> > > > > > nb_pkts);
> > > > > > }
> > > > > >
> > > > > > And somewhere in ethdev_private.h:
> > > > > >
> > > > > > static inline void *
> > > > > > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > > > > > {
> > > > > >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > > >
> > > > > > #ifdef RTE_ETHDEV_DEBUG_RX
> > > > > >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> > > > > >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> > > > > >
> > > > > >         if (queue_id >= dev->data->nb_rx_queues) {
> > > > > >                 RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n",
> > > > > > queue_id);
> > > > > >                 return NULL;
> > > > > >         }
> > > > > > #endif
> > > > > >   return dev->data->rx_queues[queue_id];
> > > > > > }
> > > > > >
> > > > > > static inline uint16_t
> > > > > > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id, struct
> > > > rte_mbuf
> > > > > > **rx_pkts, const uint16_t nb_pkts);
> > > > > > {
> > > > > >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > > >
> > > > > > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> > > > > >         struct rte_eth_rxtx_callback *cb;
> > > > > >
> > > > > >         /* __ATOMIC_RELEASE memory order was used when the
> > > > > >          * call back was inserted into the list.
> > > > > >          * Since there is a clear dependency between loading
> > > > > >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory
> > order is
> > > > > >          * not required.
> > > > > >          */
> > > > > >         cb = __atomic_load_n(&dev->post_rx_burst_cbs[queue_id],
> > > > > >                                 __ATOMIC_RELAXED);
> > > > > >
> > > > > >         if (unlikely(cb != NULL)) {
> > > > > >                 do {
> > > > > >                         nb_rx = cb->fn.rx(port_id, queue_id,
> > > > rx_pkts,
> > > > > > nb_rx,
> > > > > >                                                 nb_pkts, cb-
> > > > >param);
> > > > > >                         cb = cb->next;
> > > > > >                 } while (cb != NULL);
> > > > > >         }
> > > > > > #endif
> > > > > >
> > > > > >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void
> > > > **)rx_pkts,
> > > > > > nb_rx);
> > > > > >         return nb_rx;
> > > > > >  }
> > > > >
> > > > > That would make the compiler inline _rte_eth_rx_epilog() into the
> > > > driver when compiling the DPDK library. But
> > > > > RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application
> > > > developer to use when compiling the DPDK application.
> > > >
> > > > I believe it is for both - user app and DPDK drivers.
> > > > AFAIK, they both have to use the same rte_config.h, otherwise
> > things
> > > > will be broken.
> > > > If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev, then
> > > > user wouldn't be able to add a callback at first place.
> > >
> > > In the case of RTE_ETHDEV_RXTX_CALLBACKS, it is independent:
> >
> > Not really.
> > There are few libraries within DPDK that do rely on rx/tx callbacks:
> > bpf, latencystat, pdump, power.
> 
> I do not consider these to be core libraries in DPDK. If these libraries are used in an application, that application must be compiled with
> RTE_ETHDEV_RXTX_CALLBACKS.
> 
> > With the approach above their functionality will be broken -
> > setup functions will return success, but actual callbacks will not be
> > invoked.
> 
> I just took a look at bpf and latencystat. Bpf correctly checks for the return code, and returns an error if ethdev has been compiled without
> RTE_ETHDEV_RXTX_CALLBACKS. Latencystat checks for the return code, but only logs the error and continues as if everything is good
> anyway. That is a bug in the latencystat library.

If RTE_ETHDEV_RXTX_CALLBACKS is enabled or disabled for both DPDK and the user app - everything will work as expected.
But, as I understand, you are considering an approach where RTE_ETHDEV_RXTX_CALLBACKS is enabled in DPDK, but disabled in the app.
Such an approach will cause problems with some libraries - as I outlined above.

> 
> > From other side, some libraries do invoke rx/tx burst on their own: ip-
> > pipeline, graph.
> > For them callback invocation will continue to work, even when
> > RTE_ETHDEV_RXTX_CALLBACKS is disabled in the app.
> > In general, building DPDK libs and user app with different rte_config.h
> > is really a bad idea.
> > It might work in some cases, but I believe it is not supported and user
> > should not rely on it.
> > If user needs to disable RTE_ETHDEV_RXTX_CALLBACKS in his app, then the
> > proper way would be:
> > - update rte_config.h
> > - rebuild both DPDK and the app with new config
> 
> In principle, I completely agree with your reasoning from a high level perspective.
> 
> However, accepting it would probably lead to the RTE_ETHDEV_RXTX_CALLBACKS compile time configuration option being completely
> removed, 

For now, we are not talking about removing or even deprecating RTE_ETHDEV_RXTX_CALLBACKS.
What I am talking about is that the user has to use it (and other rte_config.h options) properly.
He can't use different configs for the DPDK libs and the app and expect things to 'just work'.
This is not supported right now, and I think it never will be.
If it works right now, that is just an implementation detail, which users should not rely on.

> and ethdev callbacks being always supported. And I don't think such a performance degradation of a core DPDK library should be
> accepted.

As I said above, RTE_ETHDEV_RXTX_CALLBACKS is still there.
If it is really critical for your app to disable it - you can still do it, you just need to do it the proper way.

> <rant on>
> I was opposed to the "callback hooks" concept from the beginning, and still am. The path that packets take through various functions and
> pipeline stages should be determined and implemented by the application, not by the DPDK libraries.
> 
> If we want to provide a standardized advanced IP pipeline for DPDK, we could offer it as a middle layer library using the underlying DPDK
> libraries to implement various callbacks, IP fragmentation reassembly, etc.. Don't tweak the core libraries (costing memory and/or
> performance) to support an increasing amount of supplemental libraries, which may not be used by all applications.
> 
> We don't want DPDK to become like the Linux IP stack, with callback hooks and runtime installable protocol handling everywhere. All this
> cruft with their small performance degradations adds up!
> <rant off>
> 
> >
> > >
> > > If it is not compiled with the DPDK library, attempts to install
> > callbacks from the application will fail with ENOTSUP.
> > >
> > > If it is not compiled with the DPDK application, no time will be
> > spent trying to determine if any there are any callbacks to call.
> > >
> > > > BTW,  such change will allow us to make RTE_ETHDEV_RXTX_CALLBACKS
> > > > internal for ethdev/PMD layer, which is a good thing from my
> > > > perspective.
> > >
> > > If it can be done without degrading performance for applications not
> > using callbacks.
Morten Brørup June 22, 2021, 1:18 p.m. UTC | #60
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> 
> >
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > Konstantin
> > >
> > > >
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> > > > > Konstantin
> > > > >
> > > > > >
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of
> Ananyev,
> > > > > > > Konstantin
> > > > > > >
> > > > > > > > > How can we hide the callbacks since they are used by
> inline
> > > > > burst
> > > > > > > functions.
> > > > > > > >
> > > > > > > > I probably I owe a better explanation to what I meant in
> > > first
> > > > > mail.
> > > > > > > > Otherwise it sounds confusing.
> > > > > > > > I'll try to write a more detailed one in next few days.
> > > > > > >
> > > > > > > Actually I gave it another thought over weekend, and might
> be
> > > we
> > > > > can
> > > > > > > hide rte_eth_dev_cb even in a simpler way. I'd use
> > > eth_rx_burst()
> > > > > as
> > > > > > > an example, but the same principle applies to other 'fast'
> > > > > functions.
> > > > > > >
> > > > > > >  1. Needed changes for PMDs rx_pkt_burst():
> > > > > > >     a) change function prototype to accept 'uint16_t
> port_id'
> > > and
> > > > > > > 'uint16_t queue_id',
> > > > > > >          instead of current 'void *'.
> > > > > > >     b) Each PMD rx_pkt_burst() will have to call
> > > > > rte_eth_rx_epilog()
> > > > > > > function at return.
> > > > > > >          This  inline function will do all CB calls for
> that
> > > queue.
> > > > > > >
> > > > > > > To be more specific, let say we have some PMD: xyz with RX
> > > > > function:
> > > > > > >
> > > > > > > uint16_t
> > > > > > > xyz_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> > > uint16_t
> > > > > > > nb_pkts)
> > > > > > > {
> > > > > > >      struct xyz_rx_queue *rxq = rx_queue;
> > > > > > >      uint16_t nb_rx = 0;
> > > > > > >
> > > > > > >      /* do actual stuff here */
> > > > > > >     ....
> > > > > > >     return nb_rx;
> > > > > > > }
> > > > > > >
> > > > > > > It will be transformed to:
> > > > > > >
> > > > > > > uint16_t
> > > > > > > xyz_recv_pkts(uint16_t port_id, uint16_t queue_id, struct
> > > rte_mbuf
> > > > > > > **rx_pkts, uint16_t nb_pkts)
> > > > > > > {
> > > > > > >          struct xyz_rx_queue *rxq;
> > > > > > >          uint16_t nb_rx;
> > > > > > >
> > > > > > >          rxq = _rte_eth_rx_prolog(port_id, queue_id);
> > > > > > >          if (rxq == NULL)
> > > > > > >              return 0;
> > > > > > >          nb_rx = _xyz_real_recv_pkts(rxq, rx_pkts,
> nb_pkts);
> > > > > > >          return _rte_eth_rx_epilog(port_id, queue_id,
> rx_pkts,
> > > > > > > nb_pkts);
> > > > > > > }
> > > > > > >
> > > > > > > And somewhere in ethdev_private.h:
> > > > > > >
> > > > > > > static inline void *
> > > > > > > _rte_eth_rx_prolog(uint16_t port_id, uint16_t queue_id);
> > > > > > > {
> > > > > > >    struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > > > >
> > > > > > > #ifdef RTE_ETHDEV_DEBUG_RX
> > > > > > >         RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, NULL);
> > > > > > >         RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, NULL);
> > > > > > >
> > > > > > >         if (queue_id >= dev->data->nb_rx_queues) {
> > > > > > >                 RTE_ETHDEV_LOG(ERR, "Invalid RX
> queue_id=%u\n",
> > > > > > > queue_id);
> > > > > > >                 return NULL;
> > > > > > >         }
> > > > > > > #endif
> > > > > > >   return dev->data->rx_queues[queue_id];
> > > > > > > }
> > > > > > >
> > > > > > > static inline uint16_t
> > > > > > > _rte_eth_rx_epilog(uint16_t port_id, uint16_t queue_id,
> struct
> > > > > rte_mbuf
> > > > > > > **rx_pkts, const uint16_t nb_pkts);
> > > > > > > {
> > > > > > >     struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > > > > > >
> > > > > > > #ifdef RTE_ETHDEV_RXTX_CALLBACKS
> > > > > > >         struct rte_eth_rxtx_callback *cb;
> > > > > > >
> > > > > > >         /* __ATOMIC_RELEASE memory order was used when the
> > > > > > >          * call back was inserted into the list.
> > > > > > >          * Since there is a clear dependency between
> loading
> > > > > > >          * cb and cb->fn/cb->next, __ATOMIC_ACQUIRE memory
> > > order is
> > > > > > >          * not required.
> > > > > > >          */
> > > > > > >         cb = __atomic_load_n(&dev-
> >post_rx_burst_cbs[queue_id],
> > > > > > >                                 __ATOMIC_RELAXED);
> > > > > > >
> > > > > > >         if (unlikely(cb != NULL)) {
> > > > > > >                 do {
> > > > > > >                         nb_rx = cb->fn.rx(port_id,
> queue_id,
> > > > > rx_pkts,
> > > > > > > nb_rx,
> > > > > > >                                                 nb_pkts,
> cb-
> > > > > >param);
> > > > > > >                         cb = cb->next;
> > > > > > >                 } while (cb != NULL);
> > > > > > >         }
> > > > > > > #endif
> > > > > > >
> > > > > > >         rte_ethdev_trace_rx_burst(port_id, queue_id, (void
> > > > > **)rx_pkts,
> > > > > > > nb_rx);
> > > > > > >         return nb_rx;
> > > > > > >  }
> > > > > >
> > > > > > That would make the compiler inline _rte_eth_rx_epilog() into
> the
> > > > > driver when compiling the DPDK library. But
> > > > > > RTE_ETHDEV_RXTX_CALLBACKS is a definition for the application
> > > > > developer to use when compiling the DPDK application.
> > > > >
> > > > > I believe it is for both - user app and DPDK drivers.
> > > > > AFAIK, they both have to use the same rte_config.h, otherwise
> > > things
> > > > > will be broken.
> > > > > If let say RTE_ETHDEV_RXTX_CALLBACKS is not enabled in ethdev,
> then
> > > > > user wouldn't be able to add a callback at first place.
> > > >
> > > > In the case of RTE_ETHDEV_RXTX_CALLBACKS, it is independent:
> > >
> > > Not really.
> > > There are few libraries within DPDK that do rely on rx/tx
> callbacks:
> > > bpf, latencystat, pdump, power.
> >
> > I do not consider these to be core libraries in DPDK. If these
> libraries are used in an application, that application must be compiled
> with
> > RTE_ETHDEV_RXTX_CALLBACKS.
> >
> > > With the approach above their functionality will be broken -
> > > setup functions will return success, but actual callbacks will not
> be
> > > invoked.
> >
> > I just took a look at bpf and latencystat. Bpf correctly checks for
> the return code, and returns an error if ethdev has been compiled
> without
> > RTE_ETHDEV_RXTX_CALLBACKS. Latencystat checks for the return code,
> but only logs the error and continues as if everything is good
> > anyway. That is a bug in the latencystat library.
> 
> If RTE_ETHDEV_RXTX_CALLBACKS Is enabled or disabled for both DPDK and
> user app - everything will work as expected.
> But, as I understand, you consider approach when
> RTE_ETHDEV_RXTX_CALLBACKS Is enabled in the DPDK, but disabled in the
> app.
> Such approach will cause a problems with some  libraries - as I
> outlined above.
> 
> >
> > > From other side, some libraries do invoke rx/tx burst on their own:
> ip-
> > > pipeline, graph.
> > > For them callback invocation will continue to work, even when
> > > RTE_ETHDEV_RXTX_CALLBACKS is disabled in the app.
> > > In general, building DPDK libs and user app with different
> rte_config.h
> > > is really a bad idea.
> > > It might work in some cases, but I believe it is not supported and
> user
> > > should not rely on it.
> > > If user needs to disable RTE_ETHDEV_RXTX_CALLBACKS in his app, then
> the
> > > proper way would be:
> > > - update rte_config.h
> > > - rebuild both DPDK and the app with new config
> >
> > In principle, I completely agree with your reasoning from a high
> level perspective.
> >
> > However, accepting it would probably lead to the
> RTE_ETHDEV_RXTX_CALLBACKS compile time configuration option being
> completely
> > removed,
> 
> For now, we are not talking about removing or even deprecating
> RTE_ETHDEV_RXTX_CALLBACKS.
> What I am talking about - user has to use it (and other rte_config.h
> options) properly.
> He can't use different configs for DPDK libs and app and expect things
> 'just work'.
> This is not supported right now, I think it will never be.
> If it works right now, this is just implementation specifics, which
> user should not rely on.

I agree.

> 
> > and ethdev callbacks being always supported. And I don't think such a
> performance degradation of a core DPDK library should be
> > accepted.
> 
> As I said above, RTE_ETHDEV_RXTX_CALLBACKS Is still there.
> If it is really critical for your app to disable it - you can still do
> it, you just need to do it in a proper way.

I hope so. This is exactly what I am pleading for: Keep the ability to disable RTE_ETHDEV_RXTX_CALLBACKS at compile time, so there is no performance impact for applications not using it.

I also agree with the limitation that both the library and the application should be compiled with the same configuration.

> 
> > <rant on>
> > I was opposed to the "callback hooks" concept from the beginning, and
> still am. The path that packets take through various functions and
> > pipeline stages should be determined and implemented by the
> application, not by the DPDK libraries.
> >
> > If we want to provide a standardized advanced IP pipeline for DPDK,
> we could offer it as a middle layer library using the underlying DPDK
> > libraries to implement various callbacks, IP fragmentation
> reassembly, etc.. Don't tweak the core libraries (costing memory and/or
> > performance) to support an increasing amount of supplemental
> libraries, which may not be used by all applications.
> >
> > We don't want DPDK to become like the Linux IP stack, with callback
> hooks and runtime installable protocol handling everywhere. All this
> > cruft with their small performance degradations adds up!
> > <rant off>
> >
> > >
> > > >
> > > > If it is not compiled with the DPDK library, attempts to install
> > > callbacks from the application will fail with ENOTSUP.
> > > >
> > > > If it is not compiled with the DPDK application, no time will be
> > > spent trying to determine if any there are any callbacks to call.
> > > >
> > > > > BTW,  such change will allow us to make
> RTE_ETHDEV_RXTX_CALLBACKS
> > > > > internal for ethdev/PMD layer, which is a good thing from my
> > > > > perspective.
> > > >
> > > > If it can be done without degrading performance for applications
> not
> > > using callbacks.
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 5877a16971..bdeae96e57 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -167,6 +167,7 @@  F: app/test/test_errno.c
 F: app/test/test_lcores.c
 F: app/test/test_logs.c
 F: app/test/test_memcpy*
+F: app/test/test_parray.c
 F: app/test/test_per_lcore.c
 F: app/test/test_pflock.c
 F: app/test/test_prefetch.c
diff --git a/app/test/meson.build b/app/test/meson.build
index 08c82d3d23..23dc672c0f 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -90,6 +90,7 @@  test_sources = files(
         'test_metrics.c',
         'test_mcslock.c',
         'test_mp_secondary.c',
+        'test_parray.c',
         'test_per_lcore.c',
         'test_pflock.c',
         'test_pmd_perf.c',
@@ -230,6 +231,7 @@  fast_tests = [
         ['memzone_autotest', false],
         ['meter_autotest', true],
         ['multiprocess_autotest', false],
+        ['parray_autotest', true],
         ['per_lcore_autotest', true],
         ['pflock_autotest', true],
         ['prefetch_autotest', true],
diff --git a/app/test/test_parray.c b/app/test/test_parray.c
new file mode 100644
index 0000000000..f92783d9e8
--- /dev/null
+++ b/app/test/test_parray.c
@@ -0,0 +1,120 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdbool.h>
+#include <sys/resource.h>
+
+#include <rte_parray.h>
+#include <rte_lcore.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "test.h"
+
+RTE_LOG_REGISTER(test_parray_log, test.parray, INFO);
+#define LOG(level, ...) \
+	rte_log(RTE_LOG_ ## level, test_parray_log, RTE_FMT("parray test: " \
+		RTE_FMT_HEAD(__VA_ARGS__,) "\n", RTE_FMT_TAIL(__VA_ARGS__,)))
+
+static volatile bool stop;
+
+static struct rte_parray array = RTE_PARRAY_INITIALIZER;
+typedef int elem_t; /* array of int pointers */
+
+static elem_t trash;
+
+static long
+get_context_switches(void)
+{
+	struct rusage thread_info;
+	long context_switches;
+
+	getrusage(RUSAGE_THREAD, &thread_info);
+	context_switches = thread_info.ru_nivcsw;
+	LOG(DEBUG, "%ld involuntary context switches on lcore %u",
+			context_switches, rte_lcore_id());
+
+	return context_switches;
+}
+
+static int
+reader(void *userdata __rte_unused)
+{
+	LOG(DEBUG, "%s on lcore %u", __func__, rte_lcore_id());
+	while (!stop) {
+		int32_t index;
+
+		RTE_PARRAY_FOREACH(&array, index)
+			trash = *RTE_PARRAY_P(elem_t, &array, index);
+	}
+	return 0;
+}
+
+static int
+test_parray(void)
+{
+	int iter;
+	int32_t index;
+	long context_switches;
+
+	stop = false;
+	rte_eal_mp_remote_launch(reader, NULL, SKIP_MAIN);
+	LOG(DEBUG, "writer on lcore %u", rte_lcore_id());
+
+	rte_parray_find_next(NULL, 0);
+	TEST_ASSERT_FAIL(rte_errno, "find from NULL did not fail");
+	rte_parray_find_next(&array, -1);
+	TEST_ASSERT_FAIL(rte_errno, "find from -1 did not fail");
+	rte_parray_find_next(&array, 0);
+	TEST_ASSERT_SUCCESS(rte_errno, "find from empty failed");
+
+	rte_parray_free(NULL, 0);
+	TEST_ASSERT_FAIL(rte_errno, "free in NULL did not fail");
+	rte_parray_free(&array, 0);
+	TEST_ASSERT_FAIL(rte_errno, "free out of range did not fail");
+
+	rte_parray_alloc(NULL, 0);
+	TEST_ASSERT_FAIL(rte_errno, "alloc in NULL did not fail");
+	for (iter = 0; iter < 127; iter++) {
+		index = rte_parray_alloc(&array, sizeof(elem_t));
+		TEST_ASSERT_SUCCESS(rte_errno, "alloc returned an error");
+		TEST_ASSERT(index >= 0, "alloc returned a negative index");
+		TEST_ASSERT_EQUAL(index, iter, "alloc returned wrong index");
+	}
+
+	rte_parray_free(&array, 0);
+	TEST_ASSERT_SUCCESS(rte_errno, "free returned an error");
+	rte_parray_free(&array, 0);
+	TEST_ASSERT_SUCCESS(rte_errno, "double free returned an error");
+
+	/* alloc should increase index if possible */
+	index = rte_parray_alloc(&array, sizeof(elem_t));
+	TEST_ASSERT_SUCCESS(rte_errno, "alloc after free returned an error");
+	TEST_ASSERT_EQUAL(index, 127, "alloc after free returned wrong index");
+	/* size should be 128, almost full, forcing next element to be 0 */
+	index = rte_parray_alloc(&array, sizeof(elem_t));
+	TEST_ASSERT_SUCCESS(rte_errno, "alloc freed 0 returned an error");
+	TEST_ASSERT_EQUAL(index, 0, "alloc freed 0 returned wrong index");
+
+	/* try to provoke more races with the readers */
+	context_switches = get_context_switches();
+	for (iter = 0; iter < 99; iter++) {
+		for (index = 0; index < 9999; index++) {
+			rte_parray_alloc(&array, sizeof(elem_t));
+			TEST_ASSERT_SUCCESS(rte_errno, "alloc returned an error");
+		}
+		if (get_context_switches() > context_switches + 9)
+			break;
+	}
+
+	stop = true;
+	rte_eal_mp_wait_lcore();
+
+	rte_parray_free_all(&array);
+	TEST_ASSERT_SUCCESS(rte_errno, "free all returned an error");
+
+	return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(parray_autotest, test_parray);
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index edfca77779..d44f325ad5 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -77,6 +77,7 @@  sources += files(
         'malloc_mp.c',
         'rte_keepalive.c',
         'rte_malloc.c',
+        'rte_parray.c',
         'rte_random.c',
         'rte_reciprocal.c',
         'rte_service.c',
diff --git a/lib/eal/common/rte_parray.c b/lib/eal/common/rte_parray.c
new file mode 100644
index 0000000000..5fac341773
--- /dev/null
+++ b/lib/eal/common/rte_parray.c
@@ -0,0 +1,161 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_atomic.h>
+#include <rte_errno.h>
+
+#include "rte_parray.h"
+
+#define PARRAY_DEFAULT_SIZE 32
+
+int32_t
+rte_parray_find_next(struct rte_parray *obj, int32_t index)
+{
+	if (obj == NULL || index < 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	pthread_mutex_lock(&obj->mutex);
+
+	while (index < obj->size && obj->array[index] == NULL)
+		index++;
+	if (index >= obj->size)
+		index = -1;
+
+	pthread_mutex_unlock(&obj->mutex);
+
+	rte_errno = 0;
+	return index;
+}
+
+static int32_t
+parray_find_next_free(const struct rte_parray *obj, int32_t index)
+{
+	while (index < obj->size && obj->array[index] != NULL)
+		index++;
+	if (index >= obj->size)
+		return -1;
+	return index;
+}
+
+static int
+parray_resize(struct rte_parray *obj)
+{
+	void **new_array;
+	int32_t new_size;
+	int32_t index;
+
+	if (unlikely(obj->size > INT32_MAX / 2))
+		return -1;
+
+	/* allocate a new array with bigger size */
+	new_size = RTE_MAX(PARRAY_DEFAULT_SIZE, obj->size * 2);
+	new_array = malloc(sizeof(void *) * new_size);
+	if (new_array == NULL)
+		return -1;
+
+	/* free array of a previous resize */
+	free(obj->old_array);
+	/* save current array for freeing on next resize */
+	obj->old_array = obj->array;
+
+	/* copy current array in the new one */
+	for (index = 0; index < obj->size; index++)
+		new_array[index] = obj->old_array[index];
+	/* initialize expanded part */
+	memset(new_array + index, 0, sizeof(void *) * (new_size - index));
+
+	/*
+	 * Array readers have no guard/barrier/lock synchronization protection,
+	 * that's why the ordering for array replacement is critical.
+	 */
+	/* new array must be initialized before replacing old array */
+	rte_atomic_thread_fence(__ATOMIC_RELEASE);
+	obj->array = new_array;
+	/* array must be replaced before updating the size */
+	rte_atomic_thread_fence(__ATOMIC_RELEASE);
+	obj->size = new_size;
+
+	return 0;
+}
+
+int32_t
+rte_parray_alloc(struct rte_parray *obj, size_t elem_size)
+{
+	int32_t index;
+	void *elem;
+
+	if (obj == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	pthread_mutex_lock(&obj->mutex);
+
+	if (obj->count == obj->size && parray_resize(obj) != 0) {
+		pthread_mutex_unlock(&obj->mutex);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+
+	elem = malloc(elem_size);
+	if (elem == NULL) {
+		pthread_mutex_unlock(&obj->mutex);
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+
+	index = parray_find_next_free(obj, obj->last + 1);
+	if (index < 0)
+		index = parray_find_next_free(obj, 0);
+
+	obj->array[index] = elem;
+	obj->count++;
+	obj->last = index;
+
+	pthread_mutex_unlock(&obj->mutex);
+
+	rte_errno = 0;
+	return index;
+}
+
+void
+rte_parray_free(struct rte_parray *obj, int32_t index)
+{
+	if (obj == NULL || index < 0 || index >= obj->size) {
+		rte_errno = EINVAL;
+		return;
+	}
+
+	pthread_mutex_lock(&obj->mutex);
+
+	if (obj->array[index] != NULL) {
+		free(obj->array[index]);
+		obj->array[index] = NULL;
+		obj->count--;
+	}
+
+	pthread_mutex_unlock(&obj->mutex);
+
+	rte_errno = 0;
+}
+
+void
+rte_parray_free_all(struct rte_parray *obj)
+{
+	int32_t index;
+	int first_errno = 0;
+
+	RTE_PARRAY_FOREACH(obj, index) {
+		rte_parray_free(obj, index);
+		if (rte_errno != 0 && first_errno == 0)
+			first_errno = rte_errno;
+	}
+	rte_errno = first_errno;
+}
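The trickiest part of `rte_parray.c` above is `parray_resize()`: the previous array is deliberately leaked for one generation (kept in `old_array`) so a lock-less reader still holding the old pointer does not crash, and is only freed on the *next* resize. A hypothetical standalone model of that scheme (names such as `parray_model_resize` are invented for illustration; this is not the DPDK code, and the release fences of the real implementation are only noted in comments) can sketch the idea:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Minimal model of the parray resize scheme: keep the previous array
 * alive for one resize generation so pending lock-less reads stay valid. */
struct parray_model {
	void **array;      /* current array, dereferenced by readers */
	void **old_array;  /* previous array, freed on the next resize */
	int32_t size;
};

#define PARRAY_MODEL_DEFAULT_SIZE 32

static int
parray_model_resize(struct parray_model *obj)
{
	int32_t new_size = obj->size == 0 ?
			PARRAY_MODEL_DEFAULT_SIZE : obj->size * 2;
	void **new_array = malloc(sizeof(void *) * new_size);

	if (new_array == NULL)
		return -1;
	/* copy the live part, zero the expanded part */
	if (obj->size > 0)
		memcpy(new_array, obj->array, sizeof(void *) * obj->size);
	memset(new_array + obj->size, 0,
			sizeof(void *) * (new_size - obj->size));
	/* the array from two resizes ago can no longer be referenced */
	free(obj->old_array);
	obj->old_array = obj->array;
	/* publish: the real code inserts release fences so that the new
	 * array is fully initialized before readers can observe it, and
	 * the pointer is replaced before the size grows */
	obj->array = new_array;
	obj->size = new_size;
	return 0;
}
```

A reader that loaded `obj->array` just before a resize keeps dereferencing a valid (if stale) array until the writer's next resize, which is why no read-side lock is needed.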
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index 88a9eba12f..7e563b004d 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -30,6 +30,7 @@  headers += files(
         'rte_malloc.h',
         'rte_memory.h',
         'rte_memzone.h',
+        'rte_parray.h',
         'rte_pci_dev_feature_defs.h',
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
diff --git a/lib/eal/include/rte_parray.h b/lib/eal/include/rte_parray.h
new file mode 100644
index 0000000000..b7637d03ef
--- /dev/null
+++ b/lib/eal/include/rte_parray.h
@@ -0,0 +1,138 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#ifndef RTE_PARRAY_H
+#define RTE_PARRAY_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <pthread.h>
+
+#include <rte_compat.h>
+
+/**
+ * @file
+ * Object containing a resizable array of pointers.
+ *
+ * The write operations (alloc/free) are protected by mutex.
+ * The read operation (dereference) is considered a fast path
+ * and is not directly protected.
+ *
+ * On resize, the array n-1 is kept to allow pending reads.
+ * After 2 resizes, the array n-2 is freed.
+ *
+ * Iterating (rte_parray_find_next) is safe during alloc/free.
+ *
+ * Freeing must be synchronized with readers:
+ * an element must not be accessed if it is being freed.
+ *
+ * @warning
+ * Because of above limitations, this API is for internal use.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/** Main object representing a dynamic array of pointers. */
+struct rte_parray {
+	/** Array of pointer to dynamically allocated struct. */
+	void **array;
+	/** Old array before resize, freed on next resize. */
+	void **old_array;
+	/** Lock for alloc/free operations. */
+	pthread_mutex_t mutex;
+	/** Current size of the full array. */
+	int32_t size;
+	/** Number of allocated elements. */
+	int32_t count;
+	/** Last allocated element. */
+	int32_t last;
+};
+
+/** Static initializer to assign. */
+#define RTE_PARRAY_INITIALIZER {NULL, NULL, PTHREAD_MUTEX_INITIALIZER, 0, 0, -1}
+
+/** Helper for access to the typed pointer of the element at index. */
+#define RTE_PARRAY_P(type, obj, index) ((type *)(obj)->array[index])
+
+/** Loop helper to iterate all elements. */
+#define RTE_PARRAY_FOREACH(obj, index) for ( \
+	index = rte_parray_find_next(obj, 0); \
+	index >= 0; \
+	index = rte_parray_find_next(obj, index + 1))
+
+/**
+ * @warning
+ * This internal API may change without prior notice.
+ *
+ * Get the next pointer in the array.
+ *
+ * @param obj
+ *   Pointer to the main object.
+ * @param index
+ *   The initial index from which to start the search.
+ *
+ * @return
+ *   Index of the next allocated element,
+ *   -1 if there is none.
+ *   rte_errno is set to EINVAL if parameters are NULL or negative.
+ */
+__rte_internal
+int32_t rte_parray_find_next(struct rte_parray *obj, int32_t index);
+
+/**
+ * @warning
+ * This internal API may change without prior notice.
+ *
+ * Allocate an element and insert it into the array.
+ *
+ * @param obj
+ *   Pointer to the main object.
+ * @param elem_size
+ *   Number of bytes to allocate for the element.
+ *   Do nothing if requesting 0.
+ *
+ * @return
+ *   An index in the array, otherwise the negative rte_errno:
+ *   - EINVAL if array is NULL
+ *   - ENOMEM if out of space
+ */
+__rte_internal
+int32_t rte_parray_alloc(struct rte_parray *obj, size_t elem_size);
+
+/**
+ * @warning
+ * This internal API may change without prior notice.
+ *
+ * Free an element and remove it from the array.
+ *
+ * @param obj
+ *   Pointer to the main object.
+ * @param index
+ *   Index of the element to be freed.
+ *   Do nothing if not a valid element.
+ *
+ * rte_errno is set to EINVAL if a parameter is out of range.
+ */
+__rte_internal
+void rte_parray_free(struct rte_parray *obj, int32_t index);
+
+/**
+ * @warning
+ * This internal API may change without prior notice.
+ *
+ * Free all elements of an array.
+ *
+ * @param obj
+ *   Pointer to the main object.
+ */
+__rte_internal
+void rte_parray_free_all(struct rte_parray *obj);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_PARRAY_H */
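The allocation policy documented above ("alloc should increase index if possible", as the unit test puts it) comes from searching for a free slot starting after the last allocated index and wrapping to 0 only when the tail is exhausted. A hypothetical standalone model of that search (functions `model_find_next_free` and `model_pick_slot` are invented names mirroring the static helpers and `rte_parray_alloc()`; this is not the DPDK code):

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the free-slot search: a slot is free when its pointer is NULL. */
static int32_t
model_find_next_free(void *const *array, int32_t size, int32_t index)
{
	while (index < size && array[index] != NULL)
		index++;
	return index >= size ? -1 : index;
}

/* Model of the index choice in rte_parray_alloc(): search after the last
 * allocated index first, wrap to 0 otherwise; -1 means the array is full. */
static int32_t
model_pick_slot(void *const *array, int32_t size, int32_t last)
{
	int32_t index = model_find_next_free(array, size, last + 1);

	if (index < 0)
		index = model_find_next_free(array, size, 0);
	return index;
}
```

This is why indices keep increasing while space remains (good for callers that want stable, mostly monotonic handles) and why the real code resizes *before* searching, so the search can only fail when wrap-around also finds nothing.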
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..5fd13884b1 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -432,4 +432,8 @@  INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_parray_alloc;
+	rte_parray_find_next;
+	rte_parray_free;
+	rte_parray_free_all;
 };