[v6,1/9] eal: generic 64 bit counter

Message ID 20240517001302.65514-2-stephen@networkplumber.org (mailing list archive)
State Superseded
Delegated to: Thomas Monjalon
Headers
Series Generic 64 bit counters for SW drivers

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Stephen Hemminger May 17, 2024, 12:12 a.m. UTC
  This header implements 64 bit counters that are NOT atomic
but are safe against load/store splits on 32 bit platforms.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/include/meson.build   |  1 +
 lib/eal/include/rte_counter.h | 98 +++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 lib/eal/include/rte_counter.h
  

Comments

Honnappa Nagarahalli May 17, 2024, 2:45 a.m. UTC | #1
> On May 16, 2024, at 7:12 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> This header implements 64 bit counters that are NOT atomic
> but are safe against load/store splits on 32 bit platforms.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> lib/eal/include/meson.build   |  1 +
> lib/eal/include/rte_counter.h | 98 +++++++++++++++++++++++++++++++++++
> 2 files changed, 99 insertions(+)
> create mode 100644 lib/eal/include/rte_counter.h
> 
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
>         'rte_class.h',
>         'rte_common.h',
>         'rte_compat.h',
> +        'rte_counter.h',
>         'rte_debug.h',
>         'rte_dev.h',
>         'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 0000000000..d623195d63
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,98 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger <stephen@networkplumber.org>
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is a 64 bit value that is safe from split read/write
> + * on 32 bit platforms. It assumes that only one cpu at a time
If we are defining the counter in this manner, then implementation cannot be generic. I think architectures will have constraints if they have to ensure the 64b variables are not split.

I think we at least need the counter to be aligned on 8B boundary to have generic code.

> + * will update the counter, and another CPU may want to read it.
> + *
> + * This is a much weaker guarantee than full atomic variables
> + * but is faster since no locked operations are required for update.
> + */
> +
> +#ifdef RTE_ARCH_64
> +/*
> + * On a platform that supports a native 64 bit type, no special handling
> + * is needed. These are just wrappers around a 64 bit value.
> + */
> +typedef uint64_t rte_counter64_t;
> +
> +/**
> + * Add value to counter.
> + */
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> + *counter += val;
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> + return *counter;
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> + *counter = val;
> +}
> +
> +#else
> +
> +#include <rte_stdatomic.h>
> +
> +/*
> + * On a 32 bit platform, atomics are needed to force the compiler not to
> + * split 64 bit reads/writes.
> + */
> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> + rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_fetch(const rte_counter64_t *counter)
> +{
> + return rte_atomic_load_explicit(counter, rte_memory_order_consume);
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_set(rte_counter64_t *counter, uint64_t val)
> +{
> + rte_atomic_store_explicit(counter, val, rte_memory_order_release);
> +}
> +#endif
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> + rte_counter64_set(counter, 0);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_COUNTER_H_ */
> -- 
> 2.43.0
>
  
Stephen Hemminger May 17, 2024, 3:30 a.m. UTC | #2
On Fri, 17 May 2024 02:45:12 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > + * A counter is 64 bit value that is safe from split read/write
> > + * on 32 bit platforms. It assumes that only one cpu at a time  
> If we are defining the counter in this manner, then implementation cannot be generic. I think architectures will have constraints if they have to ensure the 64b variables are not split.
> 
> I think we at least need the counter to be aligned on 8B boundary to have generic code.

The C standard has always guaranteed that reads and writes to an unsigned long will not be split.
Therefore, if the arch is 64 bit native, there is no need for atomics.
  
Honnappa Nagarahalli May 17, 2024, 4:26 a.m. UTC | #3
+ Bruce for feedback on x86 architecture 

> On May 16, 2024, at 10:30 PM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> On Fri, 17 May 2024 02:45:12 +0000
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> 
>>> + * A counter is 64 bit value that is safe from split read/write
>>> + * on 32 bit platforms. It assumes that only one cpu at a time  
>> If we are defining the counter in this manner, then implementation cannot be generic. I think architectures will have constraints if they have to ensure the 64b variables are not split.
>> 
>> I think we at least need the counter to be aligned on 8B boundary to have generic code.
> 
> The C standard has always guaranteed that read and write to unsigned log will not be split.
As I understand it, this is true only if the variable is an atomic variable. If it were true for plain variables too, there would be no difference between atomic and non-atomic variables.

> Therefore if arch is 64 bit native there is no need for atomics
At least on the ARM architecture, if the variable is not aligned on an 8B boundary, loads and stores are not atomic. I am sure it is the same on other architectures.
Bruce, any comments for x86?
  
Morten Brørup May 17, 2024, 6:44 a.m. UTC | #4
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 17 May 2024 06.27
> 
> + Bruce for feedback on x86 architecture
> 
> > On May 16, 2024, at 10:30 PM, Stephen Hemminger <stephen@networkplumber.org>
> wrote:
> >
> > On Fri, 17 May 2024 02:45:12 +0000
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> >
> >>> + * A counter is 64 bit value that is safe from split read/write
> >>> + * on 32 bit platforms. It assumes that only one cpu at a time
> >> If we are defining the counter in this manner, then implementation cannot
> be generic. I think architectures will have constraints if they have to ensure
> the 64b variables are not split.
> >>
> >> I think we at least need the counter to be aligned on 8B boundary to have
> generic code.
> >
> > The C standard has always guaranteed that read and write to unsigned log
> will not be split.
> As I understand, this is true only if the variable is an atomic variable. If
> not, there is no difference between atomic variables and non-atomic variables.
> 
> > Therefore if arch is 64 bit native there is no need for atomics
> At least on ARM architecture, if the variable is not aligned on 8B boundary,
> the load or store are not atomic. I am sure it is the same on other
> architectures.

I guess it depends on the architecture's natural alignment size and the compiler - especially on 32 bit architectures, where the natural alignment size is 4 bytes.

We could play it safe and add alignment to the counter type:

#include <stdalign.h>
#ifdef RTE_ARCH_64
#if alignof(uint64_t) < sizeof(uint64_t)
typedef alignas(8) uint64_t rte_counter64_t;
#else
typedef uint64_t rte_counter64_t;
#endif
#else
#if alignof(RTE_ATOMIC(uint64_t)) < sizeof(uint64_t)
typedef alignas(8) RTE_ATOMIC(uint64_t) rte_counter64_t;
#else
typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
#endif
#endif

The above is intentionally verbose, because some 32 bit architectures can implement 64 bit non-tearing counters without atomics [1], and the new 64 bit counter library should be prepared to accept such optimizations.
[1]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/


Picking up on another branch of this discussion:
This 64 bit counter library is *not* an internal API. Applications should use it for their counters too.

> Bruce, any comments for x86?
  
Stephen Hemminger May 17, 2024, 3:05 p.m. UTC | #5
On Fri, 17 May 2024 08:44:42 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> I guess it depends on the architecture's natural alignment size and the compiler - especially on 32 bit architectures, where the natural alignment size is 4 bytes.
> 
> We could play it safe and add alignment to the counter type:
> 
> #include <stdalign.h>
> #ifdef RTE_ARCH_64
> #if alignof(uint64_t) < sizeof(uint64_t)

I don't think this case is possible; what architecture is that broken on?

> typedef alignas(8) uint64_t rte_counter64_t;
> #else
> typedef uint64_t rte_counter64_t;
> #endif
> #else
> #if alignof(RTE_ATOMIC(uint64_t)) < sizeof(uint64_t)
> typedef alignas(8) RTE_ATOMIC(uint64_t) rte_counter64_t;
> #else
> typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> #endif
> #endif

The bigger question is how to detect whether 32 bit x86 is safe without atomics
(and does it still matter?).
  
Stephen Hemminger May 17, 2024, 3:07 p.m. UTC | #6
On Fri, 17 May 2024 04:26:58 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > 
> > The C standard has always guaranteed that read and write to unsigned log will not be split.  
> As I understand, this is true only if the variable is an atomic variable. If not, there is no difference between atomic variables and non-atomic variables.

Let me look.
It certainly falls under the "if a compiler did this, it is crazy and would never be able to run Linux" rule, per Linus.

> 
> > Therefore if arch is 64 bit native there is no need for atomics  
> At least on ARM architecture, if the variable is not aligned on 8B boundary, the load or store are not atomic. I am sure it is the same on other architectures.
> Bruce, any comments for x86?


This code needs to assume alignment to 8B. I'll document that.
But no runtime checks please.
  
Stephen Hemminger May 17, 2024, 4:18 p.m. UTC | #7
On Fri, 17 May 2024 08:44:42 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Friday, 17 May 2024 06.27
> > 
> > + Bruce for feedback on x86 architecture
> >   
> > > On May 16, 2024, at 10:30 PM, Stephen Hemminger <stephen@networkplumber.org>  
> > wrote:  
> > >
> > > On Fri, 17 May 2024 02:45:12 +0000
> > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > >  
> > >>> + * A counter is 64 bit value that is safe from split read/write
> > >>> + * on 32 bit platforms. It assumes that only one cpu at a time  
> > >> If we are defining the counter in this manner, then implementation cannot  
> > be generic. I think architectures will have constraints if they have to ensure
> > the 64b variables are not split.  
> > >>
> > >> I think we at least need the counter to be aligned on 8B boundary to have  
> > generic code.  
> > >
> > > The C standard has always guaranteed that read and write to unsigned log  
> > will not be split.
> > As I understand, this is true only if the variable is an atomic variable. If
> > not, there is no difference between atomic variables and non-atomic variables.
> >   
> > > Therefore if arch is 64 bit native there is no need for atomics  
> > At least on ARM architecture, if the variable is not aligned on 8B boundary,
> > the load or store are not atomic. I am sure it is the same on other
> > architectures.  

After reading this: Who's afraid of a big bad optimizing compiler?
 https://lwn.net/Articles/793253/

Looks like you are right, and atomic or read/write once is required.
Perhaps introducing rte_read_once and rte_write_once is a good idea?
Several drivers already implement them.
  
Morten Brørup May 18, 2024, 2 p.m. UTC | #8
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 17 May 2024 18.18
> 
> On Fri, 17 May 2024 08:44:42 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Friday, 17 May 2024 06.27
> > >
> > > + Bruce for feedback on x86 architecture
> > >
> > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger
> <stephen@networkplumber.org>
> > > wrote:
> > > >
> > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > >
> > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time
> > > >> If we are defining the counter in this manner, then
> implementation cannot
> > > be generic. I think architectures will have constraints if they have
> to ensure
> > > the 64b variables are not split.
> > > >>
> > > >> I think we at least need the counter to be aligned on 8B boundary
> to have
> > > generic code.
> > > >
> > > > The C standard has always guaranteed that read and write to
> unsigned log
> > > will not be split.
> > > As I understand, this is true only if the variable is an atomic
> variable. If
> > > not, there is no difference between atomic variables and non-atomic
> variables.
> > >
> > > > Therefore if arch is 64 bit native there is no need for atomics
> > > At least on ARM architecture, if the variable is not aligned on 8B
> boundary,
> > > the load or store are not atomic. I am sure it is the same on other
> > > architectures.
> 
> After reading this: Who's afraid of a big bad optimizing compiler?
>  https://lwn.net/Articles/793253/

Very interesting article!

> 
> Looks like you are right, and atomic or read/write once is required.

I don't like the performance tradeoff (for 64 bit architectures) in the v7 patch.
For single-thread updated counters, we MUST have a high performance counter_add(), not using atomic read-modify-write.

IMO calling counter_fetch() MUST be possible from another thread.
This requires that the fast path thread stores the counter atomically (or using volatile), to ensure that the counter is not only kept in a CPU register, but stored in memory visible by other threads.

For counter_reset(), we have multiple options:
0. Don't support counter resetting. Not really an option?
1. Only allow calling counter_reset() in the fast path thread updating the counter. This introduces no further requirements.
2. Allow calling counter_reset() from another thread, thereby zeroing the "counter" variable. This introduces a requirement for the "counter" to be thread-safe, so counter_add() must atomically read-modify-write the counter, which has a performance cost.
3. Allow calling counter_reset() from another thread, and introduce an "offset" variable for counter_fetch() and counter_reset() to provide thread-safe fetch/reset from other threads, using the consume-release pattern.

I don't like option 2.
I consider counters too important and frequently used in the fast path, to compromise on performance for counters.

For counters updated by multiple fast path threads, atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

> Perhaps introducing rte_read_once and rte_write_once is good idea?
> Several drivers are implementing it already.

The read_once/write_once are easier to use, but they lack the flexibility (regarding barriers and locking) provided by their atomic_explicit alternatives, which will impact performance in some use cases.

We should strive for the highest possible performance, which means that we shouldn't introduce functions or design patterns preventing this.
Please note: Security vs. performance is another matter - we certainly don't want to promote insecure code for the benefit of performance. But for "ease of coding" vs. performance, I prefer performance.

That said, I agree that introducing rte_read/write_once functions for use by drivers to access hardware makes sense, to eliminate copy-paste variants in drivers.
But how can we prevent such functions from being used for other purposes, where atomic_explicit should be used?
  
Stephen Hemminger May 19, 2024, 3:13 p.m. UTC | #9
On Sat, 18 May 2024 16:00:55 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, 17 May 2024 18.18
> > 
> > On Fri, 17 May 2024 08:44:42 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >   
> > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > > Sent: Friday, 17 May 2024 06.27
> > > >
> > > > + Bruce for feedback on x86 architecture
> > > >  
> > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger  
> > <stephen@networkplumber.org>  
> > > > wrote:  
> > > > >
> > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > > >  
> > > > >>> + * A counter is 64 bit value that is safe from split read/write
> > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a time  
> > > > >> If we are defining the counter in this manner, then  
> > implementation cannot  
> > > > be generic. I think architectures will have constraints if they have  
> > to ensure  
> > > > the 64b variables are not split.  
> > > > >>
> > > > >> I think we at least need the counter to be aligned on 8B boundary  
> > to have  
> > > > generic code.  
> > > > >
> > > > > The C standard has always guaranteed that read and write to  
> > unsigned log  
> > > > will not be split.
> > > > As I understand, this is true only if the variable is an atomic  
> > variable. If  
> > > > not, there is no difference between atomic variables and non-atomic  
> > variables.  
> > > >  
> > > > > Therefore if arch is 64 bit native there is no need for atomics  
> > > > At least on ARM architecture, if the variable is not aligned on 8B  
> > boundary,  
> > > > the load or store are not atomic. I am sure it is the same on other
> > > > architectures.  
> > 
> > After reading this: Who's afraid of a big bad optimizing compiler?
> >  https://lwn.net/Articles/793253/  
> 
> Very interesting article!
> 
> > 
> > Looks like you are right, and atomic or read/write once is required.  
> 
> I don't like the performance tradeoff (for 64 bit architectures) in the v7 patch.
> For single-thread updated counters, we MUST have a high performance counter_add(), not using atomic read-modify-write.

The fundamental issue is that for SW drivers, having a totally safe reset function requires
an atomic operation in the fast path for increment. Otherwise the following (highly unlikely) race
is possible:
	
	CPU A					CPU B
		load counter (value = X)
						store counter = 0
		store counter (value = X + 1)


> 
> IMO calling counter_fetch() MUST be possible from another thread.
> This requires that the fast path thread stores the counter atomically (or using volatile), to ensure that the counter is not only kept in a CPU register, but stored in memory visible by other threads.
> 
> For counter_reset(), we have multiple options:
> > 0. Don't support counter resetting. Not really an option?

We could reject it in the SW drivers. But better not to.

> 1. Only allow calling counter_reset() in the fast path thread updating the counter. This introduces no further requirements.

Not a good restriction

> 2. Allow calling counter_reset() from another thread, thereby zeroing the "counter" variable. This introduces a requirement for the "counter" to be thread-safe, so counter_add() must atomically read-modify-write the counter, which has a performance cost.

> 3. Allow calling counter_reset() from another thread, and introduce an "offset" variable for counter_fetch() and counter_reset() to provide thread-safe fetch/reset from other threads, using the consume-release pattern.

Too confusing

> 
> I don't like option 2.
> I consider counters too important and frequently used in the fast path, to compromise on performance for counters.

Agree.

> 
> For counters updated by multiple fast path threads, atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.

Not at all worried about overhead in slow (fetch) path.

> 
> > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > Several drivers are implementing it already.  
> 
> The read_once/write_once are easier to use, but they lack the flexibility (regarding barriers and locking) provided by their atomic_explicit alternatives, which will impact performance in some use cases.

They solve the compiler reordering problem but do nothing about CPU ordering.
Also, there is no such "once" increment operation.

> 
> We should strive for the highest possible performance, which means that we shouldn't introduce functions or design patterns preventing this.
> Please note: Security vs. performance is another matter - we certainly don't want to promote insecure code for the benefit of performance. But for "ease of coding" vs. performance, I prefer performance.
> 
> That said, I agree that introducing rte_read/write_once functions for use by drivers to access hardware makes sense, to eliminate copy-paste variants in drivers.
> But how can we prevent such functions from being used for other purposes, where atomic_explicit should be used?
> 

Looking at the x86 result on godbolt shows that using atomics only adds a single locked operation.
Perhaps this can be ameliorated by doing a bulk add at the end of the loop, as many drivers already do.
  
Morten Brørup May 19, 2024, 5:10 p.m. UTC | #10
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Sunday, 19 May 2024 17.14
> 
> On Sat, 18 May 2024 16:00:55 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Friday, 17 May 2024 18.18
> > >
> > > On Fri, 17 May 2024 08:44:42 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >
> > > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > > > Sent: Friday, 17 May 2024 06.27
> > > > >
> > > > > + Bruce for feedback on x86 architecture
> > > > >
> > > > > > On May 16, 2024, at 10:30 PM, Stephen Hemminger
> > > <stephen@networkplumber.org>
> > > > > wrote:
> > > > > >
> > > > > > On Fri, 17 May 2024 02:45:12 +0000
> > > > > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > > > >
> > > > > >>> + * A counter is 64 bit value that is safe from split
> read/write
> > > > > >>> + * on 32 bit platforms. It assumes that only one cpu at a
> time
> > > > > >> If we are defining the counter in this manner, then
> > > implementation cannot
> > > > > be generic. I think architectures will have constraints if they
> have
> > > to ensure
> > > > > the 64b variables are not split.
> > > > > >>
> > > > > >> I think we at least need the counter to be aligned on 8B
> boundary
> > > to have
> > > > > generic code.
> > > > > >
> > > > > > The C standard has always guaranteed that read and write to
> > > unsigned log
> > > > > will not be split.
> > > > > As I understand, this is true only if the variable is an atomic
> > > variable. If
> > > > > not, there is no difference between atomic variables and non-
> atomic
> > > variables.
> > > > >
> > > > > > Therefore if arch is 64 bit native there is no need for
> atomics
> > > > > At least on ARM architecture, if the variable is not aligned on
> 8B
> > > boundary,
> > > > > the load or store are not atomic. I am sure it is the same on
> other
> > > > > architectures.
> > >
> > > After reading this: Who's afraid of a big bad optimizing compiler?
> > >  https://lwn.net/Articles/793253/
> >
> > Very interesting article!
> >
> > >
> > > Looks like you are right, and atomic or read/write once is required.
> >
> > I don't like the performance tradeoff (for 64 bit architectures) in
> the v7 patch.
> > For single-tread updated counters, we MUST have a high performance
> counter_add(), not using atomic read-modify-write.
> 
> The fundamental issue is that for SW drivers, having a totally safe
> reset function requires
> an atomic operation in the fast path for increment. Otherwise the
> following (highly unlikely) race
> is possible:
> 
> 	CPU A					CPU B
> 		load counter (value = X)
> 						store counter = 0
> 		store counter (value = X + 1)
> 

Yes, this is why I suggest the "offset" method, option 3.
Depending on which CPU wins the race, reset() will set "offset" to either X or X + 1 using your example here.
"offset" is atomic, so reading it cannot race writing it. And thus:
if counter = X was visible at reset(), fetch() will return counter - offset = X - X = 0, and
if counter = X + 1 was visible at reset(), fetch() will return counter - offset = (X + 1) - (X + 1) = 0.

> 
> >
> > IMO calling counter_fetch() MUST be possible from another thread.
> > This requires that the fast path thread stores the counter atomically
> (or using volatile), to ensure that the counter is not only kept in a
> CPU register, but stored in memory visible by other threads.
> >
> > For counter_reset(), we have multiple options:
> > 0. Don't support counter resetting. Not really on option?
> 
> We could reject it in the SW drivers. But better not to.

Agree.

> 
> > 1. Only allow calling counter_reset() in the fast path thread updating
> the counter. This introduces no further requirements.
> 
> Not a good restriction

Not good, but the alternatives seem to be worse.

> 
> > 2. Allow calling counter_reset() from another thread, thereby zeroing
> the "counter" variable. This introduces a requirement for the "counter"
> to be thread-safe, so counter_add() must atomically read-modify-write
> the counter, which has a performance cost.

Bad for performance.

> 
> > 3. Allow calling counter_reset() from another thread, and introduce an
> "offset" variable for counter_fetch() and counter_reset() to provide
> thread-safe fetch/reset from other threads, using the consume-release
> pattern.
> 
> Too confusing

Using offsets for pseudo-reset is a common design pattern.
And the performance is excellent.

Perhaps the Linux kernel doesn't use this design pattern, but that is not a valid reason to disqualify it.

> 
> >
> > I don't like option 2.
> > I consider counters too important and frequently used in the fast
> path, to compromise on performance for counters.
> 
> Agree.
> 
> >
> > For counters updated by multiple fast path threads,
> atomic_fetch_add_explicit() of the "counter" variable seems unavoidable.
> 
> Not at all worried about overhead in slow (fetch) path.

Agree. Fetch() is slow path, so performance is "nice to have", but nothing more.

> 
> >
> > > Perhaps introducing rte_read_once and rte_write_once is good idea?
> > > Several drivers are implementing it already.
> >
> > The read_once/write_once are easier to use, but they lack the
> flexibility (regarding barriers and locking) provided by their
> atomic_explicit alternatives, which will impact performance in some use
> cases.
> 
> They solve the compiler reordering problem but do nothing about cpu
> ordering.
> Also, there is no such increment.

Agree. They are designed for hardware access, not multithreading.

> 
> >
> > We should strive for the highest possible performance, which means
> that we shouldn't introduce functions or design patterns preventing
> this.
> > Please note: Security vs. performance is another matter - we certainly
> don't want to promote insecure code for the benefit of performance. But
> for "ease of coding" vs. performance, I prefer performance.
> >
> > That said, I agree that introducing rte_read/write_once functions for
> use by drivers to access hardware makes sense, to eliminate copy-paste
> variants in drivers.
> > But how can we prevent such functions from being used for other
> purposes, where atomic_explicit should be used?
> >
> 
> Looking at x86 result on godbolt shows that using atomic only adds a
> single locked operation.
> Perhaps this can be ameliorated by doing bulk add at end of loop, like
> many drivers were already.

Absolutely; whenever possible, local counters should be maintained inside the loop, and added to the public counters at the end of a loop.

Please note that application counters might be spread all over memory.
E.g. per-flow in a flow structure, per QoS class in a QoS class structure, per subscriber in a subscriber structure, etc. And a burst of packets might touch multiple of these. My point is: Atomic read-modify-write of counters will cause serious stalling, waiting for memory access.
  
Stephen Hemminger May 19, 2024, 10:49 p.m. UTC | #11
On Sun, 19 May 2024 19:10:30 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> Absolutely; whenever possible, local counters should be maintained inside the loop, and added to the public counters at the end of a loop.
> 
> Please note that application counters might be spread all over memory.
> E.g. per-flow in a flow structure, per QoS class in a QoS class structure, per subscriber in a subscriber structure, etc. And a burst of packets might touch multiple of these. My point is: Atomic read-modify-write of counters will cause serious stalling, waiting for memory access

If an application needs to keep up at the speeds DPDK makes possible, then it needs to worry about its
cache access patterns. Last time I checked, handling 10G 64 byte packets at line rate without loss
leaves a budget of at most 2 cache misses per packet. Very hard to do with any non-trivial application.

Also, SW QoS above 1G is very hard to do with modern CPUs, especially with multiple flows.
It may be possible with something like FQ CoDel, which only keeps a small amount of state.
  
Morten Brørup May 20, 2024, 7:57 a.m. UTC | #12
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Monday, 20 May 2024 00.49
> 
> On Sun, 19 May 2024 19:10:30 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > Absolutely; whenever possible, local counters should be maintained
> inside the loop, and added to the public counters at the end of a loop.
> >
> > Please note that application counters might be spread all over memory.
> > E.g. per-flow in a flow structure, per QoS class in a QoS class
> structure, per subscriber in a subscriber structure, etc. And a burst of
> packets might touch multiple of these. My point is: Atomic read-modify-
> write of counters will cause serious stalling, waiting for memory access
> 
> If an application needs to keep up at DPDK possible speeds, then it
> needs to worry about its
> cache access patterns. Last time I checked handling 10G 64 byte packets
> at line rate without loss
> means a maximum of 2 cache misses. Very hard to do with any non trivial
> application.

Yes, very hard.
Which is why I insist that counters must have the absolutely highest possible performance in the fast path.
Non-trivial applications are likely to maintain many instances of application specific counters.

I consider this patch in the series to be the EAL library's generic 64 bit counters, for application use too. So I am reviewing it from a much broader perspective than just the SW drivers.
The SW drivers' use of these counters is not only an improvement of those drivers, it's also an excellent reference use case showing how to use this new EAL 64 bit counters library.

> 
> Also, SW QoS above 1G is very hard to do with modern CPU's. Especially
> with multiple flows.
> It maybe possible with something like FQ Codel which only keeps small
> amount of state.

Yep. Designing software for high performance is all about using optimal algorithms! :-)
  
