[RFC,v3] net/af_packet: make stats reset reliable

Message ID 20240503154547.392069-1-ferruh.yigit@amd.com (mailing list archive)
State New
Delegated to: Ferruh Yigit
Headers
Series [RFC,v3] net/af_packet: make stats reset reliable |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Ferruh Yigit May 3, 2024, 3:45 p.m. UTC
For stats reset, use an offset instead of zeroing out actual stats values,
get_stats() displays diff between stats and offset.
This way stats only updated in datapath and offset only updated in stats
reset function. This makes stats reset function more reliable.

As stats only written by single thread, we can remove 'volatile' qualifier
which should improve the performance in datapath.

While updating around, 'igb_stats' parameter renamed as 'stats'.

Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
---
Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Morten Brørup <mb@smartsharesystems.com>

This update triggered by mail list discussion [1].

[1]
https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/

v2:
* Remove wrapping check for stats

v3:
* counter and offset put into same struct per stats
* Use atomic load / store for stats values
---
 drivers/net/af_packet/rte_eth_af_packet.c | 98 ++++++++++++++++-------
 1 file changed, 68 insertions(+), 30 deletions(-)
  

Comments

Stephen Hemminger May 3, 2024, 10 p.m. UTC | #1
On Fri, 3 May 2024 16:45:47 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> For stats reset, use an offset instead of zeroing out actual stats values,
> get_stats() displays diff between stats and offset.
> This way stats only updated in datapath and offset only updated in stats
> reset function. This makes stats reset function more reliable.
> 
> As stats only written by single thread, we can remove 'volatile' qualifier
> which should improve the performance in datapath.
> 
> While updating around, 'igb_stats' parameter renamed as 'stats'.
> 
> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
> ---
> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Morten Brørup <mb@smartsharesystems.com>
> 
> This update triggered by mail list discussion [1].
> 
> [1]
> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/


NAK

I did not hear a good argument why atomic or volatile was necessary in the first place.
Why?

Why is this driver special (a snowflake) compared to all the other drivers doing software
statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?
  
Ferruh Yigit May 7, 2024, 1:48 p.m. UTC | #2
On 5/3/2024 11:00 PM, Stephen Hemminger wrote:
> On Fri, 3 May 2024 16:45:47 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>> For stats reset, use an offset instead of zeroing out actual stats values,
>> get_stats() displays diff between stats and offset.
>> This way stats only updated in datapath and offset only updated in stats
>> reset function. This makes stats reset function more reliable.
>>
>> As stats only written by single thread, we can remove 'volatile' qualifier
>> which should improve the performance in datapath.
>>
>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>
>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>> ---
>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>
>> This update triggered by mail list discussion [1].
>>
>> [1]
>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
> 
> 
> NAK
> 
> I did not hear a good argument why atomic or volatile was necessary in the first place.
> Why?
> 

Sure, the patch is done as RFC intentionally to discuss the approach.

Agree that volatile and atomics (fetch + add + store) is not required
for thread synchronization, as only one CPU updates stats.
Even this understanding is important because there are PMDs using full
atomics for stats update, like null PMD [1], this will help up clear them.


And there is a case, stats reset and stats updated in different threads
simultaneously, for this 'volatile' is not sufficient anyway and full
atomics is required. As this will cause performance impact we are
already saying stats update and reset can't happen at the same time [2].
With this update volatile and atomics are not required for this case too.
(Also using offset to increase stats reset reliability.)


In this patch volatile replaced with atomic load and atomic store (not
atomic fetch and add), to ensure that stats stored to memory and not
kept in device registers only.
With volatile, it is guaranteed that updated stats stored back to
memory, but without volatile and atomics I am not sure if this is
guaranteed. Practically I can see this working, but theoretically not
sure. This is similar concern with change in your patch that is casting
to volatile to ensure value read from memory [3].

Expectation is, only atomics load and store will have smaller
performance impact than volatile, ensuring memory load and store when
needed.


[1]
https://git.dpdk.org/dpdk/tree/drivers/net/null/rte_eth_null.c?h=v24.03#n105

[2]
https://patches.dpdk.org/project/dpdk/patch/20240425165308.1078454-1-ferruh.yigit@amd.com/

[3]
https://inbox.dpdk.org/dev/20240430154129.7347-1-stephen@networkplumber.org/
`#define READ_ONCE(var) (*((volatile typeof(var) *)(&(var))))`



> Why is this driver special (a snowflake) compared to all the other drivers doing software
> statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?
>

Nothing special at all, only discussion started based on af_packet
implementation. If we give a decision based on this RFC, same logic can
be followed with existing or new software PMDs.
  
Stephen Hemminger May 7, 2024, 2:52 p.m. UTC | #3
On Tue, 7 May 2024 14:48:51 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> On 5/3/2024 11:00 PM, Stephen Hemminger wrote:
> > On Fri, 3 May 2024 16:45:47 +0100
> > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >   
> >> For stats reset, use an offset instead of zeroing out actual stats values,
> >> get_stats() displays diff between stats and offset.
> >> This way stats only updated in datapath and offset only updated in stats
> >> reset function. This makes stats reset function more reliable.
> >>
> >> As stats only written by single thread, we can remove 'volatile' qualifier
> >> which should improve the performance in datapath.
> >>
> >> While updating around, 'igb_stats' parameter renamed as 'stats'.
> >>
> >> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
> >> ---
> >> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Cc: Stephen Hemminger <stephen@networkplumber.org>
> >> Cc: Morten Brørup <mb@smartsharesystems.com>
> >>
> >> This update triggered by mail list discussion [1].
> >>
> >> [1]
> >> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/  
> > 
> > 
> > NAK
> > 
> > I did not hear a good argument why atomic or volatile was necessary in the first place.
> > Why?
> >   
> 
> Sure, the patch is done as RFC intentionally to discuss the approach.
> 
> Agree that volatile and atomics (fetch + add + store) is not required
> for thread synchronization, as only one CPU updates stats.
> Even this understanding is important because there are PMDs using full
> atomics for stats update, like null PMD [1], this will help up clear them.
> 
> 
> And there is a case, stats reset and stats updated in different threads
> simultaneously, for this 'volatile' is not sufficient anyway and full
> atomics is required. As this will cause performance impact we are
> already saying stats update and reset can't happen at the same time [2].
> With this update volatile and atomics are not required for this case too.
> (Also using offset to increase stats reset reliability.)
> 
> 
> In this patch volatile replaced with atomic load and atomic store (not
> atomic fetch and add), to ensure that stats stored to memory and not
> kept in device registers only.
> With volatile, it is guaranteed that updated stats stored back to
> memory, but without volatile and atomics I am not sure if this is
> guaranteed. Practically I can see this working, but theoretically not
> sure. This is similar concern with change in your patch that is casting
> to volatile to ensure value read from memory [3].
> 
> Expectation is, only atomics load and store will have smaller
> performance impact than volatile, ensuring memory load and store when
> needed.

The device register worry, can just be handled with a compiler barrier.
Does not need the stronger guarantee of atomic or volatile.
  
Morten Brørup May 7, 2024, 3:27 p.m. UTC | #4
> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> Sent: Friday, 3 May 2024 17.46
> 
> For stats reset, use an offset instead of zeroing out actual stats values,
> get_stats() displays diff between stats and offset.
> This way stats only updated in datapath and offset only updated in stats
> reset function. This makes stats reset function more reliable.
> 
> As stats only written by single thread, we can remove 'volatile' qualifier
> which should improve the performance in datapath.
> 
> While updating around, 'igb_stats' parameter renamed as 'stats'.
> 
> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
> ---
> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Morten Brørup <mb@smartsharesystems.com>
> 
> This update triggered by mail list discussion [1].
> 
> [1]
> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-
> 5f4dd3969f99@lysator.liu.se/
> 
> v2:
> * Remove wrapping check for stats
> 
> v3:
> * counter and offset put into same struct per stats
> * Use atomic load / store for stats values
> ---

Note: My comments below relate to software PMDs only.

Design for the following invariants:
1. "counter" may increase at any time. (So stopping forwarding is not required.)
2. "counter" may not decrease.
3. "offset" is always <= "counter".

So:

Stats_get() must read "offset" before "counter"; if "counter" races to increase in the mean time, it doesn't hurt. Stats_get() is a relatively "cold" function, so barriers etc. are acceptable.

Assuming that stats_add() lazy-writes "counter"; if stats_get() reads an old value, its result will be slightly off, but not negative.

Similarly for stats_reset(), which obviously reads "counter" before writing "offset"; if "counter" races to increase in the mean time, the too low "offset" will not cause negative stats from stats_get().


And a requested change for performance:

> +struct stats {
> +	uint64_t counter;
> +	uint64_t offset;
> +};

The "offset" is cold.
Stats_add(), which is the only hot function, only touches "counter".

Instead of having a struct with {counter, offset}, I strongly prefer having them separate.
E.g. as a struct defining the set of statistics (e.g. pkts, bytes, errors), instantiated once for the counters (in a hot zone of the device data structure) and once for the offsets (in a cold zone of the device data structure).
There could be variants of this "set of statistics" struct, e.g. one for RX and a different one for TX. (Each variant would be instantiated twice, once for counters, and once for offsets.)
  
Ferruh Yigit May 7, 2024, 5:27 p.m. UTC | #5
On 5/7/2024 3:52 PM, Stephen Hemminger wrote:
> On Tue, 7 May 2024 14:48:51 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>> On 5/3/2024 11:00 PM, Stephen Hemminger wrote:
>>> On Fri, 3 May 2024 16:45:47 +0100
>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>   
>>>> For stats reset, use an offset instead of zeroing out actual stats values,
>>>> get_stats() displays diff between stats and offset.
>>>> This way stats only updated in datapath and offset only updated in stats
>>>> reset function. This makes stats reset function more reliable.
>>>>
>>>> As stats only written by single thread, we can remove 'volatile' qualifier
>>>> which should improve the performance in datapath.
>>>>
>>>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>>>
>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>>> ---
>>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>>
>>>> This update triggered by mail list discussion [1].
>>>>
>>>> [1]
>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/  
>>>
>>>
>>> NAK
>>>
>>> I did not hear a good argument why atomic or volatile was necessary in the first place.
>>> Why?
>>>   
>>
>> Sure, the patch is done as RFC intentionally to discuss the approach.
>>
>> Agree that volatile and atomics (fetch + add + store) is not required
>> for thread synchronization, as only one CPU updates stats.
>> Even this understanding is important because there are PMDs using full
>> atomics for stats update, like null PMD [1], this will help up clear them.
>>
>>
>> And there is a case, stats reset and stats updated in different threads
>> simultaneously, for this 'volatile' is not sufficient anyway and full
>> atomics is required. As this will cause performance impact we are
>> already saying stats update and reset can't happen at the same time [2].
>> With this update volatile and atomics are not required for this case too.
>> (Also using offset to increase stats reset reliability.)
>>
>>
>> In this patch volatile replaced with atomic load and atomic store (not
>> atomic fetch and add), to ensure that stats stored to memory and not
>> kept in device registers only.
>> With volatile, it is guaranteed that updated stats stored back to
>> memory, but without volatile and atomics I am not sure if this is
>> guaranteed. Practically I can see this working, but theoretically not
>> sure. This is similar concern with change in your patch that is casting
>> to volatile to ensure value read from memory [3].
>>
>> Expectation is, only atomics load and store will have smaller
>> performance impact than volatile, ensuring memory load and store when
>> needed.
> 
> The device register worry, can just be handled with a compiler barrier.
> Does not need the stronger guarantee of atomic or volatile.
>

Based on Morten's email, counter being stored late to memory may not be
an issue, so may not need even compiler barrier, let me check again.
  
Ferruh Yigit May 7, 2024, 5:40 p.m. UTC | #6
On 5/7/2024 4:27 PM, Morten Brørup wrote:
>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>> Sent: Friday, 3 May 2024 17.46
>>
>> For stats reset, use an offset instead of zeroing out actual stats values,
>> get_stats() displays diff between stats and offset.
>> This way stats only updated in datapath and offset only updated in stats
>> reset function. This makes stats reset function more reliable.
>>
>> As stats only written by single thread, we can remove 'volatile' qualifier
>> which should improve the performance in datapath.
>>
>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>
>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>> ---
>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>
>> This update triggered by mail list discussion [1].
>>
>> [1]
>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-
>> 5f4dd3969f99@lysator.liu.se/
>>
>> v2:
>> * Remove wrapping check for stats
>>
>> v3:
>> * counter and offset put into same struct per stats
>> * Use atomic load / store for stats values
>> ---
> 
> Note: My comments below relate to software PMDs only.
> 
> Design for the following invariants:
> 1. "counter" may increase at any time. (So stopping forwarding is not required.)
>

Mattias mentioned a case [1] that may end up 'offset > count', for being
safe side we may start with restriction.

[1]
https://inbox.dpdk.org/dev/20240425174617.2126159-1-ferruh.yigit@amd.com/T/#m29cd179228c164181d2bb7dea716dee6e91ab169

> 2. "counter" may not decrease.
> 3. "offset" is always <= "counter".
> 
> So:
> 
> Stats_get() must read "offset" before "counter"; if "counter" races to increase in the mean time, it doesn't hurt. Stats_get() is a relatively "cold" function, so barriers etc. are acceptable.
> 
> Assuming that stats_add() lazy-writes "counter"; if stats_get() reads an old value, its result will be slightly off, but not negative.
> 
> Similarly for stats_reset(), which obviously reads "counter" before writing "offset"; if "counter" races to increase in the mean time, the too low "offset" will not cause negative stats from stats_get().
> 

ack on above items.

> 
> And a requested change for performance:
> 
>> +struct stats {
>> +	uint64_t counter;
>> +	uint64_t offset;
>> +};
> 
> The "offset" is cold.
> Stats_add(), which is the only hot function, only touches "counter".
> 
> Instead of having a struct with {counter, offset}, I strongly prefer having them separate.
> E.g. as a struct defining the set of statistics (e.g. pkts, bytes, errors), instantiated once for the counters (in a hot zone of the device data structure) and once for the offsets (in a cold zone of the device data structure).
> There could be variants of this "set of statistics" struct, e.g. one for RX and a different one for TX. (Each variant would be instantiated twice, once for counters, and once for offsets.)
> 

Although having them together was logical, good point from performance
perspective, let me work on it.
  
Mattias Rönnblom May 8, 2024, 7:19 a.m. UTC | #7
On 2024-05-04 00:00, Stephen Hemminger wrote:
> On Fri, 3 May 2024 16:45:47 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>> For stats reset, use an offset instead of zeroing out actual stats values,
>> get_stats() displays diff between stats and offset.
>> This way stats only updated in datapath and offset only updated in stats
>> reset function. This makes stats reset function more reliable.
>>
>> As stats only written by single thread, we can remove 'volatile' qualifier
>> which should improve the performance in datapath.
>>
>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>
>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>> ---
>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>
>> This update triggered by mail list discussion [1].
>>
>> [1]
>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
> 
> 
> NAK
> 
> I did not hear a good argument why atomic or volatile was necessary in the first place.
> Why?
> 

On the reader side, loads should be atomic.
On the writer side, stores should be atomic.

Updates (stores) should actually occur in a timely manner. The complete 
read-modify-write cycle need not be atomic, since we only have a single 
writer. All this for the per-lcore counter case.

If load or store tearing occurs, the counter values may occasionally 
take totally bogus values. I think that should be avoided. Especially 
since it will likely come at a very reasonable cost.

 From what it seems to me, load or store tearing may well occur. GCC may 
generate two 32-bit stores for a program-level 64-bit store on 32-bit 
x86. If you have constant and immediate-data store instructions, 
constant writes may also be end up teared. The kernel documentation has 
some example of this. Add LTO, it's not necessarily going to be all that 
clear what is storing-a-constant and what is not.

Maybe you care a little less if statistics are occasionally broken, or 
some transient, inconsistent state, but generally they should work, and 
they should never have some totally bogus values. So, statistics aren't 
snow flakes, mostly just business as usual.

We can't both have a culture that promotes C11-style parallel 
programming, or, at the extreme, push the C11 APIs as-is, and the say 
"and btw you don't have to care about the standard when it comes to 
statistics".

We could adopt the Linux kernel's rules, programming model, and APIs 
(ignoring legal issues). That would be very old school, maybe somewhat 
over-engineered for our purpose, include a fair amount of inline 
assembler, and also and may well depend on GCC or GCC-like compilers, 
just like what I believe the kernel does.

We could use something in-between, heavily inspired by C11 but still 
with an opportunity to work around compiler issues, library issues, and
extend the API for our use case.

I agree we shouldn't have to mark statistics _Atomic, or RTE_ATOMIC(), 
rte_atomic64_t, or rte_sometimes_atomic_and_sometimes_not64_t. Just 
keeping the usual C integer types seems like a better option to me.

> Why is this driver special (a snowflake) compared to all the other drivers doing software
> statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?

If a broken piece of code has been copied around, one place is going to 
be the first to be fixed.
  
Stephen Hemminger May 8, 2024, 3:23 p.m. UTC | #8
On Wed, 8 May 2024 09:19:02 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-05-04 00:00, Stephen Hemminger wrote:
> > On Fri, 3 May 2024 16:45:47 +0100
> > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >   
> >> For stats reset, use an offset instead of zeroing out actual stats values,
> >> get_stats() displays diff between stats and offset.
> >> This way stats only updated in datapath and offset only updated in stats
> >> reset function. This makes stats reset function more reliable.
> >>
> >> As stats only written by single thread, we can remove 'volatile' qualifier
> >> which should improve the performance in datapath.
> >>
> >> While updating around, 'igb_stats' parameter renamed as 'stats'.
> >>
> >> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
> >> ---
> >> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Cc: Stephen Hemminger <stephen@networkplumber.org>
> >> Cc: Morten Brørup <mb@smartsharesystems.com>
> >>
> >> This update triggered by mail list discussion [1].
> >>
> >> [1]
> >> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/  
> > 
> > 
> > NAK
> > 
> > I did not hear a good argument why atomic or volatile was necessary in the first place.
> > Why?
> >   
> 
> On the reader side, loads should be atomic.
> On the writer side, stores should be atomic.
> 
> Updates (stores) should actually occur in a timely manner. The complete 
> read-modify-write cycle need not be atomic, since we only have a single 
> writer. All this for the per-lcore counter case.
> 
> If load or store tearing occurs, the counter values may occasionally 
> take totally bogus values. I think that should be avoided. Especially 
> since it will likely come at a very reasonable cost.
> 
>  From what it seems to me, load or store tearing may well occur. GCC may 
> generate two 32-bit stores for a program-level 64-bit store on 32-bit 
> x86. If you have constant and immediate-data store instructions, 
> constant writes may also be end up teared. The kernel documentation has 
> some example of this. Add LTO, it's not necessarily going to be all that 
> clear what is storing-a-constant and what is not.
> 
> Maybe you care a little less if statistics are occasionally broken, or 
> some transient, inconsistent state, but generally they should work, and 
> they should never have some totally bogus values. So, statistics aren't 
> snow flakes, mostly just business as usual.
> 
> We can't both have a culture that promotes C11-style parallel 
> programming, or, at the extreme, push the C11 APIs as-is, and the say 
> "and btw you don't have to care about the standard when it comes to 
> statistics".
> 
> We could adopt the Linux kernel's rules, programming model, and APIs 
> (ignoring legal issues). That would be very old school, maybe somewhat 
> over-engineered for our purpose, include a fair amount of inline 
> assembler, and also and may well depend on GCC or GCC-like compilers, 
> just like what I believe the kernel does.
> 
> We could use something in-between, heavily inspired by C11 but still 
> with an opportunity to work around compiler issues, library issues, and
> extend the API for our use case.
> 
> I agree we shouldn't have to mark statistics _Atomic, or RTE_ATOMIC(), 
> rte_atomic64_t, or rte_sometimes_atomic_and_sometimes_not64_t. Just 
> keeping the usual C integer types seems like a better option to me.
> 
> > Why is this driver special (a snowflake) compared to all the other drivers doing software
> > statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?  
> 
> If a broken piece of code has been copied around, one place is going to 
> be the first to be fixed.


I dislike when any driver does something completely different than valid precedent.
No other driver in DPDK, Vpp, FreeBSD, Linux (and probably Windows) uses atomic for
updating statistics. We even got performance benefit at MS from removing atomic
increment of staistics in internal layers.

The idea of load tearing is crazy talk of integral types. It would break so many things.
It is the kind of stupid compiler thing that would send Linus on a rant and get
the GCC compiler writers in trouble. 

The DPDK has always favored performance over strict safety guard rails everywhere.
Switching to making every statistic an atomic operation is not in the spirit of
what is required. There is no strict guarantee necessary here.
  
Ferruh Yigit May 8, 2024, 7:48 p.m. UTC | #9
On 5/8/2024 4:23 PM, Stephen Hemminger wrote:
> On Wed, 8 May 2024 09:19:02 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> On 2024-05-04 00:00, Stephen Hemminger wrote:
>>> On Fri, 3 May 2024 16:45:47 +0100
>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>   
>>>> For stats reset, use an offset instead of zeroing out actual stats values,
>>>> get_stats() displays diff between stats and offset.
>>>> This way stats only updated in datapath and offset only updated in stats
>>>> reset function. This makes stats reset function more reliable.
>>>>
>>>> As stats only written by single thread, we can remove 'volatile' qualifier
>>>> which should improve the performance in datapath.
>>>>
>>>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>>>
>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>>> ---
>>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>>
>>>> This update triggered by mail list discussion [1].
>>>>
>>>> [1]
>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/  
>>>
>>>
>>> NAK
>>>
>>> I did not hear a good argument why atomic or volatile was necessary in the first place.
>>> Why?
>>>   
>>
>> On the reader side, loads should be atomic.
>> On the writer side, stores should be atomic.
>>
>> Updates (stores) should actually occur in a timely manner. The complete 
>> read-modify-write cycle need not be atomic, since we only have a single 
>> writer. All this for the per-lcore counter case.
>>
>> If load or store tearing occurs, the counter values may occasionally 
>> take totally bogus values. I think that should be avoided. Especially 
>> since it will likely come at a very reasonable cost.
>>
>>  From what it seems to me, load or store tearing may well occur. GCC may 
>> generate two 32-bit stores for a program-level 64-bit store on 32-bit 
>> x86. If you have constant and immediate-data store instructions, 
>> constant writes may also be end up teared. The kernel documentation has 
>> some example of this. Add LTO, it's not necessarily going to be all that 
>> clear what is storing-a-constant and what is not.
>>
>> Maybe you care a little less if statistics are occasionally broken, or 
>> some transient, inconsistent state, but generally they should work, and 
>> they should never have some totally bogus values. So, statistics aren't 
>> snow flakes, mostly just business as usual.
>>
>> We can't both have a culture that promotes C11-style parallel 
>> programming, or, at the extreme, push the C11 APIs as-is, and the say 
>> "and btw you don't have to care about the standard when it comes to 
>> statistics".
>>
>> We could adopt the Linux kernel's rules, programming model, and APIs 
>> (ignoring legal issues). That would be very old school, maybe somewhat 
>> over-engineered for our purpose, include a fair amount of inline 
>> assembler, and also and may well depend on GCC or GCC-like compilers, 
>> just like what I believe the kernel does.
>>
>> We could use something in-between, heavily inspired by C11 but still 
>> with an opportunity to work around compiler issues, library issues, and
>> extend the API for our use case.
>>
>> I agree we shouldn't have to mark statistics _Atomic, or RTE_ATOMIC(), 
>> rte_atomic64_t, or rte_sometimes_atomic_and_sometimes_not64_t. Just 
>> keeping the usual C integer types seems like a better option to me.
>>
>>> Why is this driver special (a snowflake) compared to all the other drivers doing software
>>> statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?  
>>
>> If a broken piece of code has been copied around, one place is going to 
>> be the first to be fixed.
> 
> 
> I dislike when any driver does something completely different than valid precedent.
> No other driver in DPDK, Vpp, FreeBSD, Linux (and probably Windows) uses atomic for
> updating statistics. We even got performance benefit at MS from removing atomic
> increment of staistics in internal layers.
> 
> The idea of load tearing is crazy talk of integral types. It would break so many things.
> It is the kind of stupid compiler thing that would send Linus on a rant and get
> the GCC compiler writers in trouble. 
> 
> The DPDK has always favored performance over strict safety guard rails everywhere.
> Switching to making every statistic an atomic operation is not in the spirit of
> what is required. There is no strict guarantee necessary here.
> 

I kind of agree with Stephen.

Thanks Mattias, Morten & Stephen, it was informative discussion. But for
*SW drivers* stats update and reset is not core functionality and I
think we can be OK to get hit on corner cases, instead of
over-engineering or making code more complex.

I am for putting priority as following (from high to low):
- Datapath performance
- Stats get accuracy
- Stats reset accuracy

With the restriction that stat reset requires forwarding to stop, we can
even drop 'offset' logic.
And I am not sure if it is a real requirement that stats reset should be
supported during forwarding, although I can see it is convenient.
If we get this requirement in the future, we can focus on a solution.


As action,
I am planning to send a new version of this RFC that only removes the
'volatile' qualifier.
In next step we can remove atomic updates and volatile stat counters
from more SW drivers.

Thanks,
ferruh
  
Stephen Hemminger May 8, 2024, 8:54 p.m. UTC | #10
On Wed, 8 May 2024 20:48:06 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > 
> > The idea of load tearing is crazy talk of integral types. It would break so many things.
> > It is the kind of stupid compiler thing that would send Linus on a rant and get
> > the GCC compiler writers in trouble. 
> > 
> > The DPDK has always favored performance over strict safety guard rails everywhere.
> > Switching to making every statistic an atomic operation is not in the spirit of
> > what is required. There is no strict guarantee necessary here.
> >   
> 
> I kind of agree with Stephen.
> 
> Thanks Mattias, Morten & Stephen, it was informative discussion. But for
> *SW drivers* stats update and reset is not core functionality and I
> think we can be OK to get hit on corner cases, instead of
> over-engineering or making code more complex.


I forgot the case of 64 bit values on 32 bit platforms!
Mostly because haven't cared about 32 bit for years...

The Linux kernel uses some wrappers to handle this.
On 64 bit platforms they become noop.
On 32 bit platform, they are protected by a seqlock and updates are
wrapped by the sequence count.

If we go this way, then doing similar Noop on 64 bit and atomic or seqlock
on 32 bit should be done, but in common helper.

Looking inside FreeBSD, it looks like that has changed over the years as well.

	if_inc_counter
		counter_u64_add
			atomic_add_64
But the counters are always per-cpu in this case. So although it does use
locked operation, will always be uncontended.
				

PS: Does DPDK still actually support 32 bit on x86? Can it be dropped this cycle?
  
Morten Brørup May 9, 2024, 7:43 a.m. UTC | #11
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 8 May 2024 22.54
> 
> On Wed, 8 May 2024 20:48:06 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
> > >
> > > The idea of load tearing is crazy talk of integral types. It would
> break so many things.
> > > It is the kind of stupid compiler thing that would send Linus on a
> rant and get
> > > the GCC compiler writers in trouble.
> > >
> > > The DPDK has always favored performance over strict safety guard
> rails everywhere.
> > > Switching to making every statistic an atomic operation is not in
> the spirit of
> > > what is required. There is no strict guarantee necessary here.
> > >
> >
> > I kind of agree with Stephen.
> >
> > Thanks Mattias, Morten & Stephen, it was informative discussion. But
> for
> > *SW drivers* stats update and reset is not core functionality and I
> > think we can be OK to get hit on corner cases, instead of
> > over-engineering or making code more complex.
> 
> 
> I forgot the case of 64 bit values on 32 bit platforms!
> Mostly because haven't cared about 32 bit for years...
> 
> The Linux kernel uses some wrappers to handle this.
> On 64 bit platforms they become noop.
> On 32 bit platform, they are protected by a seqlock and updates are
> wrapped by the sequence count.
> 
> If we go this way, then doing similar Noop on 64 bit and atomic or
> seqlock
> on 32 bit should be done, but in common helper.
> 
> Looking inside FreeBSD, it looks like that has changed over the years as
> well.
> 
> 	if_inc_counter
> 		counter_u64_add
> 			atomic_add_64
> But the counters are always per-cpu in this case. So although it does
> use
> locked operation, will always be uncontended.
> 
> 
> PS: Does DPDK still actually support 32 bit on x86? Can it be dropped
> this cycle?

We cannot drop 32 bit architecture support altogether.

But, unlike the Linux kernel, DPDK doesn't need to support ancient 32 bit architectures.
If the few 32 bit architectures supported by DPDK provide non-tearing 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit counters.

In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit architecture) and 32 bit ARMv8.
I don't think DPDK support any other 32 bit architectures.


As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64 bit non-tearing load/store.

Looking at ARMv7-A documentation, this architecture offers 64 bit non-tearing load/store by using two 32-bit registers and double-word Exclusive load and store instructions, LDREXD and STREXD. I don't know how costly they are, performance wise.

Supporting 64 bit counters has much broader scope than SW drivers.
Providing a "DPDK standard" design pattern with some utility functions would be useful.

The af_packet driver could serve as a reference use case.
It maintains both per-thread (per-queue) counters and the dev->data->rx_mbuf_alloc_failed counter shared by multiple threads.
  
Bruce Richardson May 9, 2024, 9:29 a.m. UTC | #12
On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 8 May 2024 22.54
> > 
> > On Wed, 8 May 2024 20:48:06 +0100
> > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > 
> > > >
> > > > The idea of load tearing is crazy talk of integral types. It would
> > break so many things.
> > > > It is the kind of stupid compiler thing that would send Linus on a
> > rant and get
> > > > the GCC compiler writers in trouble.
> > > >
> > > > The DPDK has always favored performance over strict safety guard
> > rails everywhere.
> > > > Switching to making every statistic an atomic operation is not in
> > the spirit of
> > > > what is required. There is no strict guarantee necessary here.
> > > >
> > >
> > > I kind of agree with Stephen.
> > >
> > > Thanks Mattias, Morten & Stephen, it was informative discussion. But
> > for
> > > *SW drivers* stats update and reset is not core functionality and I
> > > think we can be OK to get hit on corner cases, instead of
> > > over-engineering or making code more complex.
> > 
> > 
> > I forgot the case of 64 bit values on 32 bit platforms!
> > Mostly because haven't cared about 32 bit for years...
> > 
> > The Linux kernel uses some wrappers to handle this.
> > On 64 bit platforms they become noop.
> > On 32 bit platform, they are protected by a seqlock and updates are
> > wrapped by the sequence count.
> > 
> > If we go this way, then doing similar Noop on 64 bit and atomic or
> > seqlock
> > on 32 bit should be done, but in common helper.
> > 
> > Looking inside FreeBSD, it looks like that has changed over the years as
> > well.
> > 
> > 	if_inc_counter
> > 		counter_u64_add
> > 			atomic_add_64
> > But the counters are always per-cpu in this case. So although it does
> > use
> > locked operation, will always be uncontended.
> > 
> > 
> > PS: Does DPDK still actually support 32 bit on x86? Can it be dropped
> > this cycle?
> 
> We cannot drop 32 bit architecture support altogether.
> 
> But, unlike the Linux kernel, DPDK doesn't need to support ancient 32 bit architectures.
> If the few 32 bit architectures supported by DPDK provide non-tearing 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit counters.
> 
> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit architecture) and 32 bit ARMv8.
> I don't think DPDK support any other 32 bit architectures.
> 
> 
> As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64 bit non-tearing load/store.
> 

Testing this a little in godbolt, I see gcc using xmm registers on 32-bit
when updating 64-bit counters, but clang doesn't seem to do so, but instead
does 2 stores when writing back the 64 value. (I tried with both volatile
and non-volatile 64-bit values, just to see if volatile would encourage
clang to do a single store).

GCC: https://godbolt.org/z/9eqKfT3hz
Clang: https://godbolt.org/z/PT5EqKn4c

/Bruce
  
Morten Brørup May 9, 2024, 11:37 a.m. UTC | #13
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 9 May 2024 11.30
> 
> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 8 May 2024 22.54
> > >
> > > On Wed, 8 May 2024 20:48:06 +0100
> > > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > >
> > > > >
> > > > > The idea of load tearing is crazy talk of integral types. It
> would
> > > break so many things.
> > > > > It is the kind of stupid compiler thing that would send Linus on
> a
> > > rant and get
> > > > > the GCC compiler writers in trouble.
> > > > >
> > > > > The DPDK has always favored performance over strict safety guard
> > > rails everywhere.
> > > > > Switching to making every statistic an atomic operation is not
> in
> > > the spirit of
> > > > > what is required. There is no strict guarantee necessary here.
> > > > >
> > > >
> > > > I kind of agree with Stephen.
> > > >
> > > > Thanks Mattias, Morten & Stephen, it was informative discussion.
> But
> > > for
> > > > *SW drivers* stats update and reset is not core functionality and
> I
> > > > think we can be OK to get hit on corner cases, instead of
> > > > over-engineering or making code more complex.
> > >
> > >
> > > I forgot the case of 64 bit values on 32 bit platforms!
> > > Mostly because haven't cared about 32 bit for years...
> > >
> > > The Linux kernel uses some wrappers to handle this.
> > > On 64 bit platforms they become noop.
> > > On 32 bit platform, they are protected by a seqlock and updates are
> > > wrapped by the sequence count.
> > >
> > > If we go this way, then doing similar Noop on 64 bit and atomic or
> > > seqlock
> > > on 32 bit should be done, but in common helper.
> > >
> > > Looking inside FreeBSD, it looks like that has changed over the
> years as
> > > well.
> > >
> > > 	if_inc_counter
> > > 		counter_u64_add
> > > 			atomic_add_64
> > > But the counters are always per-cpu in this case. So although it
> does
> > > use
> > > locked operation, will always be uncontended.
> > >
> > >
> > > PS: Does DPDK still actually support 32 bit on x86? Can it be
> dropped
> > > this cycle?
> >
> > We cannot drop 32 bit architecture support altogether.
> >
> > But, unlike the Linux kernel, DPDK doesn't need to support ancient 32
> bit architectures.
> > If the few 32 bit architectures supported by DPDK provide non-tearing
> 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
> counters.
> >
> > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
> architecture) and 32 bit ARMv8.
> > I don't think DPDK support any other 32 bit architectures.
> >
> >
> > As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
> bit non-tearing load/store.
> >
> 
> Testing this a little in godbolt, I see gcc using xmm registers on 32-
> bit
> when updating 64-bit counters, but clang doesn't seem to do so, but
> instead
> does 2 stores when writing back the 64 value. (I tried with both
> volatile
> and non-volatile 64-bit values, just to see if volatile would encourage
> clang to do a single store).
> 
> GCC: https://godbolt.org/z/9eqKfT3hz
> Clang: https://godbolt.org/z/PT5EqKn4c

Interesting.
I guess this can be fixed by manually implementing what GCC does.

I'm more concerned about finding a high-performance (in the fast path) 64 bit counter solution for 32 bit ARM.
  
Morten Brørup May 9, 2024, 2:19 p.m. UTC | #14
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 9 May 2024 13.37
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, 9 May 2024 11.30
> >
> > On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Wednesday, 8 May 2024 22.54
> > > >
> > > > On Wed, 8 May 2024 20:48:06 +0100
> > > > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > > >
> > > > > >
> > > > > > The idea of load tearing is crazy talk of integral types. It
> > would
> > > > break so many things.
> > > > > > It is the kind of stupid compiler thing that would send Linus
> on
> > a
> > > > rant and get
> > > > > > the GCC compiler writers in trouble.
> > > > > >
> > > > > > The DPDK has always favored performance over strict safety
> guard
> > > > rails everywhere.
> > > > > > Switching to making every statistic an atomic operation is not
> > in
> > > > the spirit of
> > > > > > what is required. There is no strict guarantee necessary here.
> > > > > >
> > > > >
> > > > > I kind of agree with Stephen.
> > > > >
> > > > > Thanks Mattias, Morten & Stephen, it was informative discussion.
> > But
> > > > for
> > > > > *SW drivers* stats update and reset is not core functionality
> and
> > I
> > > > > think we can be OK to get hit on corner cases, instead of
> > > > > over-engineering or making code more complex.
> > > >
> > > >
> > > > I forgot the case of 64 bit values on 32 bit platforms!
> > > > Mostly because haven't cared about 32 bit for years...
> > > >
> > > > The Linux kernel uses some wrappers to handle this.
> > > > On 64 bit platforms they become noop.
> > > > On 32 bit platform, they are protected by a seqlock and updates
> are
> > > > wrapped by the sequence count.
> > > >
> > > > If we go this way, then doing similar Noop on 64 bit and atomic or
> > > > seqlock
> > > > on 32 bit should be done, but in common helper.
> > > >
> > > > Looking inside FreeBSD, it looks like that has changed over the
> > years as
> > > > well.
> > > >
> > > > 	if_inc_counter
> > > > 		counter_u64_add
> > > > 			atomic_add_64
> > > > But the counters are always per-cpu in this case. So although it
> > does
> > > > use
> > > > locked operation, will always be uncontended.
> > > >
> > > >
> > > > PS: Does DPDK still actually support 32 bit on x86? Can it be
> > dropped
> > > > this cycle?
> > >
> > > We cannot drop 32 bit architecture support altogether.
> > >
> > > But, unlike the Linux kernel, DPDK doesn't need to support ancient
> 32
> > bit architectures.
> > > If the few 32 bit architectures supported by DPDK provide non-
> tearing
> > 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
> > counters.
> > >
> > > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
> > architecture) and 32 bit ARMv8.
> > > I don't think DPDK support any other 32 bit architectures.
> > >
> > >
> > > As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
> > bit non-tearing load/store.
> > >
> >
> > Testing this a little in godbolt, I see gcc using xmm registers on 32-
> > bit
> > when updating 64-bit counters, but clang doesn't seem to do so, but
> > instead
> > does 2 stores when writing back the 64 value. (I tried with both
> > volatile
> > and non-volatile 64-bit values, just to see if volatile would
> encourage
> > clang to do a single store).
> >
> > GCC: https://godbolt.org/z/9eqKfT3hz
> > Clang: https://godbolt.org/z/PT5EqKn4c
> 
> Interesting.
> I guess this can be fixed by manually implementing what GCC does.
> 
> I'm more concerned about finding a high-performance (in the fast path)
> 64 bit counter solution for 32 bit ARM.

Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) instructions, which load/store 64 bit from memory into two registers at once.

Clang is emits more efficient code without volatile.
GCC requires volatile to use STRD.

Clang: https://godbolt.org/z/WjdTq6EKh
GCC: https://godbolt.org/z/qq9j7d4Ea

Summing it up, it is possible to implement non-tearing 64 bit high-performance (lockless, barrier-free) counters on the 32 bit architectures supported by DPDK.

But the implementation is both architecture and compiler specific.
So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 bit integers" library, for support of the signed variant too; but I don't think we need that.)
We can use uint64_t as the underlying type and type cast in the library (where needed by the specific compiler/architecture), or introduce a new rte_ctr64_t type to ensure that accessor functions are always used and the developer doesn't have to worry about tearing on 32 bit architectures.

The most simple variant of such a library only provides load and store functions. The API would look like this:

uint64_t inline
rte_ctr64_get(const rte_ctr64_t *const ctr);

void inline
rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);

And if some CPU offers special instructions for increment or addition, faster (regarding performance) and/or more compact (regarding instruction memory) than a sequence of load-add-store instructions:

void inline
rte_ctr64_inc(rte_ctr64_t *const ctr);

void inline
rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);

<feature creep>
And perhaps atomic variants of all these functions, with explicit and/or relaxed memory ordering, for counters shared by multiple threads.
</feature creep>
  
Stephen Hemminger May 10, 2024, 4:56 a.m. UTC | #15
On Thu, 9 May 2024 16:19:08 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 9 May 2024 13.37
> >   
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Thursday, 9 May 2024 11.30
> > >
> > > On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:  
> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > Sent: Wednesday, 8 May 2024 22.54
> > > > >
> > > > > On Wed, 8 May 2024 20:48:06 +0100
> > > > > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > > > >  
> > > > > > >
> > > > > > > The idea of load tearing is crazy talk of integral types. It  
> > > would  
> > > > > break so many things.  
> > > > > > > It is the kind of stupid compiler thing that would send Linus  
> > on  
> > > a  
> > > > > rant and get  
> > > > > > > the GCC compiler writers in trouble.
> > > > > > >
> > > > > > > The DPDK has always favored performance over strict safety  
> > guard  
> > > > > rails everywhere.  
> > > > > > > Switching to making every statistic an atomic operation is not  
> > > in  
> > > > > the spirit of  
> > > > > > > what is required. There is no strict guarantee necessary here.
> > > > > > >  
> > > > > >
> > > > > > I kind of agree with Stephen.
> > > > > >
> > > > > > Thanks Mattias, Morten & Stephen, it was informative discussion.  
> > > But  
> > > > > for  
> > > > > > *SW drivers* stats update and reset is not core functionality  
> > and  
> > > I  
> > > > > > think we can be OK to get hit on corner cases, instead of
> > > > > > over-engineering or making code more complex.  
> > > > >
> > > > >
> > > > > I forgot the case of 64 bit values on 32 bit platforms!
> > > > > Mostly because haven't cared about 32 bit for years...
> > > > >
> > > > > The Linux kernel uses some wrappers to handle this.
> > > > > On 64 bit platforms they become noop.
> > > > > On 32 bit platform, they are protected by a seqlock and updates  
> > are  
> > > > > wrapped by the sequence count.
> > > > >
> > > > > If we go this way, then doing similar Noop on 64 bit and atomic or
> > > > > seqlock
> > > > > on 32 bit should be done, but in common helper.
> > > > >
> > > > > Looking inside FreeBSD, it looks like that has changed over the  
> > > years as  
> > > > > well.
> > > > >
> > > > > 	if_inc_counter
> > > > > 		counter_u64_add
> > > > > 			atomic_add_64
> > > > > But the counters are always per-cpu in this case. So although it  
> > > does  
> > > > > use
> > > > > locked operation, will always be uncontended.
> > > > >
> > > > >
> > > > > PS: Does DPDK still actually support 32 bit on x86? Can it be  
> > > dropped  
> > > > > this cycle?  
> > > >
> > > > We cannot drop 32 bit architecture support altogether.
> > > >
> > > > But, unlike the Linux kernel, DPDK doesn't need to support ancient  
> > 32  
> > > bit architectures.  
> > > > If the few 32 bit architectures supported by DPDK provide non-  
> > tearing  
> > > 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
> > > counters.  
> > > >
> > > > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit  
> > > architecture) and 32 bit ARMv8.  
> > > > I don't think DPDK support any other 32 bit architectures.
> > > >
> > > >
> > > > As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64  
> > > bit non-tearing load/store.  
> > > >  
> > >
> > > Testing this a little in godbolt, I see gcc using xmm registers on 32-
> > > bit
> > > when updating 64-bit counters, but clang doesn't seem to do so, but
> > > instead
> > > does 2 stores when writing back the 64 value. (I tried with both
> > > volatile
> > > and non-volatile 64-bit values, just to see if volatile would  
> > encourage  
> > > clang to do a single store).
> > >
> > > GCC: https://godbolt.org/z/9eqKfT3hz
> > > Clang: https://godbolt.org/z/PT5EqKn4c  
> > 
> > Interesting.
> > I guess this can be fixed by manually implementing what GCC does.
> > 
> > I'm more concerned about finding a high-performance (in the fast path)
> > 64 bit counter solution for 32 bit ARM.  
> 
> Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) instructions, which load/store 64 bit from memory into two registers at once.
> 
> Clang is emits more efficient code without volatile.
> GCC requires volatile to use STRD.
> 
> Clang: https://godbolt.org/z/WjdTq6EKh
> GCC: https://godbolt.org/z/qq9j7d4Ea
> 
> Summing it up, it is possible to implement non-tearing 64 bit high-performance (lockless, barrier-free) counters on the 32 bit architectures supported by DPDK.
> 
> But the implementation is both architecture and compiler specific.
> So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 bit integers" library, for support of the signed variant too; but I don't think we need that.)
> We can use uint64_t as the underlying type and type cast in the library (where needed by the specific compiler/architecture), or introduce a new rte_ctr64_t type to ensure that accessor functions are always used and the developer doesn't have to worry about tearing on 32 bit architectures.
> 
> The most simple variant of such a library only provides load and store functions. The API would look like this:
> 
> uint64_t inline
> rte_ctr64_get(const rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
> 
> And if some CPU offers special instructions for increment or addition, faster (regarding performance) and/or more compact (regarding instruction memory) than a sequence of load-add-store instructions:
> 
> void inline
> rte_ctr64_inc(rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
> 
> <feature creep>
> And perhaps atomic variants of all these functions, with explicit and/or relaxed memory ordering, for counters shared by multiple threads.
> </feature creep>
> 


This kind of what I am experimenting with but...
Intend to keep the details of the counters read and update in one file and not as inlines.
  
Morten Brørup May 10, 2024, 9:14 a.m. UTC | #16
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 10 May 2024 06.56
> 
> On Thu, 9 May 2024 16:19:08 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Thursday, 9 May 2024 13.37
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Thursday, 9 May 2024 11.30
> > > >
> > > > On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
> > > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > > > Sent: Wednesday, 8 May 2024 22.54
> > > > > >
> > > > > > On Wed, 8 May 2024 20:48:06 +0100
> > > > > > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > > > > >
> > > > > > > >
> > > > > > > > The idea of load tearing is crazy talk of integral types.
> It
> > > > would
> > > > > > break so many things.
> > > > > > > > It is the kind of stupid compiler thing that would send
> Linus
> > > on
> > > > a
> > > > > > rant and get
> > > > > > > > the GCC compiler writers in trouble.
> > > > > > > >
> > > > > > > > The DPDK has always favored performance over strict safety
> > > guard
> > > > > > rails everywhere.
> > > > > > > > Switching to making every statistic an atomic operation is
> not
> > > > in
> > > > > > the spirit of
> > > > > > > > what is required. There is no strict guarantee necessary
> here.
> > > > > > > >
> > > > > > >
> > > > > > > I kind of agree with Stephen.
> > > > > > >
> > > > > > > Thanks Mattias, Morten & Stephen, it was informative
> discussion.
> > > > But
> > > > > > for
> > > > > > > *SW drivers* stats update and reset is not core
> functionality
> > > and
> > > > I
> > > > > > > think we can be OK to get hit on corner cases, instead of
> > > > > > > over-engineering or making code more complex.
> > > > > >
> > > > > >
> > > > > > I forgot the case of 64 bit values on 32 bit platforms!
> > > > > > Mostly because haven't cared about 32 bit for years...
> > > > > >
> > > > > > The Linux kernel uses some wrappers to handle this.
> > > > > > On 64 bit platforms they become noop.
> > > > > > On 32 bit platform, they are protected by a seqlock and
> updates
> > > are
> > > > > > wrapped by the sequence count.
> > > > > >
> > > > > > If we go this way, then doing similar Noop on 64 bit and
> atomic or
> > > > > > seqlock
> > > > > > on 32 bit should be done, but in common helper.
> > > > > >
> > > > > > Looking inside FreeBSD, it looks like that has changed over
> the
> > > > years as
> > > > > > well.
> > > > > >
> > > > > > 	if_inc_counter
> > > > > > 		counter_u64_add
> > > > > > 			atomic_add_64
> > > > > > But the counters are always per-cpu in this case. So although
> it
> > > > does
> > > > > > use
> > > > > > locked operation, will always be uncontended.
> > > > > >
> > > > > >
> > > > > > PS: Does DPDK still actually support 32 bit on x86? Can it be
> > > > dropped
> > > > > > this cycle?
> > > > >
> > > > > We cannot drop 32 bit architecture support altogether.
> > > > >
> > > > > But, unlike the Linux kernel, DPDK doesn't need to support
> ancient
> > > 32
> > > > bit architectures.
> > > > > If the few 32 bit architectures supported by DPDK provide non-
> > > tearing
> > > > 64 bit loads/stores, we don't need locks (in the fast path) for 64
> bit
> > > > counters.
> > > > >
> > > > > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
> > > > architecture) and 32 bit ARMv8.
> > > > > I don't think DPDK support any other 32 bit architectures.
> > > > >
> > > > >
> > > > > As Mattias mentioned, 32 bit x86 can use xmm registers to
> provide 64
> > > > bit non-tearing load/store.
> > > > >
> > > >
> > > > Testing this a little in godbolt, I see gcc using xmm registers on
> 32-
> > > > bit
> > > > when updating 64-bit counters, but clang doesn't seem to do so,
> but
> > > > instead
> > > > does 2 stores when writing back the 64 value. (I tried with both
> > > > volatile
> > > > and non-volatile 64-bit values, just to see if volatile would
> > > encourage
> > > > clang to do a single store).
> > > >
> > > > GCC: https://godbolt.org/z/9eqKfT3hz
> > > > Clang: https://godbolt.org/z/PT5EqKn4c
> > >
> > > Interesting.
> > > I guess this can be fixed by manually implementing what GCC does.
> > >
> > > I'm more concerned about finding a high-performance (in the fast
> path)
> > > 64 bit counter solution for 32 bit ARM.
> >
> > Reading up on the topic, and continuing Bruce's experiment on Godbolt,
> it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store
> Register Dual) instructions, which load/store 64 bit from memory into
> two registers at once.
> >
> > Clang is emits more efficient code without volatile.
> > GCC requires volatile to use STRD.
> >
> > Clang: https://godbolt.org/z/WjdTq6EKh
> > GCC: https://godbolt.org/z/qq9j7d4Ea
> >
> > Summing it up, it is possible to implement non-tearing 64 bit high-
> performance (lockless, barrier-free) counters on the 32 bit
> architectures supported by DPDK.
> >
> > But the implementation is both architecture and compiler specific.
> > So it seems a "64 bit counters" library would be handy. (Or a "non-
> tearing 64 bit integers" library, for support of the signed variant too;
> but I don't think we need that.)
> > We can use uint64_t as the underlying type and type cast in the
> library (where needed by the specific compiler/architecture), or
> introduce a new rte_ctr64_t type to ensure that accessor functions are
> always used and the developer doesn't have to worry about tearing on 32
> bit architectures.
> >
> > The most simple variant of such a library only provides load and store
> functions. The API would look like this:
> >
> > uint64_t inline
> > rte_ctr64_get(const rte_ctr64_t *const ctr);
> >
> > void inline
> > rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
> >
> > And if some CPU offers special instructions for increment or addition,
> faster (regarding performance) and/or more compact (regarding
> instruction memory) than a sequence of load-add-store instructions:
> >
> > void inline
> > rte_ctr64_inc(rte_ctr64_t *const ctr);
> >
> > void inline
> > rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);

Note: 32 bit architectures might achieve higher performance if the "value" parameter to rte_ctr64_add() is unsigned long (or unsigned int) instead of uint64_t.

> >
> > <feature creep>
> > And perhaps atomic variants of all these functions, with explicit
> and/or relaxed memory ordering, for counters shared by multiple threads.
> > </feature creep>
> >
> 
> 
> This kind of what I am experimenting with but...

Excellent!

> Intend to keep the details of the counters read and update in one file
> and not as inlines.

Please note that traffic management applications maintain many counters (e.g. per-flow, per-subscriber and per-link packet and byte counters, some also per QoS class), so rte_ctr64_add() must have the highest performance technically possible.

For reference, the packet scheduling code in our application updates 28 statistics counters per burst of packets. (In addition to internal state variables for e.g. queue lenghts.)

Furthermore, our application processes and displays live statistics with one second granularity, using a separate thread. Although statistics processing is not part of the fast path, the sheer number of counters processed requires high performance read access to those counters.


<more feature creep>
Some groups of counters are maintained locally in the inner loop, and then added in bulk to the "public" statistics afterwards. Conceptually:

struct stats_per_prio {
	uint64_t	forwarded_bytes;
	uint64_t	forwarded_packets;
	uint64_t	marked_bytes;
	uint64_t	marked_packets;
	uint64_t	dropped_bytes;
	uint64_t	dropped_packets;
};

If this is a common design pattern in DPDK (drivers, libraries and/or applications), perhaps also provide a performance optimized function for architectures offering vector instructions:

void inline
rte_ctr64_add_bulk(
		rte_ctr64_t *const ctrs,
		const unsigned long *const values,
		const unsigned int n /* compile time constant */);

This slightly resembles the Linux kernel's design pattern, where counters are updated in bulk, protected by a common lock for the bulk update. (However, DPDK has no lock, so the motivation for optimizing for this design pattern is only "nice to have".)

PS:
Many DPDK applications are 64 bit only, including the SmartShare appliances, and can easily manage 64 bit counters without all this.
However, if the DPDK project still considers 32 bit architectures first class citizens, 64 bit counters should have high performance on 32 bit architectures too.
  
Mattias Rönnblom May 26, 2024, 7:03 a.m. UTC | #17
On 2024-05-09 13:37, Morten Brørup wrote:
>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>> Sent: Thursday, 9 May 2024 11.30
>>
>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>> Sent: Wednesday, 8 May 2024 22.54
>>>>
>>>> On Wed, 8 May 2024 20:48:06 +0100
>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>
>>>>>>
>>>>>> The idea of load tearing is crazy talk of integral types. It
>> would
>>>> break so many things.
>>>>>> It is the kind of stupid compiler thing that would send Linus on
>> a
>>>> rant and get
>>>>>> the GCC compiler writers in trouble.
>>>>>>
>>>>>> The DPDK has always favored performance over strict safety guard
>>>> rails everywhere.
>>>>>> Switching to making every statistic an atomic operation is not
>> in
>>>> the spirit of
>>>>>> what is required. There is no strict guarantee necessary here.
>>>>>>
>>>>>
>>>>> I kind of agree with Stephen.
>>>>>
>>>>> Thanks Mattias, Morten & Stephen, it was informative discussion.
>> But
>>>> for
>>>>> *SW drivers* stats update and reset is not core functionality and
>> I
>>>>> think we can be OK to get hit on corner cases, instead of
>>>>> over-engineering or making code more complex.
>>>>
>>>>
>>>> I forgot the case of 64 bit values on 32 bit platforms!
>>>> Mostly because haven't cared about 32 bit for years...
>>>>
>>>> The Linux kernel uses some wrappers to handle this.
>>>> On 64 bit platforms they become noop.
>>>> On 32 bit platform, they are protected by a seqlock and updates are
>>>> wrapped by the sequence count.
>>>>
>>>> If we go this way, then doing similar Noop on 64 bit and atomic or
>>>> seqlock
>>>> on 32 bit should be done, but in common helper.
>>>>
>>>> Looking inside FreeBSD, it looks like that has changed over the
>> years as
>>>> well.
>>>>
>>>> 	if_inc_counter
>>>> 		counter_u64_add
>>>> 			atomic_add_64
>>>> But the counters are always per-cpu in this case. So although it
>> does
>>>> use
>>>> locked operation, will always be uncontended.
>>>>
>>>>
>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be
>> dropped
>>>> this cycle?
>>>
>>> We cannot drop 32 bit architecture support altogether.
>>>
>>> But, unlike the Linux kernel, DPDK doesn't need to support ancient 32
>> bit architectures.
>>> If the few 32 bit architectures supported by DPDK provide non-tearing
>> 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
>> counters.
>>>
>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
>> architecture) and 32 bit ARMv8.
>>> I don't think DPDK support any other 32 bit architectures.
>>>
>>>
>>> As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
>> bit non-tearing load/store.
>>>
>>
>> Testing this a little in godbolt, I see gcc using xmm registers on 32-
>> bit
>> when updating 64-bit counters, but clang doesn't seem to do so, but
>> instead
>> does 2 stores when writing back the 64 value. (I tried with both
>> volatile
>> and non-volatile 64-bit values, just to see if volatile would encourage
>> clang to do a single store).
>>
>> GCC: https://godbolt.org/z/9eqKfT3hz
>> Clang: https://godbolt.org/z/PT5EqKn4c
> 
> Interesting.
> I guess this can be fixed by manually implementing what GCC does.
> 

If you want an atomic store, use __atomic_store_n() or the equivalent.

On GCC, the resulting code will be compact and use an XMM register for 
the store.

On clang, you get much more instructions including a conditional jump to 
a compiler library function. You also get a compiler warning about 
misaligned atomic may incur significant cost (no surprise). All this on 
32-bit x86. Interesting, you get all those instructions even when you 
explicitly tell the compiler the address should be assumed to be 
aligned. Seems like a bug, or at least a short-coming.

So GCC assumes atomic stores are aligned, which seems perfectly 
reasonable to me.

> I'm more concerned about finding a high-performance (in the fast path) 64 bit counter solution for 32 bit ARM.
>
  
Mattias Rönnblom May 26, 2024, 7:07 a.m. UTC | #18
On 2024-05-09 16:19, Morten Brørup wrote:
>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>> Sent: Thursday, 9 May 2024 13.37
>>
>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>> Sent: Thursday, 9 May 2024 11.30
>>>
>>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
>>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>>> Sent: Wednesday, 8 May 2024 22.54
>>>>>
>>>>> On Wed, 8 May 2024 20:48:06 +0100
>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>
>>>>>>>
>>>>>>> The idea of load tearing is crazy talk of integral types. It
>>> would
>>>>> break so many things.
>>>>>>> It is the kind of stupid compiler thing that would send Linus
>> on
>>> a
>>>>> rant and get
>>>>>>> the GCC compiler writers in trouble.
>>>>>>>
>>>>>>> The DPDK has always favored performance over strict safety
>> guard
>>>>> rails everywhere.
>>>>>>> Switching to making every statistic an atomic operation is not
>>> in
>>>>> the spirit of
>>>>>>> what is required. There is no strict guarantee necessary here.
>>>>>>>
>>>>>>
>>>>>> I kind of agree with Stephen.
>>>>>>
>>>>>> Thanks Mattias, Morten & Stephen, it was informative discussion.
>>> But
>>>>> for
>>>>>> *SW drivers* stats update and reset is not core functionality
>> and
>>> I
>>>>>> think we can be OK to get hit on corner cases, instead of
>>>>>> over-engineering or making code more complex.
>>>>>
>>>>>
>>>>> I forgot the case of 64 bit values on 32 bit platforms!
>>>>> Mostly because haven't cared about 32 bit for years...
>>>>>
>>>>> The Linux kernel uses some wrappers to handle this.
>>>>> On 64 bit platforms they become noop.
>>>>> On 32 bit platform, they are protected by a seqlock and updates
>> are
>>>>> wrapped by the sequence count.
>>>>>
>>>>> If we go this way, then doing similar Noop on 64 bit and atomic or
>>>>> seqlock
>>>>> on 32 bit should be done, but in common helper.
>>>>>
>>>>> Looking inside FreeBSD, it looks like that has changed over the
>>> years as
>>>>> well.
>>>>>
>>>>> 	if_inc_counter
>>>>> 		counter_u64_add
>>>>> 			atomic_add_64
>>>>> But the counters are always per-cpu in this case. So although it
>>> does
>>>>> use
>>>>> locked operation, will always be uncontended.
>>>>>
>>>>>
>>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be
>>> dropped
>>>>> this cycle?
>>>>
>>>> We cannot drop 32 bit architecture support altogether.
>>>>
>>>> But, unlike the Linux kernel, DPDK doesn't need to support ancient
>> 32
>>> bit architectures.
>>>> If the few 32 bit architectures supported by DPDK provide non-
>> tearing
>>> 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
>>> counters.
>>>>
>>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
>>> architecture) and 32 bit ARMv8.
>>>> I don't think DPDK support any other 32 bit architectures.
>>>>
>>>>
>>>> As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
>>> bit non-tearing load/store.
>>>>
>>>
>>> Testing this a little in godbolt, I see gcc using xmm registers on 32-
>>> bit
>>> when updating 64-bit counters, but clang doesn't seem to do so, but
>>> instead
>>> does 2 stores when writing back the 64 value. (I tried with both
>>> volatile
>>> and non-volatile 64-bit values, just to see if volatile would
>> encourage
>>> clang to do a single store).
>>>
>>> GCC: https://godbolt.org/z/9eqKfT3hz
>>> Clang: https://godbolt.org/z/PT5EqKn4c
>>
>> Interesting.
>> I guess this can be fixed by manually implementing what GCC does.
>>
>> I'm more concerned about finding a high-performance (in the fast path)
>> 64 bit counter solution for 32 bit ARM.
> 
> Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) instructions, which load/store 64 bit from memory into two registers at once.
> 
> Clang is emits more efficient code without volatile.
> GCC requires volatile to use STRD.
> 
> Clang: https://godbolt.org/z/WjdTq6EKh
> GCC: https://godbolt.org/z/qq9j7d4Ea
> 
> Summing it up, it is possible to implement non-tearing 64 bit high-performance (lockless, barrier-free) counters on the 32 bit architectures supported by DPDK.
> 
> But the implementation is both architecture and compiler specific.
> So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 bit integers" library, for support of the signed variant too; but I don't think we need that.)
> We can use uint64_t as the underlying type and type cast in the library (where needed by the specific compiler/architecture), or introduce a new rte_ctr64_t type to ensure that accessor functions are always used and the developer doesn't have to worry about tearing on 32 bit architectures.
> 
> The most simple variant of such a library only provides load and store functions. The API would look like this:
> 
> uint64_t inline
> rte_ctr64_get(const rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
> 
> And if some CPU offers special instructions for increment or addition, faster (regarding performance) and/or more compact (regarding instruction memory) than a sequence of load-add-store instructions:
> 
> void inline
> rte_ctr64_inc(rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
> 
> <feature creep>
> And perhaps atomic variants of all these functions, with explicit and/or relaxed memory ordering, for counters shared by multiple threads.

Isn't rte_ctr64_get() already an atomic load with relaxed memory ordering?

> </feature creep>
>
  
Mattias Rönnblom May 26, 2024, 7:10 a.m. UTC | #19
On 2024-05-10 11:14, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Friday, 10 May 2024 06.56
>>
>> On Thu, 9 May 2024 16:19:08 +0200
>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>
>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>> Sent: Thursday, 9 May 2024 13.37
>>>>
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Thursday, 9 May 2024 11.30
>>>>>
>>>>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
>>>>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>>>>> Sent: Wednesday, 8 May 2024 22.54
>>>>>>>
>>>>>>> On Wed, 8 May 2024 20:48:06 +0100
>>>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>>>
>>>>>>>>>
>>>>>>>>> The idea of load tearing is crazy talk of integral types.
>> It
>>>>> would
>>>>>>> break so many things.
>>>>>>>>> It is the kind of stupid compiler thing that would send
>> Linus
>>>> on
>>>>> a
>>>>>>> rant and get
>>>>>>>>> the GCC compiler writers in trouble.
>>>>>>>>>
>>>>>>>>> The DPDK has always favored performance over strict safety
>>>> guard
>>>>>>> rails everywhere.
>>>>>>>>> Switching to making every statistic an atomic operation is
>> not
>>>>> in
>>>>>>> the spirit of
>>>>>>>>> what is required. There is no strict guarantee necessary
>> here.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I kind of agree with Stephen.
>>>>>>>>
>>>>>>>> Thanks Mattias, Morten & Stephen, it was informative
>> discussion.
>>>>> But
>>>>>>> for
>>>>>>>> *SW drivers* stats update and reset is not core
>> functionality
>>>> and
>>>>> I
>>>>>>>> think we can be OK to get hit on corner cases, instead of
>>>>>>>> over-engineering or making code more complex.
>>>>>>>
>>>>>>>
>>>>>>> I forgot the case of 64 bit values on 32 bit platforms!
>>>>>>> Mostly because haven't cared about 32 bit for years...
>>>>>>>
>>>>>>> The Linux kernel uses some wrappers to handle this.
>>>>>>> On 64 bit platforms they become noop.
>>>>>>> On 32 bit platform, they are protected by a seqlock and
>> updates
>>>> are
>>>>>>> wrapped by the sequence count.
>>>>>>>
>>>>>>> If we go this way, then doing similar Noop on 64 bit and
>> atomic or
>>>>>>> seqlock
>>>>>>> on 32 bit should be done, but in common helper.
>>>>>>>
>>>>>>> Looking inside FreeBSD, it looks like that has changed over
>> the
>>>>> years as
>>>>>>> well.
>>>>>>>
>>>>>>> 	if_inc_counter
>>>>>>> 		counter_u64_add
>>>>>>> 			atomic_add_64
>>>>>>> But the counters are always per-cpu in this case. So although
>> it
>>>>> does
>>>>>>> use
>>>>>>> locked operation, will always be uncontended.
>>>>>>>
>>>>>>>
>>>>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be
>>>>> dropped
>>>>>>> this cycle?
>>>>>>
>>>>>> We cannot drop 32 bit architecture support altogether.
>>>>>>
>>>>>> But, unlike the Linux kernel, DPDK doesn't need to support
>> ancient
>>>> 32
>>>>> bit architectures.
>>>>>> If the few 32 bit architectures supported by DPDK provide non-
>>>> tearing
>>>>> 64 bit loads/stores, we don't need locks (in the fast path) for 64
>> bit
>>>>> counters.
>>>>>>
>>>>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
>>>>> architecture) and 32 bit ARMv8.
>>>>>> I don't think DPDK support any other 32 bit architectures.
>>>>>>
>>>>>>
>>>>>> As Mattias mentioned, 32 bit x86 can use xmm registers to
>> provide 64
>>>>> bit non-tearing load/store.
>>>>>>
>>>>>
>>>>> Testing this a little in godbolt, I see gcc using xmm registers on
>> 32-
>>>>> bit
>>>>> when updating 64-bit counters, but clang doesn't seem to do so,
>> but
>>>>> instead
>>>>> does 2 stores when writing back the 64 value. (I tried with both
>>>>> volatile
>>>>> and non-volatile 64-bit values, just to see if volatile would
>>>> encourage
>>>>> clang to do a single store).
>>>>>
>>>>> GCC: https://godbolt.org/z/9eqKfT3hz
>>>>> Clang: https://godbolt.org/z/PT5EqKn4c
>>>>
>>>> Interesting.
>>>> I guess this can be fixed by manually implementing what GCC does.
>>>>
>>>> I'm more concerned about finding a high-performance (in the fast
>> path)
>>>> 64 bit counter solution for 32 bit ARM.
>>>
>>> Reading up on the topic, and continuing Bruce's experiment on Godbolt,
>> it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store
>> Register Dual) instructions, which load/store 64 bit from memory into
>> two registers at once.
>>>
>>> Clang is emits more efficient code without volatile.
>>> GCC requires volatile to use STRD.
>>>
>>> Clang: https://godbolt.org/z/WjdTq6EKh
>>> GCC: https://godbolt.org/z/qq9j7d4Ea
>>>
>>> Summing it up, it is possible to implement non-tearing 64 bit high-
>> performance (lockless, barrier-free) counters on the 32 bit
>> architectures supported by DPDK.
>>>
>>> But the implementation is both architecture and compiler specific.
>>> So it seems a "64 bit counters" library would be handy. (Or a "non-
>> tearing 64 bit integers" library, for support of the signed variant too;
>> but I don't think we need that.)
>>> We can use uint64_t as the underlying type and type cast in the
>> library (where needed by the specific compiler/architecture), or
>> introduce a new rte_ctr64_t type to ensure that accessor functions are
>> always used and the developer doesn't have to worry about tearing on 32
>> bit architectures.
>>>
>>> The most simple variant of such a library only provides load and store
>> functions. The API would look like this:
>>>
>>> uint64_t inline
>>> rte_ctr64_get(const rte_ctr64_t *const ctr);
>>>
>>> void inline
>>> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
>>>
>>> And if some CPU offers special instructions for increment or addition,
>> faster (regarding performance) and/or more compact (regarding
>> instruction memory) than a sequence of load-add-store instructions:
>>>
>>> void inline
>>> rte_ctr64_inc(rte_ctr64_t *const ctr);
>>>
>>> void inline
>>> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
> 
> Note: 32 bit architectures might achieve higher performance if the "value" parameter to rte_ctr64_add() is unsigned long (or unsigned int) instead of uint64_t.
> 

If the rte_ctr64_add() implementation is kept in the header file (or, 
always in case of LTO), I don't see how that could make a difference, 
assuming the operand is of a smaller (<64 bits) type.

>>>
>>> <feature creep>
>>> And perhaps atomic variants of all these functions, with explicit
>> and/or relaxed memory ordering, for counters shared by multiple threads.
>>> </feature creep>
>>>
>>
>>
>> This kind of what I am experimenting with but...
> 
> Excellent!
> 
>> Intend to keep the details of the counters read and update in one file
>> and not as inlines.
> 
> Please note that traffic management applications maintain many counters (e.g. per-flow, per-subscriber and per-link packet and byte counters, some also per QoS class), so rte_ctr64_add() must have the highest performance technically possible.
> 
> For reference, the packet scheduling code in our application updates 28 statistics counters per burst of packets. (In addition to internal state variables for e.g. queue lenghts.)
> 
> Furthermore, our application processes and displays live statistics with one second granularity, using a separate thread. Although statistics processing is not part of the fast path, the sheer number of counters processed requires high performance read access to those counters.
> 
> 
> <more feature creep>
> Some groups of counters are maintained locally in the inner loop, and then added in bulk to the "public" statistics afterwards. Conceptually:
> 
> struct stats_per_prio {
> 	uint64_t	forwarded_bytes;
> 	uint64_t	forwarded_packets;
> 	uint64_t	marked_bytes;
> 	uint64_t	marked_packets;
> 	uint64_t	dropped_bytes;
> 	uint64_t	dropped_packets;
> };
> 
> If this is a common design pattern in DPDK (drivers, libraries and/or applications), perhaps also provide a performance optimized function for architectures offering vector instructions:
> 
> void inline
> rte_ctr64_add_bulk(
> 		rte_ctr64_t *const ctrs,
> 		const unsigned long *const values,
> 		const unsigned int n /* compile time constant */);
> 
> This slightly resembles the Linux kernel's design pattern, where counters are updated in bulk, protected by a common lock for the bulk update. (However, DPDK has no lock, so the motivation for optimizing for this design pattern is only "nice to have".)
> 
> PS:
> Many DPDK applications are 64 bit only, including the SmartShare appliances, and can easily manage 64 bit counters without all this.
> However, if the DPDK project still considers 32 bit architectures first class citizens, 64 bit counters should have high performance on 32 bit architectures too.
>
  
Mattias Rönnblom May 26, 2024, 7:21 a.m. UTC | #20
On 2024-05-08 17:23, Stephen Hemminger wrote:
> On Wed, 8 May 2024 09:19:02 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> On 2024-05-04 00:00, Stephen Hemminger wrote:
>>> On Fri, 3 May 2024 16:45:47 +0100
>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>    
>>>> For stats reset, use an offset instead of zeroing out actual stats values,
>>>> get_stats() displays diff between stats and offset.
>>>> This way stats only updated in datapath and offset only updated in stats
>>>> reset function. This makes stats reset function more reliable.
>>>>
>>>> As stats only written by single thread, we can remove 'volatile' qualifier
>>>> which should improve the performance in datapath.
>>>>
>>>> While updating around, 'igb_stats' parameter renamed as 'stats'.
>>>>
>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>>> ---
>>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>>
>>>> This update triggered by mail list discussion [1].
>>>>
>>>> [1]
>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>
>>>
>>> NAK
>>>
>>> I did not hear a good argument why atomic or volatile was necessary in the first place.
>>> Why?
>>>    
>>
>> On the reader side, loads should be atomic.
>> On the writer side, stores should be atomic.
>>
>> Updates (stores) should actually occur in a timely manner. The complete
>> read-modify-write cycle need not be atomic, since we only have a single
>> writer. All this for the per-lcore counter case.
>>
>> If load or store tearing occurs, the counter values may occasionally
>> take totally bogus values. I think that should be avoided. Especially
>> since it will likely come at a very reasonable cost.
>>
>>   From what it seems to me, load or store tearing may well occur. GCC may
>> generate two 32-bit stores for a program-level 64-bit store on 32-bit
>> x86. If you have constant and immediate-data store instructions,
>> constant writes may also be end up teared. The kernel documentation has
>> some example of this. Add LTO, it's not necessarily going to be all that
>> clear what is storing-a-constant and what is not.
>>
>> Maybe you care a little less if statistics are occasionally broken, or
>> some transient, inconsistent state, but generally they should work, and
>> they should never have some totally bogus values. So, statistics aren't
>> snow flakes, mostly just business as usual.
>>
>> We can't both have a culture that promotes C11-style parallel
>> programming, or, at the extreme, push the C11 APIs as-is, and the say
>> "and btw you don't have to care about the standard when it comes to
>> statistics".
>>
>> We could adopt the Linux kernel's rules, programming model, and APIs
>> (ignoring legal issues). That would be very old school, maybe somewhat
>> over-engineered for our purpose, include a fair amount of inline
>> assembler, and also and may well depend on GCC or GCC-like compilers,
>> just like what I believe the kernel does.
>>
>> We could use something in-between, heavily inspired by C11 but still
>> with an opportunity to work around compiler issues, library issues, and
>> extend the API for our use case.
>>
>> I agree we shouldn't have to mark statistics _Atomic, or RTE_ATOMIC(),
>> rte_atomic64_t, or rte_sometimes_atomic_and_sometimes_not64_t. Just
>> keeping the usual C integer types seems like a better option to me.
>>
>>> Why is this driver special (a snowflake) compared to all the other drivers doing software
>>> statistics (tap, virtio, xdp, ring, memif, netvsc, vmware)?
>>
>> If a broken piece of code has been copied around, one place is going to
>> be the first to be fixed.
> 
> 
> I dislike when any driver does something completely different than valid precedent.
> No other driver in DPDK, Vpp, FreeBSD, Linux (and probably Windows) uses atomic for
> updating statistics. We even got performance benefit at MS from removing atomic
> increment of staistics in internal layers.
> 

All of those are using atomic stores when updating the statistics, I'm 
sure. Assuring a store being atomic is one thing, and assuring the whole 
read-modify-write cycle is atomic is a completely different (and very 
much more expensive) thing.

> The idea of load tearing is crazy talk of integral types. It would break so many things.
> It is the kind of stupid compiler thing that would send Linus on a rant and get
> the GCC compiler writers in trouble.
> 

On 32-bit x86, store tearing for 64-bit integral types is the order of 
the day.

For <64 bit types, I agree. The only cases (like the one listed in the 
kernel documentation), are going to be rare indeed.

> The DPDK has always favored performance over strict safety guard rails everywhere.
> Switching to making every statistic an atomic operation is not in the spirit of
> what is required. There is no strict guarantee necessary here.
> 

I think we agree but just use different terminology.
  

Patch

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 6b7b16f3486d..ebef1cb06450 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@ 
  * All rights reserved.
  */
 
+#include <rte_atomic.h>
 #include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
@@ -40,6 +41,11 @@ 
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)
 
+struct stats {
+	uint64_t counter;
+	uint64_t offset;
+};
+
 struct __rte_cache_aligned pkt_rx_queue {
 	int sockfd;
 
@@ -52,8 +58,8 @@  struct __rte_cache_aligned pkt_rx_queue {
 	uint16_t in_port;
 	uint8_t vlan_strip;
 
-	volatile unsigned long rx_pkts;
-	volatile unsigned long rx_bytes;
+	struct stats rx_pkts;
+	struct stats rx_bytes;
 };
 
 struct __rte_cache_aligned pkt_tx_queue {
@@ -65,9 +71,9 @@  struct __rte_cache_aligned pkt_tx_queue {
 	unsigned int framecount;
 	unsigned int framenum;
 
-	volatile unsigned long tx_pkts;
-	volatile unsigned long err_pkts;
-	volatile unsigned long tx_bytes;
+	struct stats tx_pkts;
+	struct stats err_pkts;
+	struct stats tx_bytes;
 };
 
 struct pmd_internals {
@@ -111,6 +117,34 @@  RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE);
 	rte_log(RTE_LOG_ ## level, af_packet_logtype, \
 		"%s(): " fmt ":%s\n", __func__, ##args, strerror(errno))
 
+static inline uint64_t
+stats_get(struct stats *s)
+{
+	uint64_t counter = rte_atomic_load_explicit(&s->counter,
+			rte_memory_order_relaxed);
+	uint64_t offset = rte_atomic_load_explicit(&s->offset,
+			rte_memory_order_relaxed);
+	return counter - offset;
+}
+
+static inline void
+stats_add(struct stats *s, uint16_t n)
+{
+	uint64_t counter = s->counter;
+	counter += n;
+	rte_atomic_store_explicit(&s->counter, counter,
+			rte_memory_order_relaxed);
+}
+
+static inline void
+stats_reset(struct stats *s)
+{
+	uint64_t counter = rte_atomic_load_explicit(&s->counter,
+			rte_memory_order_relaxed);
+	rte_atomic_store_explicit(&s->offset, counter,
+			rte_memory_order_relaxed);
+}
+
 static uint16_t
 eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
@@ -169,8 +203,8 @@  eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		num_rx_bytes += mbuf->pkt_len;
 	}
 	pkt_q->framenum = framenum;
-	pkt_q->rx_pkts += num_rx;
-	pkt_q->rx_bytes += num_rx_bytes;
+	stats_add(&pkt_q->rx_pkts, num_rx);
+	stats_add(&pkt_q->rx_bytes, num_rx_bytes);
 	return num_rx;
 }
 
@@ -305,9 +339,9 @@  eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	}
 
 	pkt_q->framenum = framenum;
-	pkt_q->tx_pkts += num_tx;
-	pkt_q->err_pkts += i - num_tx;
-	pkt_q->tx_bytes += num_tx_bytes;
+	stats_add(&pkt_q->tx_pkts, num_tx);
+	stats_add(&pkt_q->err_pkts, i - num_tx);
+	stats_add(&pkt_q->tx_bytes, num_tx_bytes);
 	return i;
 }
 
@@ -387,7 +421,7 @@  eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 }
 
 static int
-eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 {
 	unsigned i, imax;
 	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
@@ -397,27 +431,29 @@  eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
 	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
 	for (i = 0; i < imax; i++) {
-		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
-		igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
-		rx_total += igb_stats->q_ipackets[i];
-		rx_bytes_total += igb_stats->q_ibytes[i];
+		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
+		stats->q_ipackets[i] = stats_get(&rxq->rx_pkts);
+		stats->q_ibytes[i] = stats_get(&rxq->rx_bytes);
+		rx_total += stats->q_ipackets[i];
+		rx_bytes_total += stats->q_ibytes[i];
 	}
 
 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
 	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
 	for (i = 0; i < imax; i++) {
-		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
-		igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
-		tx_total += igb_stats->q_opackets[i];
-		tx_err_total += internal->tx_queue[i].err_pkts;
-		tx_bytes_total += igb_stats->q_obytes[i];
+		struct pkt_tx_queue *txq = &internal->tx_queue[i];
+		stats->q_opackets[i] = stats_get(&txq->tx_pkts);
+		stats->q_obytes[i] = stats_get(&txq->tx_bytes);
+		tx_total += stats->q_opackets[i];
+		tx_err_total += stats_get(&txq->err_pkts);
+		tx_bytes_total += stats->q_obytes[i];
 	}
 
-	igb_stats->ipackets = rx_total;
-	igb_stats->ibytes = rx_bytes_total;
-	igb_stats->opackets = tx_total;
-	igb_stats->oerrors = tx_err_total;
-	igb_stats->obytes = tx_bytes_total;
+	stats->ipackets = rx_total;
+	stats->ibytes = rx_bytes_total;
+	stats->opackets = tx_total;
+	stats->oerrors = tx_err_total;
+	stats->obytes = tx_bytes_total;
 	return 0;
 }
 
@@ -428,14 +464,16 @@  eth_stats_reset(struct rte_eth_dev *dev)
 	struct pmd_internals *internal = dev->data->dev_private;
 
 	for (i = 0; i < internal->nb_queues; i++) {
-		internal->rx_queue[i].rx_pkts = 0;
-		internal->rx_queue[i].rx_bytes = 0;
+		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
+		stats_reset(&rxq->rx_pkts);
+		stats_reset(&rxq->rx_bytes);
 	}
 
 	for (i = 0; i < internal->nb_queues; i++) {
-		internal->tx_queue[i].tx_pkts = 0;
-		internal->tx_queue[i].err_pkts = 0;
-		internal->tx_queue[i].tx_bytes = 0;
+		struct pkt_tx_queue *txq = &internal->tx_queue[i];
+		stats_reset(&txq->tx_pkts);
+		stats_reset(&txq->err_pkts);
+		stats_reset(&txq->tx_bytes);
 	}
 
 	return 0;