[2/2] service: fix potential stats race-condition on MT services

Message ID 20220708125645.3141464-2-harry.van.haaren@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [1/2] test/service: add perf measurements for with stats mode

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/intel-Testing success Testing PASS

Commit Message

Van Haaren, Harry July 8, 2022, 12:56 p.m. UTC
  This commit fixes a potential racy add that could occur if
multiple service-lcores were executing the same MT-safe service
at the same time, with service statistics collection enabled.

Because multiple threads can run and execute the service, the
stats values can have multiple writer threads, so atomic addition
is required for correctness.

Note that when an MT unsafe service is executed, a spinlock is
held, so the stats increments are protected. This fact is used
to avoid executing atomic add instructions when not required.

This patch causes a 1.25x increase in cycle-cost for polling an
MT safe service when statistics are enabled. No change was seen
for MT unsafe services, or when statistics are disabled.

Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Suggested-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---
---
 lib/eal/common/rte_service.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
  

Comments

Morten Brørup July 8, 2022, 1:23 p.m. UTC | #1
> From: Harry van Haaren [mailto:harry.van.haaren@intel.com]
> Sent: Friday, 8 July 2022 14.57
> 
> This commit fixes a potential racey-add that could occur if
> multiple service-lcores were executing the same MT-safe service
> at the same time, with service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the
> stats values can have multiple writer threads, resulting in the
> requirement of using atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is
> held, so the stats increments are protected. This fact is used
> to avoid executing atomic add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a
> MT safe service when statistics are enabled. No change was seen
> for MT unsafe services, or when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---

[...]

> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles,
> __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
> +		}

Have you considered the performance cost of the __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the branch to compare if the service is MT safe? It might be cheaper to just always use the atomic addition. I don't know, just mentioning that the compare-and-branch also has a cost.

I'm not familiar with the DPDK services library, so perhaps MT safe and MT unsafe services are never mixed, in which case the branch will always take the same path, so branch prediction will eliminate the cost of branching.
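
For reference, the branch-free "always atomic" alternative raised here would look roughly as follows. This is a sketch with a stand-in struct for the relevant stats fields; the posted patch keeps the service_mt_safe() branch.

#include <stdint.h>

/* Stand-in for the stats fields of struct rte_service_spec_impl. */
struct svc_stats {
	uint64_t calls;
	uint64_t cycles_spent;
};

static inline void
stats_add_always_atomic(struct svc_stats *s, uint64_t cycles)
{
	/* Unconditional atomic adds: correct for both MT-safe and
	 * MT-unsafe services, at the cost of an atomic read-modify-write
	 * on every service call. */
	__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
	__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
}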
  
Van Haaren, Harry July 8, 2022, 1:44 p.m. UTC | #2
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Friday, July 8, 2022 2:23 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services

<snip commit message, focus on performance data>

> > This patch causes a 1.25x increase in cycle-cost for polling a
> > MT safe service when statistics are enabled. No change was seen
> > for MT unsafe services, or when statistics are disabled.
> >
> > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> 
> [...]
> 
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> > +		}
> 
> Have you considered the performance cost of the
> __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the
> branch to compare if the service is MT safe? It might be cheaper to just always use
> the atomic addition. I don't know, just mentioning that the compare-and-branch also
> has a cost.

Great question!

> I'm not familiar with the DPDK services library, so perhaps MT safe and MT unsafe
> services are never mixed, in which case the branch will always take the same path,
> so branch prediction will eliminate the cost of branching.

MT safe & unsafe services can be mixed, yes, so you're right, there may be mis-predicts. Note that,
assuming a service is actually doing something useful, there are likely quite a few branches
between each call... so it's unknown how fresh/accurate the branch history will be.

The common case is for many services to be "mt unsafe" (think polling an ethdev queue).
In this case, it is nice to try to reduce cost. Given this is likely the highest quantity of services,
we'd like the performance here to be impacted the least; the branch method achieves that.

I did benchmark the "always use atomic" case, and it caused a ~30 cycle hit in the "mt unsafe" case,
where the atomic is not required (and hence the performance hit can be avoided by branching).
Given branch misses cost roughly 15-20 cycles (uarch dependent), attempting to avoid the
atomic still makes sense from a cycle-cost perspective too, I think.

I did spend the morning benchmarking solutions (hence the patch split,
to allow easy benchmarking before & after), so thanks for asking!

Regards, -Harry
  
Mattias Rönnblom July 8, 2022, 1:48 p.m. UTC | #3
On 2022-07-08 14:56, Harry van Haaren wrote:
> This commit fixes a potential racey-add that could occur if
> multiple service-lcores were executing the same MT-safe service
> at the same time, with service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the
> stats values can have multiple writer threads, resulting in the
> requirement of using atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is
> held, so the stats increments are protected. This fact is used
> to avoid executing atomic add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a
> MT safe service when statistics are enabled. No change was seen
> for MT unsafe services, or when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---
> ---
>   lib/eal/common/rte_service.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index ef31b1f63c..f045e74ef3 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
>   		uint64_t start = rte_rdtsc();
>   		s->spec.callback(userdata);
>   		uint64_t end = rte_rdtsc();
> -		s->cycles_spent += end - start;
> +		uint64_t cycles = end - start;
>   		cs->calls_per_service[service_idx]++;
> -		s->calls++;
> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
> +		}
>   	} else
>   		s->spec.callback(userdata);
>   }

Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Thanks for the fix Harry.
  
Morten Brørup July 8, 2022, 2:14 p.m. UTC | #4
> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> Sent: Friday, 8 July 2022 15.45
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Friday, July 8, 2022 2:23 PM
> 
> <snip commit message, focus on performance data>
> 
> > > This patch causes a 1.25x increase in cycle-cost for polling a
> > > MT safe service when statistics are enabled. No change was seen
> > > for MT unsafe services, or when statistics are disabled.
> > >
> > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > ---
> >
> > [...]
> >
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > > +		}
> >
> > Have you considered the performance cost of the
> > __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of
> the
> > branch to compare if the service is MT safe? It might be cheaper to
> just always use
> > the atomic addition. I don't know, just mentioning that the compare-
> and-branch also
> > has a cost.
> 
> Great question!
> 
> > I'm not familiar with the DPDK services library, so perhaps MT safe
> and MT unsafe
> > services are never mixed, in which case the branch will always take
> the same path,
> > so branch prediction will eliminate the cost of branching.
> 
> MT safe & unsafe can be mixed yes, so you're right, there may be mis-
> predicts. Note that
> assuming a service is actually doing something useful, there's likely
> quite a few branches
> between each call.. so unknown how fresh/accurate the branch history
> will be.
> 
> The common case is for many services to be "mt unsafe" (think polling
> an ethdev queue).
> In this case, it is nice to try reduce cost. Given this is likely the
> highest quantity of services,
> we'd like the performance here to be reduced the least. The branch
> method achieves that.
> 
> I did benchmark the "always use atomic" case, and it caused a ~30cycle
> hit in the "mt unsafe" case,
> where the atomic is not required (and hence the performance hit can be
> avoided by branching).
> Given branch-misses are handled between 15-20 cycles (uarch dependent),
> attempting to avoid the
> atomic still makes sense from cycle-cost perspective too I think..
> 
> I did spend the morning benchmarking solutions (and hence the patch
> split,
> to allow easy benchmarking before & after), so thanks for asking!
> 
> Regards, -Harry

Thank you for elaborating, Harry. I am impressed with the considerations you have put into this, and have no further concerns.

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
  
Honnappa Nagarahalli July 8, 2022, 3:16 p.m. UTC | #5
<snip>
> 
> This commit fixes a potential racey-add that could occur if multiple service-
> lcores were executing the same MT-safe service at the same time, with
> service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the stats values
> can have multiple writer threads, resulting in the requirement of using
> atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is held, so the
> stats increments are protected. This fact is used to avoid executing atomic
> add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a MT safe service
> when statistics are enabled. No change was seen for MT unsafe services, or
> when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---
> ---
>  lib/eal/common/rte_service.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index ef31b1f63c..f045e74ef3 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> rte_service_spec_impl *s,
>  		uint64_t start = rte_rdtsc();
>  		s->spec.callback(userdata);
>  		uint64_t end = rte_rdtsc();
> -		s->cycles_spent += end - start;
> +		uint64_t cycles = end - start;
>  		cs->calls_per_service[service_idx]++;
> -		s->calls++;
> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles,
> __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1,
> __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
This is still a problem from a reader perspective. It is possible that the writes could be split while a reader is reading the stats. These need to be atomic adds.

> +		}
>  	} else
>  		s->spec.callback(userdata);
>  }
> --
> 2.32.0
  
Van Haaren, Harry July 8, 2022, 3:31 p.m. UTC | #6
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, July 8, 2022 4:16 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services

<snip previous discussions>

> > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> > index ef31b1f63c..f045e74ef3 100644
> > --- a/lib/eal/common/rte_service.c
> > +++ b/lib/eal/common/rte_service.c
> > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >  		uint64_t start = rte_rdtsc();
> >  		s->spec.callback(userdata);
> >  		uint64_t end = rte_rdtsc();
> > -		s->cycles_spent += end - start;
> > +		uint64_t cycles = end - start;
> >  		cs->calls_per_service[service_idx]++;
> > -		s->calls++;
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1,
> > __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> This is still a problem from a reader perspective. It is possible that the writes could be
> split while a reader is reading the stats. These need to be atomic adds.

Thanks for pointing that out; I tend to "think" in x86 terms of load/store tearing, and on x86
naturally aligned loads/stores will not tear. Apologies for missing the ARM angle here.

I'm not sure how best to encode the difference between tearing & "locked instructions"
to make things multi-writer safe. They're not the same thing, and I'd prefer not to pay
the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.

Is there an rte_atomic-* type that is guaranteed to be non-tearing?

If so, would changing the type of the calls/cycles_spent variables to such a type, to ensure "non-tearing"
single-reader, single-writer behaviour, be enough, instead of forcing __atomic_fetch_add() everywhere?
  
Bruce Richardson July 8, 2022, 4:21 p.m. UTC | #7
On Fri, Jul 08, 2022 at 03:31:01PM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Sent: Friday, July 8, 2022 4:16 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> > Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> > <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> 
> <snip previous discussions>
> 
> > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> > > index ef31b1f63c..f045e74ef3 100644
> > > --- a/lib/eal/common/rte_service.c
> > > +++ b/lib/eal/common/rte_service.c
> > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > rte_service_spec_impl *s,
> > >  		uint64_t start = rte_rdtsc();
> > >  		s->spec.callback(userdata);
> > >  		uint64_t end = rte_rdtsc();
> > > -		s->cycles_spent += end - start;
> > > +		uint64_t cycles = end - start;
> > >  		cs->calls_per_service[service_idx]++;
> > > -		s->calls++;
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1,
> > > __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > This is still a problem from a reader perspective. It is possible that the writes could be
> > split while a reader is reading the stats. These need to be atomic adds.
> 
> Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86
> naturally aligned load/stores will not tear. Apologies for missing the ARM angle here.
> 
> I'm not sure how to best encode the difference between tearing & "locked instructions"
> to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay
> the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.
> 
> Is there an rte_atomic-* type that is guaranteed as non-tearing?
> 
> In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing"
> single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?

Regular read, increment and then atomic store should work without locks
where alignment is correct on most architectures, and doing the store
atomically should prevent any tearing.
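
For illustration, a minimal sketch of the pattern Bruce describes, assuming a single writer per counter and naturally aligned 64-bit fields (stand-in names, not code from the patch); readers would pair it with an atomic load:

#include <stdint.h>

/* Single-writer counter update: plain read and add into a local, then a
 * relaxed atomic store. The atomic store prevents the 64-bit write from
 * being torn; no locked read-modify-write is needed. */
static inline void
counter_add(uint64_t *counter, uint64_t value)
{
	uint64_t new_val = *counter + value;	/* only this thread writes */

	__atomic_store_n(counter, new_val, __ATOMIC_RELAXED);
}

/* Reader side: a relaxed atomic load returns a non-torn snapshot. */
static inline uint64_t
counter_read(const uint64_t *counter)
{
	return __atomic_load_n(counter, __ATOMIC_RELAXED);
}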
  
Morten Brørup July 8, 2022, 4:29 p.m. UTC | #8
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 17.16
> 
> <snip>
> >
> > This commit fixes a potential racey-add that could occur if multiple
> service-
> > lcores were executing the same MT-safe service at the same time, with
> > service statistics collection enabled.
> >
> > Because multiple threads can run and execute the service, the stats
> values
> > can have multiple writer threads, resulting in the requirement of
> using
> > atomic addition for correctness.
> >
> > Note that when a MT unsafe service is executed, a spinlock is held,
> so the
> > stats increments are protected. This fact is used to avoid executing
> atomic
> > add instructions when not required.
> >
> > This patch causes a 1.25x increase in cycle-cost for polling a MT
> safe service
> > when statistics are enabled. No change was seen for MT unsafe
> services, or
> > when statistics are disabled.
> >
> > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> > ---
> >  lib/eal/common/rte_service.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/eal/common/rte_service.c
> b/lib/eal/common/rte_service.c
> > index ef31b1f63c..f045e74ef3 100644
> > --- a/lib/eal/common/rte_service.c
> > +++ b/lib/eal/common/rte_service.c
> > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >  		uint64_t start = rte_rdtsc();
> >  		s->spec.callback(userdata);
> >  		uint64_t end = rte_rdtsc();
> > -		s->cycles_spent += end - start;
> > +		uint64_t cycles = end - start;
> >  		cs->calls_per_service[service_idx]++;
> > -		s->calls++;
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1,
> > __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> This is still a problem from a reader perspective. It is possible that
> the writes could be split while a reader is reading the stats. These
> need to be atomic adds.

I don't understand what you suggest can go wrong here, Honnappa. If you are talking about 64 bit counters on 32 bit architectures, then I understand the problem (and have many years of direct experience with it myself). Otherwise, I hope you can elaborate or direct me to educational material about the issue, considering this a learning opportunity. :-)

> 
> > +		}
> >  	} else
> >  		s->spec.callback(userdata);
> >  }
> > --
> > 2.32.0
  
Honnappa Nagarahalli July 8, 2022, 4:33 p.m. UTC | #9
> > <snip previous discussions>
> >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > > b/lib/eal/common/rte_service.c index ef31b1f63c..f045e74ef3 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > rte_service_spec_impl *s,
> > > >  		uint64_t start = rte_rdtsc();
> > > >  		s->spec.callback(userdata);
> > > >  		uint64_t end = rte_rdtsc();
> > > > -		s->cycles_spent += end - start;
> > > > +		uint64_t cycles = end - start;
> > > >  		cs->calls_per_service[service_idx]++;
> > > > -		s->calls++;
> > > > +		if (service_mt_safe(s)) {
> > > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > > __ATOMIC_RELAXED);
> > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > __ATOMIC_RELAXED);
> > > > +		} else {
> > > > +			s->cycles_spent += cycles;
> > > > +			s->calls++;
> > > This is still a problem from a reader perspective. It is possible
> > > that the writes could be split while a reader is reading the stats. These
> need to be atomic adds.
> >
> > Thanks for pointing out; I do "think" in x86 in terms of load/store
> > tearing; and on x86 naturally aligned load/stores will not tear. Apologies for
> missing the ARM angle here.
The Arm architecture has similar properties as well. I believe the compiler does not provide any guarantees that it will only generate non-tearing instructions. Refer to a past discussion [1] [2] where it was thought that the compiler would generate a non-tearing store (this is a slightly different scenario).

[1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9f6f@intel.com/
[2] Refer to commit '316095eb41ed22da808b5826404c398917c83c89'

> >
> > I'm not sure how to best encode the difference between tearing & "locked
> instructions"
> > to make things multi-writer safe. But they're not the same thing, and
> > I'd prefer not pay the penalty for LOCK instructions (multi-writer) only to
> satisfy the non-tearing requirements.
> >
> > Is there an rte_atomic-* type that is guaranteed as non-tearing?
Nothing that I know of.

> >
> > In that case, changing the type of the calls/cycles_spent variables to such a
> type to ensure "non-tearing"
> > single-reader, single-writer behaviour is enough, instead of forcing
> __atomic_fetch_add() everywhere?
> 
> Regular read, increment and then atomic store should work without locks
> where alignment is correct on most architectures, and doing the store
> atomically should prevent any tearing.
Agree, we could do this; it will provide more flexibility for the micro-architecture to work with. It would be good to understand the perf benefits vs the complexity and the branch cost.
  
Honnappa Nagarahalli July 8, 2022, 4:45 p.m. UTC | #10
<snip>
> > >
> > > This commit fixes a potential racey-add that could occur if multiple
> > service-
> > > lcores were executing the same MT-safe service at the same time,
> > > with service statistics collection enabled.
> > >
> > > Because multiple threads can run and execute the service, the stats
> > values
> > > can have multiple writer threads, resulting in the requirement of
> > using
> > > atomic addition for correctness.
> > >
> > > Note that when a MT unsafe service is executed, a spinlock is held,
> > so the
> > > stats increments are protected. This fact is used to avoid executing
> > atomic
> > > add instructions when not required.
> > >
> > > This patch causes a 1.25x increase in cycle-cost for polling a MT
> > safe service
> > > when statistics are enabled. No change was seen for MT unsafe
> > services, or
> > > when statistics are disabled.
> > >
> > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > ---
> > > ---
> > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/eal/common/rte_service.c
> > b/lib/eal/common/rte_service.c
> > > index ef31b1f63c..f045e74ef3 100644
> > > --- a/lib/eal/common/rte_service.c
> > > +++ b/lib/eal/common/rte_service.c
> > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > rte_service_spec_impl *s,
> > >  		uint64_t start = rte_rdtsc();
> > >  		s->spec.callback(userdata);
> > >  		uint64_t end = rte_rdtsc();
> > > -		s->cycles_spent += end - start;
> > > +		uint64_t cycles = end - start;
> > >  		cs->calls_per_service[service_idx]++;
> > > -		s->calls++;
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1,
> > > __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > This is still a problem from a reader perspective. It is possible that
> > the writes could be split while a reader is reading the stats. These
> > need to be atomic adds.
> 
> I don't understand what you suggest can go wrong here, Honnappa. If you
> talking about 64 bit counters on 32 bit architectures, then I understand the
> problem (and have many years of direct experience with it myself).
> Otherwise, I hope you can elaborate or direct me to educational material
> about the issue, considering this a learning opportunity. :-)
I am thinking of the case where the 64b write is split into two 32b (or more) write operations, either by the compiler or the micro-architecture. If this were to happen, it would cause race conditions with the reader.

As far as I understand, the compiler does not provide any guarantees on generating non-tearing stores unless an atomic builtin/function is used. If we have to ensure the micro-architecture does not generate split writes, we need to be careful that future code additions do not change the alignment of the stats.
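
To illustrate the concern (hypothetical names, not code from the patch): on a 32-bit target a plain 64-bit access may be compiled as two separate 32-bit operations, so a concurrent reader can observe a half-updated value.

#include <stdint.h>

uint64_t calls;	/* 64-bit counter, single writer, concurrent readers */

void
writer(void)
{
	/* On a 32-bit architecture the compiler may emit this as two
	 * 32-bit stores (low half, then high half). A reader running
	 * between the two stores sees a mix of old and new halves,
	 * e.g. when going from 0x00000000ffffffff to 0x0000000100000000
	 * it may read 0x0000000000000000 or 0x00000001ffffffff. */
	calls++;
}

uint64_t
reader(void)
{
	/* A plain 64-bit load can likewise be split into two 32-bit
	 * loads, with the same torn-value problem. Using
	 * __atomic_load_n()/__atomic_store_n() avoids both issues. */
	return calls;
}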

> 
> >
> > > +		}
> > >  	} else
> > >  		s->spec.callback(userdata);
> > >  }
> > > --
> > > 2.32.0
  
Morten Brørup July 8, 2022, 5:22 p.m. UTC | #11
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 18.46
> 
> <snip>
> > > >
> > > > This commit fixes a potential racey-add that could occur if
> multiple
> > > service-
> > > > lcores were executing the same MT-safe service at the same time,
> > > > with service statistics collection enabled.
> > > >
> > > > Because multiple threads can run and execute the service, the
> stats
> > > values
> > > > can have multiple writer threads, resulting in the requirement of
> > > using
> > > > atomic addition for correctness.
> > > >
> > > > Note that when a MT unsafe service is executed, a spinlock is
> held,
> > > so the
> > > > stats increments are protected. This fact is used to avoid
> executing
> > > atomic
> > > > add instructions when not required.
> > > >
> > > > This patch causes a 1.25x increase in cycle-cost for polling a MT
> > > safe service
> > > > when statistics are enabled. No change was seen for MT unsafe
> > > services, or
> > > > when statistics are disabled.
> > > >
> > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > >
> > > > ---
> > > > ---
> > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > b/lib/eal/common/rte_service.c
> > > > index ef31b1f63c..f045e74ef3 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > rte_service_spec_impl *s,
> > > >  		uint64_t start = rte_rdtsc();
> > > >  		s->spec.callback(userdata);
> > > >  		uint64_t end = rte_rdtsc();
> > > > -		s->cycles_spent += end - start;
> > > > +		uint64_t cycles = end - start;
> > > >  		cs->calls_per_service[service_idx]++;
> > > > -		s->calls++;
> > > > +		if (service_mt_safe(s)) {
> > > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > > __ATOMIC_RELAXED);
> > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > __ATOMIC_RELAXED);
> > > > +		} else {
> > > > +			s->cycles_spent += cycles;
> > > > +			s->calls++;
> > > This is still a problem from a reader perspective. It is possible
> that
> > > the writes could be split while a reader is reading the stats.
> These
> > > need to be atomic adds.
> >
> > I don't understand what you suggest can go wrong here, Honnappa. If
> you
> > talking about 64 bit counters on 32 bit architectures, then I
> understand the
> > problem (and have many years of direct experience with it myself).
> > Otherwise, I hope you can elaborate or direct me to educational
> material
> > about the issue, considering this a learning opportunity. :-)
> I am thinking of the case where the 64b write is split into two 32b (or
> more) write operations either by the compiler or the micro-
> architecture. If this were to happen, it causes race conditions with
> the reader.
> 
> As far as I understand, the compiler does not provide any guarantees on
> generating non-tearing stores unless an atomic builtin/function is
> used.

This seems like a generic problem for all 64b statistics counters in DPDK, and any other C code using 64 bit counters. Being a generic C problem, there is probably a generic solution to it.

Is any compiler going to do something that stupid (i.e. tearing a store into multiple write operations) to a simple 64b counter on any 64 bit architecture (assuming the counter is 64b aligned)? Otherwise, we might only need to take special precautions for 32 bit architectures.

> If we have to ensure the micro-architecture does not generate
> split writes, we need to be careful that future code additions do not
> change the alignment of the stats.

Unless the structure containing the stats counters is packed, the contained 64b counters will be 64b aligned (on 64 bit architecture). So we should not worry about alignment, except perhaps on 32 bit architectures.
  
Honnappa Nagarahalli July 8, 2022, 5:39 p.m. UTC | #12
<snip>
> > > > >
> > > > > This commit fixes a potential racey-add that could occur if
> > multiple
> > > > service-
> > > > > lcores were executing the same MT-safe service at the same time,
> > > > > with service statistics collection enabled.
> > > > >
> > > > > Because multiple threads can run and execute the service, the
> > stats
> > > > values
> > > > > can have multiple writer threads, resulting in the requirement
> > > > > of
> > > > using
> > > > > atomic addition for correctness.
> > > > >
> > > > > Note that when a MT unsafe service is executed, a spinlock is
> > held,
> > > > so the
> > > > > stats increments are protected. This fact is used to avoid
> > executing
> > > > atomic
> > > > > add instructions when not required.
> > > > >
> > > > > This patch causes a 1.25x increase in cycle-cost for polling a
> > > > > MT
> > > > safe service
> > > > > when statistics are enabled. No change was seen for MT unsafe
> > > > services, or
> > > > > when statistics are disabled.
> > > > >
> > > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > > Suggested-by: Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>
> > > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > > >
> > > > > ---
> > > > > ---
> > > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/eal/common/rte_service.c
> > > > b/lib/eal/common/rte_service.c
> > > > > index ef31b1f63c..f045e74ef3 100644
> > > > > --- a/lib/eal/common/rte_service.c
> > > > > +++ b/lib/eal/common/rte_service.c
> > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > > rte_service_spec_impl *s,
> > > > >  		uint64_t start = rte_rdtsc();
> > > > >  		s->spec.callback(userdata);
> > > > >  		uint64_t end = rte_rdtsc();
> > > > > -		s->cycles_spent += end - start;
> > > > > +		uint64_t cycles = end - start;
> > > > >  		cs->calls_per_service[service_idx]++;
> > > > > -		s->calls++;
> > > > > +		if (service_mt_safe(s)) {
> > > > > +			__atomic_fetch_add(&s->cycles_spent,
> cycles,
> > > > > __ATOMIC_RELAXED);
> > > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > > __ATOMIC_RELAXED);
> > > > > +		} else {
> > > > > +			s->cycles_spent += cycles;
> > > > > +			s->calls++;
> > > > This is still a problem from a reader perspective. It is possible
> > that
> > > > the writes could be split while a reader is reading the stats.
> > These
> > > > need to be atomic adds.
> > >
> > > I don't understand what you suggest can go wrong here, Honnappa. If
> > you
> > > talking about 64 bit counters on 32 bit architectures, then I
> > understand the
> > > problem (and have many years of direct experience with it myself).
> > > Otherwise, I hope you can elaborate or direct me to educational
> > material
> > > about the issue, considering this a learning opportunity. :-)
> > I am thinking of the case where the 64b write is split into two 32b
> > (or
> > more) write operations either by the compiler or the micro-
> > architecture. If this were to happen, it causes race conditions with
> > the reader.
> >
> > As far as I understand, the compiler does not provide any guarantees
> > on generating non-tearing stores unless an atomic builtin/function is
> > used.
> 
> This seems like a generic problem for all 64b statistics counters in DPDK, and
> any other C code using 64 bit counters. Being a generic C problem, there is
> probably a generic solution to it.
Browsing through the code, I see similar problems elsewhere.

> 
> Is any compiler going to do something that stupid (i.e. tearing a store into
> multiple write operations) to a simple 64b counter on any 64 bit architecture
> (assuming the counter is 64b aligned)? Otherwise, we might only need to
> take special precautions for 32 bit architectures.
It is always a debate about who is stupid, the compiler or the programmer 😊

Though not the same case, you can look at this discussion where the compiler generated torn stores [1], when we all thought it had been generating a 64b store.

[1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9f6f@intel.com/

> 
> > If we have to ensure the micro-architecture does not generate split
> > writes, we need to be careful that future code additions do not change
> > the alignment of the stats.
> 
> Unless the structure containing the stats counters is packed, the contained
> 64b counters will be 64b aligned (on 64 bit architecture). So we should not
> worry about alignment, except perhaps on 32 bit architectures.
Agree, future code changes need to be aware of these issues and DPDK supports 32b architectures.
  
Morten Brørup July 8, 2022, 6:08 p.m. UTC | #13
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 19.40
> 
> <snip>
> > > > > >
> > > > > > This commit fixes a potential racey-add that could occur if
> > > multiple
> > > > > service-
> > > > > > lcores were executing the same MT-safe service at the same
> time,
> > > > > > with service statistics collection enabled.
> > > > > >
> > > > > > Because multiple threads can run and execute the service, the
> > > stats
> > > > > values
> > > > > > can have multiple writer threads, resulting in the
> requirement
> > > > > > of
> > > > > using
> > > > > > atomic addition for correctness.
> > > > > >
> > > > > > Note that when a MT unsafe service is executed, a spinlock is
> > > held,
> > > > > so the
> > > > > > stats increments are protected. This fact is used to avoid
> > > executing
> > > > > atomic
> > > > > > add instructions when not required.
> > > > > >
> > > > > > This patch causes a 1.25x increase in cycle-cost for polling
> a
> > > > > > MT
> > > > > safe service
> > > > > > when statistics are enabled. No change was seen for MT unsafe
> > > > > services, or
> > > > > > when statistics are disabled.
> > > > > >
> > > > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > <Honnappa.Nagarahalli@arm.com>
> > > > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > > > >
> > > > > > ---
> > > > > > ---
> > > > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/eal/common/rte_service.c
> > > > > b/lib/eal/common/rte_service.c
> > > > > > index ef31b1f63c..f045e74ef3 100644
> > > > > > --- a/lib/eal/common/rte_service.c
> > > > > > +++ b/lib/eal/common/rte_service.c
> > > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > > > rte_service_spec_impl *s,
> > > > > >  		uint64_t start = rte_rdtsc();
> > > > > >  		s->spec.callback(userdata);
> > > > > >  		uint64_t end = rte_rdtsc();
> > > > > > -		s->cycles_spent += end - start;
> > > > > > +		uint64_t cycles = end - start;
> > > > > >  		cs->calls_per_service[service_idx]++;
> > > > > > -		s->calls++;
> > > > > > +		if (service_mt_safe(s)) {
> > > > > > +			__atomic_fetch_add(&s->cycles_spent,
> > cycles,
> > > > > > __ATOMIC_RELAXED);
> > > > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > > > __ATOMIC_RELAXED);
> > > > > > +		} else {
> > > > > > +			s->cycles_spent += cycles;
> > > > > > +			s->calls++;
> > > > > This is still a problem from a reader perspective. It is
> possible
> > > that
> > > > > the writes could be split while a reader is reading the stats.
> > > These
> > > > > need to be atomic adds.
> > > >
> > > > I don't understand what you suggest can go wrong here, Honnappa.
> If
> > > you
> > > > talking about 64 bit counters on 32 bit architectures, then I
> > > understand the
> > > > problem (and have many years of direct experience with it
> myself).
> > > > Otherwise, I hope you can elaborate or direct me to educational
> > > material
> > > > about the issue, considering this a learning opportunity. :-)
> > > I am thinking of the case where the 64b write is split into two 32b
> > > (or
> > > more) write operations either by the compiler or the micro-
> > > architecture. If this were to happen, it causes race conditions
> with
> > > the reader.
> > >
> > > As far as I understand, the compiler does not provide any
> guarantees
> > > on generating non-tearing stores unless an atomic builtin/function
> is
> > > used.
> >
> > This seems like a generic problem for all 64b statistics counters in
> DPDK, and
> > any other C code using 64 bit counters. Being a generic C problem,
> there is
> > probably a generic solution to it.
> Browsing through the code, I see similar problems elsewhere.
> 
> >
> > Is any compiler going to do something that stupid (i.e. tearing a
> store into
> > multiple write operations) to a simple 64b counter on any 64 bit
> architecture
> > (assuming the counter is 64b aligned)? Otherwise, we might only need
> to
> > take special precautions for 32 bit architectures.
> It is always a debate on who is stupid, compiler or programmer 😊

Compilers will never stop surprising me. Thankfully, they are not so unreliable and full of bugs as they were 25 years ago. :-)

> 
> Though not the same case, you can look at this discussion where
> compiler generated torn stores [1] when we all thought it has been
> generating a 64b store.
> 
> [1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-
> 4994f2bc9f6f@intel.com/

Good reference.

Technically, this sets a bunch of fields in the rte_lpm_tbl_entry structure (which happens to be 32b in total size), so it is not completely unreasonable for the compiler to store those fields individually. I wonder if using a union to cast the rte_lpm_tbl_entry struct to uint32_t (and ensuring 32b alignment) would have solved the problem, and the __atomic_store() could be avoided?
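
A rough sketch of the union idea floated here, using an illustrative bitfield layout (not the actual rte_lpm_tbl_entry definition):

#include <stdint.h>

/* Hypothetical 32-bit entry with bitfields, standing in for a struct
 * like rte_lpm_tbl_entry; the real field layout may differ. */
struct entry_fields {
	uint32_t next_hop :24;
	uint32_t valid    :1;
	uint32_t depth    :7;
};

union entry {
	struct entry_fields f;
	uint32_t raw;
};

static void
entry_set(union entry *dst, struct entry_fields new_fields)
{
	union entry tmp = { .f = new_fields };

	/* Compose the fields in a local, then publish them with a single
	 * 32-bit atomic store so the individual bitfields can never be
	 * written (and observed) piecemeal. */
	__atomic_store_n(&dst->raw, tmp.raw, __ATOMIC_RELAXED);
}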

> 
> >
> > > If we have to ensure the micro-architecture does not generate split
> > > writes, we need to be careful that future code additions do not
> change
> > > the alignment of the stats.
> >
> > Unless the structure containing the stats counters is packed, the
> contained
> > 64b counters will be 64b aligned (on 64 bit architecture). So we
> should not
> > worry about alignment, except perhaps on 32 bit architectures.
> Agree, future code changes need to be aware of these issues and DPDK
> supports 32b architectures.
  
Mattias Rönnblom July 8, 2022, 8:02 p.m. UTC | #14
On 2022-07-08 18:21, Bruce Richardson wrote:
> On Fri, Jul 08, 2022 at 03:31:01PM +0000, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>>> Sent: Friday, July 8, 2022 4:16 PM
>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>>> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
>>> <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
>>> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
>>
>> <snip previous discussions>
>>
>>>> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
>>>> index ef31b1f63c..f045e74ef3 100644
>>>> --- a/lib/eal/common/rte_service.c
>>>> +++ b/lib/eal/common/rte_service.c
>>>> @@ -363,9 +363,15 @@ service_runner_do_callback(struct
>>>> rte_service_spec_impl *s,
>>>>   		uint64_t start = rte_rdtsc();
>>>>   		s->spec.callback(userdata);
>>>>   		uint64_t end = rte_rdtsc();
>>>> -		s->cycles_spent += end - start;
>>>> +		uint64_t cycles = end - start;
>>>>   		cs->calls_per_service[service_idx]++;
>>>> -		s->calls++;
>>>> +		if (service_mt_safe(s)) {
>>>> +			__atomic_fetch_add(&s->cycles_spent, cycles,
>>>> __ATOMIC_RELAXED);
>>>> +			__atomic_fetch_add(&s->calls, 1,
>>>> __ATOMIC_RELAXED);
>>>> +		} else {
>>>> +			s->cycles_spent += cycles;
>>>> +			s->calls++;
>>> This is still a problem from a reader perspective. It is possible that the writes could be
>>> split while a reader is reading the stats. These need to be atomic adds.
>>
>> Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86
>> naturally aligned load/stores will not tear. Apologies for missing the ARM angle here.
>>
>> I'm not sure how to best encode the difference between tearing & "locked instructions"
>> to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay
>> the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.
>>
>> Is there an rte_atomic-* type that is guaranteed as non-tearing?
>>
>> In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing"
>> single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?
> 
> Regular read, increment and then atomic store should work without locks
> where alignment is correct on most architectures, and doing the store
> atomically should prevent any tearing.

This is a good pattern, and in my experience provides a noticeable 
performance benefit compared to using an atomic add (which is overkill 
in single-writer scenarios). The DPDK seqcount implementation 
increments the count in this manner.
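
Pulling the thread together, a sketch of how the stats update could look if the MT-unsafe (single-writer) path used the plain-read/increment/atomic-store pattern instead of non-atomic adds. Names are stand-ins for the fields and helper in the quoted diff; this is an illustration of the direction discussed, not the posted patch.

#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the stats fields of struct rte_service_spec_impl. */
struct svc_stats {
	uint64_t calls;
	uint64_t cycles_spent;
};

static void
svc_stats_update(struct svc_stats *s, uint64_t cycles, bool mt_safe)
{
	if (mt_safe) {
		/* Multiple concurrent writers: full atomic RMW required. */
		__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
		__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
	} else {
		/* Single writer (the service spinlock is held): plain read
		 * and add, then an atomic store so a concurrent reader can
		 * never observe a torn 64-bit value. */
		__atomic_store_n(&s->cycles_spent, s->cycles_spent + cycles,
				__ATOMIC_RELAXED);
		__atomic_store_n(&s->calls, s->calls + 1, __ATOMIC_RELAXED);
	}
}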
  

Patch

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index ef31b1f63c..f045e74ef3 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -363,9 +363,15 @@  service_runner_do_callback(struct rte_service_spec_impl *s,
 		uint64_t start = rte_rdtsc();
 		s->spec.callback(userdata);
 		uint64_t end = rte_rdtsc();
-		s->cycles_spent += end - start;
+		uint64_t cycles = end - start;
 		cs->calls_per_service[service_idx]++;
-		s->calls++;
+		if (service_mt_safe(s)) {
+			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
+			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
+		} else {
+			s->cycles_spent += cycles;
+			s->calls++;
+		}
 	} else
 		s->spec.callback(userdata);
 }