[v3,4/8] test/mcslock: use compiler atomics for lcores sync
Commit Message
Convert rte_atomic usages to compiler atomic built-ins for lcores
sync in mcslock testcases.
Signed-off-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
app/test/test_mcslock.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
Comments
Hi Joyce,
On Mon, Jul 19, 2021 at 10:51:21PM -0500, Joyce Kong wrote:
> Convert rte_atomic usages to compiler atomic built-ins for lcores
> sync in mcslock testcases.
>
> Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
> app/test/test_mcslock.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/app/test/test_mcslock.c b/app/test/test_mcslock.c
> index 80eaecc90a..52e45e7e2a 100644
> --- a/app/test/test_mcslock.c
> +++ b/app/test/test_mcslock.c
> @@ -17,7 +17,6 @@
> #include <rte_lcore.h>
> #include <rte_cycles.h>
> #include <rte_mcslock.h>
> -#include <rte_atomic.h>
>
> #include "test.h"
>
> @@ -43,7 +42,7 @@ rte_mcslock_t *p_ml_perf;
>
> static unsigned int count;
>
> -static rte_atomic32_t synchro;
> +static uint32_t synchro;
>
> static int
> test_mcslock_per_core(__rte_unused void *arg)
> @@ -76,8 +75,7 @@ load_loop_fn(void *func_param)
> rte_mcslock_t ml_perf_me;
>
> /* wait synchro */
> - while (rte_atomic32_read(&synchro) == 0)
> - ;
> + rte_wait_until_equal_32(&synchro, 1, __ATOMIC_RELAXED);
>
> begin = rte_get_timer_cycles();
> while (lcount < MAX_LOOP) {
> @@ -102,15 +100,15 @@ test_mcslock_perf(void)
> const unsigned int lcore = rte_lcore_id();
>
> printf("\nTest with no lock on single core...\n");
> - rte_atomic32_set(&synchro, 1);
> + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> load_loop_fn(&lock);
> printf("Core [%u] Cost Time = %"PRIu64" us\n",
> lcore, time_count[lcore]);
> memset(time_count, 0, sizeof(time_count));
>
> printf("\nTest with lock on single core...\n");
> + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> lock = 1;
> - rte_atomic32_set(&synchro, 1);
nit: is there a reason for moving this line?
> load_loop_fn(&lock);
> printf("Core [%u] Cost Time = %"PRIu64" us\n",
> lcore, time_count[lcore]);
> @@ -118,11 +116,11 @@ test_mcslock_perf(void)
>
> printf("\nTest with lock on %u cores...\n", (rte_lcore_count()));
>
> - rte_atomic32_set(&synchro, 0);
> + __atomic_store_n(&synchro, 0, __ATOMIC_RELAXED);
> rte_eal_mp_remote_launch(load_loop_fn, &lock, SKIP_MAIN);
>
> /* start synchro and launch test on main */
> - rte_atomic32_set(&synchro, 1);
> + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> load_loop_fn(&lock);
I have a more general question. Please forgive my ignorance about the
C++11 atomic builtins and memory model. Both gcc manual and C11 standard
are not that easy to understand :)
In all the patches of this patchset, __ATOMIC_RELAXED is used. My
understanding is that it does not add any inter-thread ordering
constraint. I suppose that in this particular case, we rely on
the call to rte_eal_mp_remote_launch() being a compiler barrier,
and the function itself to be a memory barrier. This ensures that
worker threads see synchro=0 until it is set to 1 by the master.
Is it correct?
What is the reason for using the atomic API here? Wouldn't a standard
assignment work too? (I mean "synchro = 1;")
>
> rte_eal_mp_wait_lcore();
> --
> 2.17.1
>
Hi Olivier,
> -----Original Message-----
> From: Olivier Matz <olivier.matz@6wind.com>
> Sent: Wednesday, July 28, 2021 5:57 PM
> To: Joyce Kong <Joyce.Kong@arm.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com;
> roretzla@linux.microsoft.com; stephen@networkplumber.org;
> andrew.rybchenko@oktetlabs.ru; harry.van.haaren@intel.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: [PATCH v3 4/8] test/mcslock: use compiler atomics for lcores
> sync
>
> Hi Joyce,
>
> On Mon, Jul 19, 2021 at 10:51:21PM -0500, Joyce Kong wrote:
> > Convert rte_atomic usages to compiler atomic built-ins for lcores sync
> > in mcslock testcases.
> >
> > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> > ---
> > app/test/test_mcslock.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/app/test/test_mcslock.c b/app/test/test_mcslock.c index
> > 80eaecc90a..52e45e7e2a 100644
> > --- a/app/test/test_mcslock.c
> > +++ b/app/test/test_mcslock.c
> > @@ -17,7 +17,6 @@
> > #include <rte_lcore.h>
> > #include <rte_cycles.h>
> > #include <rte_mcslock.h>
> > -#include <rte_atomic.h>
> >
> > #include "test.h"
> >
> > @@ -43,7 +42,7 @@ rte_mcslock_t *p_ml_perf;
> >
> > static unsigned int count;
> >
> > -static rte_atomic32_t synchro;
> > +static uint32_t synchro;
> >
> > static int
> > test_mcslock_per_core(__rte_unused void *arg) @@ -76,8 +75,7 @@
> > load_loop_fn(void *func_param)
> > rte_mcslock_t ml_perf_me;
> >
> > /* wait synchro */
> > - while (rte_atomic32_read(&synchro) == 0)
> > - ;
> > + rte_wait_until_equal_32(&synchro, 1, __ATOMIC_RELAXED);
> >
> > begin = rte_get_timer_cycles();
> > while (lcount < MAX_LOOP) {
> > @@ -102,15 +100,15 @@ test_mcslock_perf(void)
> > const unsigned int lcore = rte_lcore_id();
> >
> > printf("\nTest with no lock on single core...\n");
> > - rte_atomic32_set(&synchro, 1);
> > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > load_loop_fn(&lock);
> > printf("Core [%u] Cost Time = %"PRIu64" us\n",
> > lcore, time_count[lcore]);
> > memset(time_count, 0, sizeof(time_count));
> >
> > printf("\nTest with lock on single core...\n");
> > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > lock = 1;
> > - rte_atomic32_set(&synchro, 1);
>
> nit: is there a reason for moving this line?
I meant to use __atomic_store_n() instead of rte_atomic32_set() to set synchro,
but mistakenly put the call on the line above 'lock = 1'; will change it.
> >
> > load_loop_fn(&lock);
> > printf("Core [%u] Cost Time = %"PRIu64" us\n",
> > lcore, time_count[lcore]);
> > @@ -118,11 +116,11 @@ test_mcslock_perf(void)
> >
> > printf("\nTest with lock on %u cores...\n", (rte_lcore_count()));
> >
> > - rte_atomic32_set(&synchro, 0);
> > + __atomic_store_n(&synchro, 0, __ATOMIC_RELAXED);
> > rte_eal_mp_remote_launch(load_loop_fn, &lock, SKIP_MAIN);
> >
> > /* start synchro and launch test on main */
> > - rte_atomic32_set(&synchro, 1);
> > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > load_loop_fn(&lock);
>
> I have a more general question. Please forgive my ignorance about the
> C++11 atomic builtins and memory model. Both gcc manual and C11
> standard
> are not that easy to understand :)
>
> In all the patches of this patchset, __ATOMIC_RELAXED is used. My
> understanding is that it does not add any inter-thread ordering constraint. I
> suppose that in this particular case, we rely on the call to
> rte_eal_mp_remote_launch() being a compiler barrier, and the function itself
> to be a memory barrier. This ensures that worker threads see synchro=0
> until it is set to 1 by the master.
> Is it correct?
>
Yes, you are right. __ATOMIC_RELAXED introduces no barrier, and the worker
threads sync with the master thread through 'synchro'.
> What is the reason for using the atomic API here? Wouldn't a standard
> assignment work too? (I mean "synchro = 1;")
>
Here, __atomic_store_n(__ATOMIC_RELAXED) is used to ensure worker threads
see 'synchro=1' after it is changed by the master. A standard assignment cannot
guarantee that worker threads observe the new value.
>
> >
> > rte_eal_mp_wait_lcore();
> > --
> > 2.17.1
> >
On Thu, Jul 29, 2021 at 07:19:13AM +0000, Joyce Kong wrote:
> Hi Olivier,
>
> > -----Original Message-----
> > From: Olivier Matz <olivier.matz@6wind.com>
> > Sent: Wednesday, July 28, 2021 5:57 PM
> > To: Joyce Kong <Joyce.Kong@arm.com>
> > Cc: thomas@monjalon.net; david.marchand@redhat.com;
> > roretzla@linux.microsoft.com; stephen@networkplumber.org;
> > andrew.rybchenko@oktetlabs.ru; harry.van.haaren@intel.com; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; dev@dpdk.org; nd <nd@arm.com>
> > Subject: Re: [PATCH v3 4/8] test/mcslock: use compiler atomics for lcores
> > sync
> >
> > Hi Joyce,
> >
> > On Mon, Jul 19, 2021 at 10:51:21PM -0500, Joyce Kong wrote:
> > > Convert rte_atomic usages to compiler atomic built-ins for lcores sync
> > > in mcslock testcases.
> > >
> > > Signed-off-by: Joyce Kong <joyce.kong@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> > > ---
> > > app/test/test_mcslock.c | 14 ++++++--------
> > > 1 file changed, 6 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/app/test/test_mcslock.c b/app/test/test_mcslock.c index
> > > 80eaecc90a..52e45e7e2a 100644
> > > --- a/app/test/test_mcslock.c
> > > +++ b/app/test/test_mcslock.c
> > > @@ -17,7 +17,6 @@
> > > #include <rte_lcore.h>
> > > #include <rte_cycles.h>
> > > #include <rte_mcslock.h>
> > > -#include <rte_atomic.h>
> > >
> > > #include "test.h"
> > >
> > > @@ -43,7 +42,7 @@ rte_mcslock_t *p_ml_perf;
> > >
> > > static unsigned int count;
> > >
> > > -static rte_atomic32_t synchro;
> > > +static uint32_t synchro;
> > >
> > > static int
> > > test_mcslock_per_core(__rte_unused void *arg) @@ -76,8 +75,7 @@
> > > load_loop_fn(void *func_param)
> > > rte_mcslock_t ml_perf_me;
> > >
> > > /* wait synchro */
> > > - while (rte_atomic32_read(&synchro) == 0)
> > > - ;
> > > + rte_wait_until_equal_32(&synchro, 1, __ATOMIC_RELAXED);
> > >
> > > begin = rte_get_timer_cycles();
> > > while (lcount < MAX_LOOP) {
> > > @@ -102,15 +100,15 @@ test_mcslock_perf(void)
> > > const unsigned int lcore = rte_lcore_id();
> > >
> > > printf("\nTest with no lock on single core...\n");
> > > - rte_atomic32_set(&synchro, 1);
> > > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > > load_loop_fn(&lock);
> > > printf("Core [%u] Cost Time = %"PRIu64" us\n",
> > > lcore, time_count[lcore]);
> > > memset(time_count, 0, sizeof(time_count));
> > >
> > > printf("\nTest with lock on single core...\n");
> > > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > > lock = 1;
> > > - rte_atomic32_set(&synchro, 1);
> >
> > nit: is there a reason for moving this line?
>
> I meant to use __atomic_store_n() instead of rte_atomic32_set() to set synchro,
> but mistakenly put the call on the line above 'lock = 1'; will change it.
>
> > >
> > > load_loop_fn(&lock);
> > > printf("Core [%u] Cost Time = %"PRIu64" us\n",
> > > lcore, time_count[lcore]);
> > > @@ -118,11 +116,11 @@ test_mcslock_perf(void)
> > >
> > > printf("\nTest with lock on %u cores...\n", (rte_lcore_count()));
> > >
> > > - rte_atomic32_set(&synchro, 0);
> > > + __atomic_store_n(&synchro, 0, __ATOMIC_RELAXED);
> > > rte_eal_mp_remote_launch(load_loop_fn, &lock, SKIP_MAIN);
> > >
> > > /* start synchro and launch test on main */
> > > - rte_atomic32_set(&synchro, 1);
> > > + __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
> > > load_loop_fn(&lock);
> >
> > I have a more general question. Please forgive my ignorance about the
> > C++11 atomic builtins and memory model. Both gcc manual and C11
> > standard
> > are not that easy to understand :)
> >
> > In all the patches of this patchset, __ATOMIC_RELAXED is used. My
> > understanding is that it does not add any inter-thread ordering constraint. I
> > suppose that in this particular case, we rely on the call to
> > rte_eal_mp_remote_launch() being a compiler barrier, and the function itself
> > to be a memory barrier. This ensures that worker threads see synchro=0
> > until it is set to 1 by the master.
> > Is it correct?
> >
>
> Yes, you are right. __ATOMIC_RELAXED introduces no barrier, and the worker
> threads sync with the master thread through 'synchro'.
>
> > What is the reason for using the atomic API here? Wouldn't a standard
> > assignment work too? (I mean "synchro = 1;")
> >
>
> Here, __atomic_store_n(__ATOMIC_RELAXED) is used to ensure worker threads
> see 'synchro=1' after it is changed by the master. A standard assignment cannot
> guarantee that worker threads observe the new value.
So, if I understand correctly, using __atomic_store() acts as if the variable
were volatile, and this is indeed needed to ensure visibility to the worker
threads.
I did some tests to convince myself: https://godbolt.org/z/3qWYeneGf
Thank you for the clarification.
> >
> > >
> > > rte_eal_mp_wait_lcore();
> > > --
> > > 2.17.1
> > >
@@ -17,7 +17,6 @@
#include <rte_lcore.h>
#include <rte_cycles.h>
#include <rte_mcslock.h>
-#include <rte_atomic.h>
#include "test.h"
@@ -43,7 +42,7 @@ rte_mcslock_t *p_ml_perf;
static unsigned int count;
-static rte_atomic32_t synchro;
+static uint32_t synchro;
static int
test_mcslock_per_core(__rte_unused void *arg)
@@ -76,8 +75,7 @@ load_loop_fn(void *func_param)
rte_mcslock_t ml_perf_me;
/* wait synchro */
- while (rte_atomic32_read(&synchro) == 0)
- ;
+ rte_wait_until_equal_32(&synchro, 1, __ATOMIC_RELAXED);
begin = rte_get_timer_cycles();
while (lcount < MAX_LOOP) {
@@ -102,15 +100,15 @@ test_mcslock_perf(void)
const unsigned int lcore = rte_lcore_id();
printf("\nTest with no lock on single core...\n");
- rte_atomic32_set(&synchro, 1);
+ __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
load_loop_fn(&lock);
printf("Core [%u] Cost Time = %"PRIu64" us\n",
lcore, time_count[lcore]);
memset(time_count, 0, sizeof(time_count));
printf("\nTest with lock on single core...\n");
+ __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
lock = 1;
- rte_atomic32_set(&synchro, 1);
load_loop_fn(&lock);
printf("Core [%u] Cost Time = %"PRIu64" us\n",
lcore, time_count[lcore]);
@@ -118,11 +116,11 @@ test_mcslock_perf(void)
printf("\nTest with lock on %u cores...\n", (rte_lcore_count()));
- rte_atomic32_set(&synchro, 0);
+ __atomic_store_n(&synchro, 0, __ATOMIC_RELAXED);
rte_eal_mp_remote_launch(load_loop_fn, &lock, SKIP_MAIN);
/* start synchro and launch test on main */
- rte_atomic32_set(&synchro, 1);
+ __atomic_store_n(&synchro, 1, __ATOMIC_RELAXED);
load_loop_fn(&lock);
rte_eal_mp_wait_lcore();