[v6,2/2] test/service: fix race condition on stopping lcore
Commit Message
This commit fixes a potential race condition in the tests
where the lcore running a service would increment a counter
that was already reset by the test-suite thread. The value
incremented by this race could cause CI failures, as
indicated by DPDK's CI.

This patch fixes the race condition by making use of the
added rte_service_lcore_may_be_active() API, which indicates
when a service core is no longer in the service-core polling loop.

The unit test uses the above function to detect when all
statistics increments are done in the service-core thread,
and only then continues finalizing and checking state.

Fixes: f28f3594ded2 ("service: add attribute API")
Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
v6:
- Fix CI issue on C99 style loop initializer (David)
v4:
- Update test to new _may_be_ style API (Honnappa)
- Add reviewed by from ML
v3:
- Refactor while() to for() to simplify (Harry)
- Use SERVICE_DELAY instead of magic const 1 (Phil)
- Add Phil's reviewed by tag from ML
v2:
Thanks for the discussion on v1; this v2 is a fixup for the CI
and includes previous feedback from the ML.
---
app/test/test_service_cores.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
Comments
On Mon, Sep 14, 2020 at 4:30 PM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
> [...]
Series applied, thanks.
Hello Harry,
Long time no see :-)
On Mon, Sep 21, 2020 at 4:51 PM David Marchand
<david.marchand@redhat.com> wrote:
> [...]
We probably need a followup fix for:
https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088
The race is in service_attr_get where we look at/reset spent cycles
while a service lcore is still running.
Quoting this test code:
    rte_service_lcore_stop(slcore_id);
    TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_calls, &attr_value),
            "Valid attr_get() call didn't return success");
    TEST_ASSERT_EQUAL(1, (attr_value > 0),
            "attr_get() call didn't get call count (zero)");

    TEST_ASSERT_EQUAL(0, rte_service_attr_reset_all(id),
            "Valid attr_reset_all() return success");

    TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_id, &attr_value),
            "Valid attr_get() call didn't return success");
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Tuesday, October 13, 2020 8:45 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev <dev@dpdk.org>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Phil Yang <phil.yang@arm.com>; Aaron
> Conole <aconole@redhat.com>
> Subject: Re: [PATCH v6 2/2] test/service: fix race condition on stopping lcore
>
> Hello Harry,
>
> Long time no see :-)
>
> [...]
>
> We probably need a followup fix for:
> https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088
>
>
> The race is in service_attr_get where we look at/reset spent cycles
> while a service lcore is still running.
> Quoting this test code:
>
> rte_service_lcore_stop(slcore_id);
/* TODO: implement wait for slcore_id to stop polling here */
> TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_calls, &attr_value),
> "Valid attr_get() call didn't return success");
> TEST_ASSERT_EQUAL(1, (attr_value > 0),
> "attr_get() call didn't get call count (zero)");
>
> TEST_ASSERT_EQUAL(0, rte_service_attr_reset_all(id),
> "Valid attr_reset_all() return success");
>
> TEST_ASSERT_EQUAL(0, rte_service_attr_get(id, attr_id, &attr_value),
> "Valid attr_get() call didn't return success");
Based on the output you provided (https://travis-ci.com/github/DPDK/dpdk/jobs/398954463#L10088)
and the code above, it does seem that the core may not have stopped yet (race condition).
Will send a patch. Thanks for reporting with details, -Harry
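
As a rough illustration of the kind of wait the inline TODO above asks for, here is a minimal sketch of a bounded poll in plain C. It is not the follow-up patch and uses no DPDK calls: wait_until_inactive(), the two callback types, and the fake_core helpers are hypothetical names made up for this example. The predicate stands in for a "may be active" query and the delay callback for a fixed sleep between polls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types, not DPDK APIs. */
typedef int (*may_be_active_fn)(void *ctx); /* 1 while the core may be active */
typedef void (*delay_fn)(void *ctx);        /* one fixed delay between polls */

/*
 * Poll until the predicate reports inactive, giving up after max_polls
 * delays. Returns the number of delays performed, or -1 on timeout.
 */
static int wait_until_inactive(may_be_active_fn active, delay_fn delay,
		void *ctx, int max_polls)
{
	int i;

	for (i = 0; active(ctx) == 1 && i < max_polls; i++)
		delay(ctx);

	return active(ctx) == 0 ? i : -1;
}

/* Fake core for illustration: stays active for `remaining` more delays. */
struct fake_core {
	int remaining;
};

static int fake_active(void *ctx)
{
	return ((struct fake_core *)ctx)->remaining > 0;
}

static void fake_delay(void *ctx)
{
	((struct fake_core *)ctx)->remaining--;
}
```

A caller would only read or reset the statistics when the helper returns a non-negative value. Bounding the loop means a core that never stops fails the test instead of hanging it.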
@@ -362,6 +362,9 @@ service_lcore_attr_get(void)
 			"Service core add did not return zero");
 	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 1),
 			"Enabling valid service and core failed");
+	/* Ensure service is not active before starting */
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			"Not-active service core reported as active");
 	TEST_ASSERT_EQUAL(0, rte_service_lcore_start(slcore_id),
 			"Starting service core failed");

@@ -382,7 +385,23 @@ service_lcore_attr_get(void)
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");

-	rte_service_lcore_stop(slcore_id);
+	/* Ensure service is active */
+	TEST_ASSERT_EQUAL(1, rte_service_lcore_may_be_active(slcore_id),
+			"Active service core reported as not-active");
+
+	TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
+			"Disabling valid service and core failed");
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
+			"Failed to stop service lcore");
+
+	/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+	int i;
+	for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
+			i < 100; i++)
+		rte_delay_ms(SERVICE_DELAY);
+
+	TEST_ASSERT_EQUAL(0, rte_service_lcore_may_be_active(slcore_id),
+			"Service lcore not stopped after waiting.");

 	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
 			"Valid lcore_attr_reset_all() didn't return success");
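
To see why the stop call alone is not enough, the sequence in the hunk above (request stop, bounded wait, then reset) can be modelled outside DPDK with a plain POSIX thread. This is only a sketch under assumptions: run_requested, lcore_active, calls, worker() and stop_wait_reset() are hypothetical stand-ins invented for this example, not the service-core implementation, and the 1 ms sleeps are arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins for service-core state; not DPDK APIs. */
static atomic_int run_requested;   /* cleared by the control thread to stop */
static atomic_int lcore_active;    /* cleared by the worker on loop exit */
static atomic_uint_fast64_t calls; /* statistic the worker increments */

static void sleep_ms(long ms)
{
	struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };

	nanosleep(&ts, NULL);
}

/* Models the polling loop: one more increment can land after a stop request. */
static void *worker(void *arg)
{
	(void)arg;
	while (atomic_load(&run_requested)) {
		sleep_ms(1);                 /* models the service doing work */
		atomic_fetch_add(&calls, 1); /* the late increment is the hazard */
	}
	atomic_store(&lcore_active, 0);  /* no increments happen after this */
	return NULL;
}

/*
 * Start the worker, request a stop, wait (bounded) for it to leave its loop,
 * then reset the counter. Returns the counter value read after the reset
 * (0 means no late increment), or UINT64_MAX on setup failure or timeout.
 */
static uint64_t stop_wait_reset(void)
{
	pthread_t t;
	uint64_t after_reset;
	int i;

	atomic_store(&calls, 0);
	atomic_store(&run_requested, 1);
	atomic_store(&lcore_active, 1);
	if (pthread_create(&t, NULL, worker, NULL) != 0)
		return UINT64_MAX;
	sleep_ms(20);                    /* let the worker run a few iterations */

	atomic_store(&run_requested, 0); /* stopping is only a request */

	/* Bounded wait: up to 100 polls, 1 ms apart. */
	for (i = 0; atomic_load(&lcore_active) == 1 && i < 100; i++)
		sleep_ms(1);
	if (atomic_load(&lcore_active) != 0) {
		pthread_join(t, NULL);
		return UINT64_MAX;
	}

	atomic_store(&calls, 0);         /* safe: the worker has left its loop */
	sleep_ms(5);                     /* a late increment would show up here */
	after_reset = atomic_load(&calls);
	pthread_join(t, NULL);
	return after_reset;
}
```

In this model, dropping the bounded wait lets the worker's final increment land after the reset, which is the kind of non-zero value the commit message describes. With the wait in place, the value read after the reset stays at zero.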