[v2] test/service: fix spurious failures by extending timeout

Message ID 20221006082813.579255-1-harry.van.haaren@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v2] test/service: fix spurious failures by extending timeout

Checks

Context                          Check    Description
ci/checkpatch                    success  coding style OK
ci/iol-mellanox-Performance      success  Performance Testing PASS
ci/iol-intel-Performance         success  Performance Testing PASS
ci/iol-intel-Functional          success  Functional Testing PASS
ci/iol-aarch64-unit-testing      success  Testing PASS
ci/iol-x86_64-unit-testing       success  Testing PASS
ci/iol-x86_64-compile-testing    success  Testing PASS
ci/iol-aarch64-compile-testing   success  Testing PASS
ci/github-robot: build           success  github build: passed

Commit Message

Van Haaren, Harry Oct. 6, 2022, 8:28 a.m. UTC
This commit extends the timeout for service_may_be_active()
from 100ms to 1000ms. Local testing on an idle and loaded system
(compiling DPDK with all cores) always completes after 1 ms.

The wait time for a service-lcore to finish is also extended
from 100ms to 1000ms.

The same timeout waiting code was duplicated in two tests, and
is now refactored to a standalone function avoiding duplication.

Reported-by: David Marchand <david.marchand@redhat.com>
Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

Apologies for the quick respin noise; only the first diff-section
is added, no changes to the rest of the patch.

v2:
- v1 addressed only the test case 15 issue; v2 also addresses test
  case 5, which has a service-lcore wait code path.

---
 app/test/test_service_cores.c | 47 ++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 25 deletions(-)
  

Comments

David Marchand Oct. 6, 2022, 8:39 a.m. UTC | #1
On Thu, Oct 6, 2022 at 10:28 AM Harry van Haaren
<harry.van.haaren@intel.com> wrote:
>
> This commit extends the timeout for service_may_be_active()
> from 100ms to 1000ms. Local testing on an idle and loaded system
> (compiling DPDK with all cores) always completes after 1 ms.
>
> The wait time for a service-lcore to finish is also extended
> from 100ms to 1000ms.
>
> The same timeout waiting code was duplicated in two tests, and
> is now refactored to a standalone function avoiding duplication.
>
> Reported-by: David Marchand <david.marchand@redhat.com>
> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

Just to be sure, do we want such a timeout in the test logic itself?
Is it that you want to make sure that the synchronisation happens in a
"reasonable" (subject to discussion ;-)) amount of time?

Otherwise, the unit tests run in the CI are themselves subject to a
10s x multiplier timeout (-t meson test option).
And then I would rely on this overall timeout.
  
Mattias Rönnblom Oct. 6, 2022, 8:54 a.m. UTC | #2
On 2022-10-06 10:39, David Marchand wrote:
> On Thu, Oct 6, 2022 at 10:28 AM Harry van Haaren
> <harry.van.haaren@intel.com> wrote:
>>
>> This commit extends the timeout for service_may_be_active()
>> from 100ms to 1000ms. Local testing on an idle and loaded system
>> (compiling DPDK with all cores) always completes after 1 ms.
>>
>> The wait time for a service-lcore to finish is also extended
>> from 100ms to 1000ms.
>>
>> The same timeout waiting code was duplicated in two tests, and
>> is now refactored to a standalone function avoiding duplication.
>>
>> Reported-by: David Marchand <david.marchand@redhat.com>
>> Suggested-by: Mattias Ronnblom <mattias.ronnblom@ericsson.com>
>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> Just to be sure, do we want such a timeout in the test logic itself?

I think it depends on how quickly you want to produce a failure, and 
also if there are some follow-up tests in the same autotest that you 
want to proceed with, regardless of the outcome.

> Is it that you want to make sure that the synchronisation happens in a
> "reasonable" (subject to discussion ;-)) amount of time?
> 
> Otherwise, the unit tests run in the CI are themselves subject to a
> 10s x multiplier timeout (-t meson test option).
> And then I would rely on this overall timeout.
> 
>
  

Patch

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index 359b6dcd8b..4b147bd64c 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -123,14 +123,14 @@  unregister_all(void)
 	return TEST_SUCCESS;
 }
 
-/* Wait until service lcore not active, or for 100x SERVICE_DELAY */
+/* Wait until service lcore not active, or for N times SERVICE_DELAY */
 static void
 wait_slcore_inactive(uint32_t slcore_id)
 {
 	int i;
 
 	for (i = 0; rte_service_lcore_may_be_active(slcore_id) == 1 &&
-			i < 100; i++)
+			i < 1000; i++)
 		rte_delay_ms(SERVICE_DELAY);
 }
 
@@ -921,12 +921,26 @@  service_lcore_start_stop(void)
 	return unregister_all();
 }
 
+static int
+service_ensure_stopped_with_timeout(uint32_t sid)
+{
+	/* give the service time to stop running */
+	int32_t timeout_ms = 1000;
+	int i;
+	for (i = 0; i < timeout_ms; i++) {
+		if (!rte_service_may_be_active(sid))
+			break;
+		rte_delay_ms(SERVICE_DELAY);
+	}
+
+	return rte_service_may_be_active(sid);
+}
+
 /* stop a service and wait for it to become inactive */
 static int
 service_may_be_active(void)
 {
 	const uint32_t sid = 0;
-	int i;
 
 	/* expected failure cases */
 	TEST_ASSERT_EQUAL(-EINVAL, rte_service_may_be_active(10000),
@@ -946,19 +960,11 @@  service_may_be_active(void)
 	TEST_ASSERT_EQUAL(1, service_lcore_running_check(),
 			"Service core expected to poll service but it didn't");
 
-	/* stop the service */
+	/* stop the service, and wait for not-active with timeout */
 	TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
 			"Error: Service stop returned non-zero");
-
-	/* give the service 100ms to stop running */
-	for (i = 0; i < 100; i++) {
-		if (!rte_service_may_be_active(sid))
-			break;
-		rte_delay_ms(SERVICE_DELAY);
-	}
-
-	TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
-			  "Error: Service not stopped after 100ms");
+	TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+			  "Error: Service not stopped after timeout period.");
 
 	return unregister_all();
 }
@@ -972,7 +978,6 @@  service_active_two_cores(void)
 		return TEST_SKIPPED;
 
 	const uint32_t sid = 0;
-	int i;
 
 	uint32_t lcore = rte_get_next_lcore(/* start core */ -1,
 					    /* skip main */ 1,
@@ -1002,16 +1007,8 @@  service_active_two_cores(void)
 	/* stop the service */
 	TEST_ASSERT_EQUAL(0, rte_service_runstate_set(sid, 0),
 			"Error: Service stop returned non-zero");
-
-	/* give the service 100ms to stop running */
-	for (i = 0; i < 100; i++) {
-		if (!rte_service_may_be_active(sid))
-			break;
-		rte_delay_ms(SERVICE_DELAY);
-	}
-
-	TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
-			  "Error: Service not stopped after 100ms");
+	TEST_ASSERT_EQUAL(0, service_ensure_stopped_with_timeout(sid),
+			  "Error: Service not stopped after timeout period.");
 
 	return unregister_all();
 }
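
Note on the polling pattern: the helper in the patch bounds the wait by an
iteration count, so the effective timeout is the count multiplied by
SERVICE_DELAY plus whatever each check costs. As a rough illustration only,
the same "poll until inactive or give up" idea can be written against a
wall-clock deadline. This sketch is not DPDK code and is not part of the
patch: it assumes a POSIX system, uses clock_gettime() and nanosleep()
instead of rte_delay_ms(), and every name in it (is_active_fn, now_ms,
wait_until_inactive, countdown_active) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical predicate: returns true while the polled thing is active. */
typedef bool (*is_active_fn)(void *ctx);

/* Monotonic clock reading in milliseconds. */
static int64_t
now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll is_active() every poll_ms milliseconds until it reports false or
 * until timeout_ms of monotonic time has elapsed. Returns the final state,
 * so false means "stopped" and true means "still active after the timeout".
 */
static bool
wait_until_inactive(is_active_fn is_active, void *ctx,
		int64_t timeout_ms, long poll_ms)
{
	assert(is_active != NULL);
	assert(poll_ms > 0 && poll_ms < 1000);

	const struct timespec pause = {
		.tv_sec = 0,
		.tv_nsec = poll_ms * 1000000L,
	};
	const int64_t deadline = now_ms() + timeout_ms;

	while (is_active(ctx) && now_ms() < deadline)
		nanosleep(&pause, NULL);

	return is_active(ctx);
}

/* Stand-in for a service: reports active for a fixed number of checks. */
static bool
countdown_active(void *ctx)
{
	int *remaining = ctx;

	if (*remaining <= 0)
		return false;
	(*remaining)--;
	return true;
}
```

A caller would pass a predicate and a context, for example
wait_until_inactive(countdown_active, &polls, 1000, 1), and treat a true
return value as the failure case, mirroring how the test asserts that the
helper returns 0. Whether a deadline is preferable to an iteration count in
the unit test itself is a judgment call; the count-based loop is simpler
and is what the patch uses.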