[3/6] service: reduce average case service core overhead

Message ID 20220906161352.296110-3-mattias.ronnblom@ericsson.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Series [1/6] service: reduce statistics overhead for parallel services

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Mattias Rönnblom Sept. 6, 2022, 4:13 p.m. UTC
Optimize the service loop so that it starts at the lowest-indexed
service mapped to the lcore in question and terminates at the
highest-indexed service.

While the worst case latency remains the same, this patch
significantly reduces the service framework overhead for the average
case. In particular, scenarios where an lcore only runs a single
service, or multiple services whose id values are close (e.g., three
services with ids 17, 18 and 22), show significant improvements.

The worst case is where an lcore has two services mapped to it: one
with service id 0 and the other with id 63.
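
To illustrate the bound computation (an editorial sketch mirroring the
patch's use of the GCC builtins, not part of the patch itself): for a
mask with bits 17, 18 and 22 set, the loop scans only ids [17, 23)
instead of [0, 64):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* services with ids 17, 18 and 22 mapped to this lcore */
		uint64_t service_mask =
			(1UL << 17) | (1UL << 18) | (1UL << 22);

		/* index of the lowest set bit: 17 */
		uint8_t start_id = __builtin_ctzl(service_mask);
		/* one past the highest set bit: 64 - 41 = 23 */
		uint8_t end_id = 64 - __builtin_clzl(service_mask);

		printf("scan [%d, %d) instead of [0, 64)\n",
		       start_id, end_id);
		return 0;
	}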

On a service lcore serving a single service, the service loop overhead
is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
since the loop included load-acquire atomic operations.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
  

Comments

Van Haaren, Harry Oct. 3, 2022, 1:33 p.m. UTC | #1
> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Tuesday, September 6, 2022 5:14 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>
> Subject: [PATCH 3/6] service: reduce average case service core overhead
> 
> Optimize service loop so that the starting point is the lowest-indexed
> service mapped to the lcore in question, and terminate the loop at the
> highest-indexed service.
> 
> While the worst case latency remains the same, this patch
> significantly reduces the service framework overhead for the average
> case. In particular, scenarios where an lcore only runs a single
> service, or multiple services whose id values are close (e.g., three
> services with ids 17, 18 and 22), show significant improvements.
> 
> The worst case is where an lcore has two services mapped to it: one
> with service id 0 and the other with id 63.

I like the optimization - nice work. There is one caveat: with the
__builtin_ctzl() call, RTE_SERVICE_NUM_MAX *must* be 64 or lower.
Today it is defined as 64, but we must ensure that this value cannot
be changed "by accident" without explicit compilation failures and a
comment explaining that fact.

There are likely options around making it runtime-dynamic, but I don't
think the complexity is justified: suggest we use a compile-time
BUILD_BUG_ON() check and error if it's > 64?
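
One way to express that guard (a sketch only, assuming DPDK's existing
RTE_BUILD_BUG_ON() macro from rte_common.h; the exact placement is up
to the author):

	#include <rte_common.h>  /* RTE_BUILD_BUG_ON() */
	#include <rte_service.h> /* RTE_SERVICE_NUM_MAX */

	/* The macro expands to a statement, so it has to sit inside a
	 * function, e.g. early in an init path in rte_service.c. The
	 * bitmask-based loop bounds rely on all service ids fitting
	 * into a uint64_t; fail the build if RTE_SERVICE_NUM_MAX is
	 * ever raised past 64. */
	RTE_BUILD_BUG_ON(RTE_SERVICE_NUM_MAX > 64);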

Note that in rte_service_component_register() we *re-use* IDs when they
become available, so we can have up to 64 active services at a time, but
they can be registered/unregistered more times than that. Continually
registering and unregistering services is a very unlikely usage of the
services API, though.
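
For illustration, a hypothetical snippet showing that a freed id can be
handed out again (the register/unregister functions are the real
component API; the demo wrapper and callback are made up for this
sketch):

	#include <rte_service_component.h>

	static int32_t
	dummy_cb(void *args)
	{
		(void)args;
		return 0;
	}

	static void
	demo_id_reuse(void)
	{
		struct rte_service_spec spec = {
			.name = "demo",
			.callback = dummy_cb,
		};
		uint32_t first_id, second_id;

		rte_service_component_register(&spec, &first_id);
		rte_service_component_unregister(first_id);

		/* The freed slot is available again, so a process can
		 * go through more than 64 register/unregister cycles
		 * even though at most 64 services are active at once. */
		rte_service_component_register(&spec, &second_id);
	}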

With the BUILD_BUG_ON() check around the 64 MAX value, and a comment:
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>


> On a service lcore serving a single service, the service loop overhead
> is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
> Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
> since the loop included load-acquire atomic operations.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  lib/eal/common/rte_service.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index 87df04e3ac..4cac866792 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -464,7 +464,6 @@ static int32_t
>  service_runner_func(void *arg)
>  {
>  	RTE_SET_USED(arg);
> -	uint32_t i;
>  	const int lcore = rte_lcore_id();
>  	struct core_state *cs = &lcore_states[lcore];
> 
> @@ -478,10 +477,17 @@ service_runner_func(void *arg)
>  			RUNSTATE_RUNNING) {
> 
>  		const uint64_t service_mask = cs->service_mask;
> +		uint8_t start_id;
> +		uint8_t end_id;
> +		uint8_t i;
> 
> -		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> -			if (!service_registered(i))
> -				continue;
> +		if (service_mask == 0)
> +			continue;
> +
> +		start_id = __builtin_ctzl(service_mask);
> +		end_id = 64 - __builtin_clzl(service_mask);
> +
> +		for (i = start_id; i < end_id; i++) {
>  			/* return value ignored as no change to code flow */
>  			service_run(i, cs, service_mask, service_get(i), 1);
>  		}
  
Mattias Rönnblom Oct. 3, 2022, 2:32 p.m. UTC | #2
On 2022-10-03 15:33, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Sent: Tuesday, September 6, 2022 5:14 PM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>
>> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
>> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
>> mattias.ronnblom <mattias.ronnblom@ericsson.com>
>> Subject: [PATCH 3/6] service: reduce average case service core overhead
>>
>> Optimize service loop so that the starting point is the lowest-indexed
>> service mapped to the lcore in question, and terminate the loop at the
>> highest-indexed service.
>>
>> While the worst case latency remains the same, this patch
>> significantly reduces the service framework overhead for the average
>> case. In particular, scenarios where an lcore only runs a single
>> service, or multiple services whose id values are close (e.g., three
>> services with ids 17, 18 and 22), show significant improvements.
>>
>> The worst case is where an lcore has two services mapped to it: one
>> with service id 0 and the other with id 63.
> 
> I like the optimization - nice work. There is one caveat: with the
> __builtin_ctzl() call, RTE_SERVICE_NUM_MAX *must* be 64 or lower.
> Today it is defined as 64, but we must ensure that this value cannot
> be changed "by accident" without explicit compilation failures and a
> comment explaining that fact.
> 
> There are likely options around making it runtime-dynamic, but I don't
> think the complexity is justified: suggest we use a compile-time
> BUILD_BUG_ON() check and error if it's > 64?
> 

Sounds like a good idea. The limitation is not new, though; the use of
a uint64_t-based bitmask already limits the number of services to 64.

> Note that in rte_service_component_register() we *re-use* IDs when they
> become available, so we can have up to 64 active services at a time, but
> they can be registered/unregistered more times than that. Continually
> registering and unregistering services is a very unlikely usage of the
> services API, though.
> 
> With the BUILD_BUG_ON() check around the 64 MAX value, and a comment:
> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
Thanks for your reviews Harry.

> 
>> On a service lcore serving a single service, the service loop overhead
>> is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
>> Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
>> since the loop included load-acquire atomic operations.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   lib/eal/common/rte_service.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
>> index 87df04e3ac..4cac866792 100644
>> --- a/lib/eal/common/rte_service.c
>> +++ b/lib/eal/common/rte_service.c
>> @@ -464,7 +464,6 @@ static int32_t
>>   service_runner_func(void *arg)
>>   {
>>   	RTE_SET_USED(arg);
>> -	uint32_t i;
>>   	const int lcore = rte_lcore_id();
>>   	struct core_state *cs = &lcore_states[lcore];
>>
>> @@ -478,10 +477,17 @@ service_runner_func(void *arg)
>>   			RUNSTATE_RUNNING) {
>>
>>   		const uint64_t service_mask = cs->service_mask;
>> +		uint8_t start_id;
>> +		uint8_t end_id;
>> +		uint8_t i;
>>
>> -		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>> -			if (!service_registered(i))
>> -				continue;
>> +		if (service_mask == 0)
>> +			continue;
>> +
>> +		start_id = __builtin_ctzl(service_mask);
>> +		end_id = 64 - __builtin_clzl(service_mask);
>> +
>> +		for (i = start_id; i < end_id; i++) {
>>   			/* return value ignored as no change to code flow */
>>   			service_run(i, cs, service_mask, service_get(i), 1);
>>   		}
>
  

Patch

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 87df04e3ac..4cac866792 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -464,7 +464,6 @@  static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
@@ -478,10 +477,17 @@  service_runner_func(void *arg)
 			RUNSTATE_RUNNING) {
 
 		const uint64_t service_mask = cs->service_mask;
+		uint8_t start_id;
+		uint8_t end_id;
+		uint8_t i;
 
-		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-			if (!service_registered(i))
-				continue;
+		if (service_mask == 0)
+			continue;
+
+		start_id = __builtin_ctzl(service_mask);
+		end_id = 64 - __builtin_clzl(service_mask);
+
+		for (i = start_id; i < end_id; i++) {
 			/* return value ignored as no change to code flow */
 			service_run(i, cs, service_mask, service_get(i), 1);
 		}