[v1] examples/power: fix oob frequency oscillations

Message ID 20190724131803.30066-1-david.hunt@intel.com (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Series [v1] examples/power: fix oob frequency oscillations

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-Compile-Testing success Compile Testing PASS
ci/Intel-compilation success Compilation OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS

Commit Message

Hunt, David July 24, 2019, 1:18 p.m. UTC
  The branch ratio algorithm in the vm_power_manager sample application
can be very sensitive at particular loads in a workload, causing
oscillations between min and max frequency. For example, if a
workload is at 50%, scaling up may change the ratio
enough that it immediately thinks it needs to scale down again.

This patch introduces a sliding window recording the scale up/down
direction for the last 32 samples, and scales up if any sample indicates
we should scale up, otherwise scales down. Each core has its own window.

Fixes: 4b1a631b8a8a ("examples/vm_power: add oob monitoring functions")
Cc: stable@dpdk.org
Signed-off-by: David Hunt <david.hunt@intel.com>
---
 examples/vm_power_manager/oob_monitor_x86.c | 34 +++++++++++++++++++--
 examples/vm_power_manager/power_manager.c   |  3 +-
 examples/vm_power_manager/power_manager.h   | 12 ++++++++
 3 files changed, 44 insertions(+), 5 deletions(-)
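
In outline, the fix replaces the instantaneous threshold comparison with a
vote over the last 32 samples. Below is a condensed, self-contained sketch of
that logic: the struct fields and enum values mirror the patch, while
freq_window_update() and the main() harness with its 50% load trace are
illustrative additions, not code from the patch.

#include <stdint.h>
#include <stdio.h>

#define FREQ_WINDOW_SIZE 32	/* power of two: the index wraps via a mask */

enum { FREQ_UNKNOWN, FREQ_MIN, FREQ_MAX };

struct core_details {
	uint16_t freq_directions[FREQ_WINDOW_SIZE]; /* 1 = sample voted "up" */
	uint16_t freq_window_idx;
	uint16_t freq_state;
};

/* Record one branch-ratio sample and return the frequency the window
 * demands: FREQ_MAX if any of the last 32 samples voted "up", FREQ_MIN
 * only when all of them voted "down". */
static uint16_t
freq_window_update(struct core_details *cd, float ratio, float threshold)
{
	int i, up_count = 0;

	cd->freq_directions[cd->freq_window_idx] = (ratio > threshold) ? 1 : 0;
	cd->freq_window_idx = (cd->freq_window_idx + 1) & (FREQ_WINDOW_SIZE - 1);

	for (i = 0; i < FREQ_WINDOW_SIZE; i++)
		up_count += cd->freq_directions[i];

	return up_count ? FREQ_MAX : FREQ_MIN;
}

int main(void)
{
	struct core_details cd = { {0}, 0, FREQ_UNKNOWN };
	int i;

	/* A 50% load: the ratio alternates around a threshold of 50,
	 * which made the old code bounce min/max on every sample. */
	for (i = 0; i < 64; i++) {
		uint16_t want = freq_window_update(&cd,
				(i & 1) ? 60.0f : 40.0f, 50.0f);

		if (want != cd.freq_state) {
			printf("sample %2d: switch to %s\n", i,
			       want == FREQ_MAX ? "FREQ_MAX" : "FREQ_MIN");
			cd.freq_state = want;
		}
	}
	return 0;
}

Compiled standalone, this prints one switch to FREQ_MIN at sample 0 and one
to FREQ_MAX at sample 1, then nothing further: the window absorbs the
alternating ratios instead of oscillating.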
  

Comments

Anatoly Burakov July 26, 2019, 10:15 a.m. UTC | #1
On 24-Jul-19 2:18 PM, David Hunt wrote:
> The branch ratio algorithm in the vm_power_manager sample application
> can be very sensitive at particular loads in a workload, causing
> oscillations between min and max frequency. For example, if a
> workload is at 50%, scaling up may change the ratio
> enough that it immediately thinks it needs to scale down again.
> 
> This patch introduces a sliding window recording the scale up/down
> direction for the last 32 samples, and scales up if any sample indicates
> we should scale up, otherwise scales down. Each core has its own window.
> 
> Fixes: 4b1a631b8a8a ("examples/vm_power: add oob monitoring functions")
> Cc: stable@dpdk.org
> Signed-off-by: David Hunt <david.hunt@intel.com>
> ---
>   examples/vm_power_manager/oob_monitor_x86.c | 34 +++++++++++++++++++--
>   examples/vm_power_manager/power_manager.c   |  3 +-
>   examples/vm_power_manager/power_manager.h   | 12 ++++++++
>   3 files changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/examples/vm_power_manager/oob_monitor_x86.c b/examples/vm_power_manager/oob_monitor_x86.c
> index ebd96b205..aecfcb2eb 100644
> --- a/examples/vm_power_manager/oob_monitor_x86.c
> +++ b/examples/vm_power_manager/oob_monitor_x86.c
> @@ -39,6 +39,7 @@ apply_policy(int core)
>   	int64_t hits_diff, miss_diff;
>   	float ratio;
>   	int ret;
> +	int freq_window_idx, up_count = 0, i;
>   
>   	g_active = 0;
>   	ci = get_core_info();
> @@ -101,10 +102,37 @@ apply_policy(int core)
>   
>   	ratio = (float)miss_diff * (float)100 / (float)hits_diff;
>   
> -	if (ratio < ci->branch_ratio_threshold)
> -		power_manager_scale_core_min(core);
> +	/*
> +	 * Store the last few directions that the ratio indicates
> +	 * we should take. If there's on 'up', then we scale up

One?

> +	 * quickly. If all indicate 'down', only then do we scale
> +	 * down. Each core_details struct has it's own array.
> +	 */
> +	freq_window_idx = ci->cd[core].freq_window_idx;
> +	if (ratio > ci->branch_ratio_threshold)
> +		ci->cd[core].freq_directions[freq_window_idx] = 1;
>   	else
> -		power_manager_scale_core_max(core);
> +		ci->cd[core].freq_directions[freq_window_idx] = 0;
> +
> +	freq_window_idx++;
> +	freq_window_idx = freq_window_idx & (FREQ_WINDOW_SIZE-1);
> +	ci->cd[core].freq_window_idx = freq_window_idx;
> +
> +	up_count = 0;
> +	for (i = 0; i < FREQ_WINDOW_SIZE; i++)
> +		up_count +=  ci->cd[core].freq_directions[i];
> +
> +	if (up_count == 0) {
> +		if (ci->cd[core].freq_state != FREQ_MIN) {
> +			power_manager_scale_core_min(core);
> +			ci->cd[core].freq_state = FREQ_MIN;
> +		}
> +	} else {
> +		if (ci->cd[core].freq_state != FREQ_MAX) {
> +			power_manager_scale_core_max(core);
> +			ci->cd[core].freq_state = FREQ_MAX;
> +		}
> +	}

So it's biased towards scaling up quickly, but it's doing that over a 
period. Please correct me if i'm wrong as i'm not really familiar with 
this codebase, but, assuming the window size is long enough, you could 
be missing opportunities to scale down? For example, if you get a short 
burst of 1's followed by a long burst of zeroes, you're not scaling down 
until you go through the entire buffer and overwrite all of the values. 
I guess that's the point of oscillation prevention, but maybe you could 
improve the "scale-up" part by only checking a few recent values, rather 
than the entire buffer?
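
A minimal sketch of the variant suggested here, checking only the most
recent few samples on the scale-up side while leaving the full-window
scale-down test alone. RECENT_UP_SAMPLES and recent_up_votes() are
hypothetical names, not part of the applied patch; FREQ_WINDOW_SIZE and the
freq_directions ring are the ones the patch adds to power_manager.h.

#include <stdint.h>

#define FREQ_WINDOW_SIZE  32	/* as added to power_manager.h */
#define RECENT_UP_SAMPLES  4	/* hypothetical tuning knob */

/* Count "up" votes among only the newest RECENT_UP_SAMPLES entries.
 * next_idx is the value apply_policy() leaves behind, i.e. one slot
 * past the newest sample; adding FREQ_WINDOW_SIZE before subtracting
 * keeps the arithmetic in range for the mask. */
static int
recent_up_votes(const uint16_t *dirs, uint16_t next_idx)
{
	int i, up = 0;

	for (i = 1; i <= RECENT_UP_SAMPLES; i++)
		up += dirs[(next_idx + FREQ_WINDOW_SIZE - i) &
			   (FREQ_WINDOW_SIZE - 1)];
	return up;
}

The scale-down decision would still sum the whole ring, so the oscillation
damping is preserved while an old burst of 'up' votes could no longer pin
the core at max for the full window.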
  
Thomas Monjalon Aug. 6, 2019, 11:18 a.m. UTC | #2
26/07/2019 12:15, Burakov, Anatoly:
> So it's biased towards scaling up quickly, but it's doing that over a 
> period. Please correct me if i'm wrong as i'm not really familiar with 
> this codebase, but, assuming the window size is long enough, you could 
> be missing opportunities to scale down? For example, if you get a short 
> burst of 1's followed by a long burst of zeroes, you're not scaling down 
> until you go through the entire buffer and overwrite all of the values. 
> I guess that's the point of oscillation prevention, but maybe you could 
> improve the "scale-up" part by only checking a few recent values, rather 
> than the entire buffer?

This patch is deferred to 19.11.
  
Thomas Monjalon Oct. 27, 2019, 6:35 p.m. UTC | #3
06/08/2019 13:18, Thomas Monjalon:
> 26/07/2019 12:15, Burakov, Anatoly:
> > So it's biased towards scaling up quickly, but it's doing that over a 
> > period. Please correct me if i'm wrong as i'm not really familiar with 
> > this codebase, but, assuming the window size is long enough, you could 
> > be missing opportunities to scale down? For example, if you get a short 
> > burst of 1's followed by a long burst of zeroes, you're not scaling down 
> > until you go through the entire buffer and overwrite all of the values. 
> > I guess that's the point of oscillation prevention, but maybe you could 
> > improve the "scale-up" part by only checking a few recent values, rather 
> > than the entire buffer?
> 
> This patch is deferred to 19.11.

Any news for this patch?
  
Hunt, David Oct. 29, 2019, 2:05 p.m. UTC | #4
On 27/10/2019 18:35, Thomas Monjalon wrote:
> 06/08/2019 13:18, Thomas Monjalon:
>> 26/07/2019 12:15, Burakov, Anatoly:
>>> So it's biased towards scaling up quickly, but it's doing that over a
>>> period. Please correct me if i'm wrong as i'm not really familiar with
>>> this codebase, but, assuming the window size is long enough, you could
>>> be missing opportunities to scale down? For example, if you get a short
>>> burst of 1's followed by a long burst of zeroes, you're not scaling down
>>> until you go through the entire buffer and overwrite all of the values.
>>> I guess that's the point of oscillation prevention, but maybe you could
>>> improve the "scale-up" part by only checking a few recent values, rather
>>> than the entire buffer?
>> This patch is deferred to 19.11.
> Any news for this patch?
>
The algorithm was intended to be biased (strongly) towards the scale-up, 
for performance reasons. If there is a single "scale-up" in the entire 
array, then we stay up until the entire array agrees that we can scale 
down. If the user wants to relax this, then simply reduce the size of 
the array, which will have the same effect. But I had tested it with an 
array size of 32, and that gave the best results for my use cases.
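
A caveat for anyone tuning the array size as suggested above: apply_policy()
wraps the index with a mask rather than a modulo, so FREQ_WINDOW_SIZE must
remain a power of two. The example sizes below are illustrative.

/* Index wrap in apply_policy() is a mask, not a modulo:
 *     freq_window_idx = freq_window_idx & (FREQ_WINDOW_SIZE - 1);
 */
#define FREQ_WINDOW_SIZE 16	/* OK: halves the worst-case scale-down delay */
/* #define FREQ_WINDOW_SIZE 24	   broken: the mask never wraps back to
 *				   slot 0, so stale votes are kept forever */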
  
Thomas Monjalon Nov. 1, 2019, 11 p.m. UTC | #5
29/10/2019 15:05, Hunt, David:
> On 27/10/2019 18:35, Thomas Monjalon wrote:
> > 06/08/2019 13:18, Thomas Monjalon:
> >> 26/07/2019 12:15, Burakov, Anatoly:
> >>> So it's biased towards scaling up quickly, but it's doing that over a
> >>> period. Please correct me if i'm wrong as i'm not really familiar with
> >>> this codebase, but, assuming the window size is long enough, you could
> >>> be missing opportunities to scale down? For example, if you get a short
> >>> burst of 1's followed by a long burst of zeroes, you're not scaling down
> >>> until you go through the entire buffer and overwrite all of the values.
> >>> I guess that's the point of oscillation prevention, but maybe you could
> >>> improve the "scale-up" part by only checking a few recent values, rather
> >>> than the entire buffer?
> >> This patch is deferred to 19.11.
> > Any news for this patch?
> >
> The algorithm was intended to be biased (strongly) towards the scale-up, 
> for performance reasons. If there is a single "scale-up" in the entire 
> array, then we stay up until the entire array agrees that we can scale 
> down. If the user wants to relax this, then simply reduce the size of 
> the array, which will have the same effect. But I had tested it with an 
> array size of 32, and that gave the best results for my use cases.

I'm not sure I understand. Is the patch rejected?
  
Anatoly Burakov Nov. 4, 2019, 10:16 a.m. UTC | #6
On 01-Nov-19 11:00 PM, Thomas Monjalon wrote:
> 29/10/2019 15:05, Hunt, David:
>> On 27/10/2019 18:35, Thomas Monjalon wrote:
>>> 06/08/2019 13:18, Thomas Monjalon:
>>>> 26/07/2019 12:15, Burakov, Anatoly:
>>>>> So it's biased towards scaling up quickly, but it's doing that over a
>>>>> period. Please correct me if i'm wrong as i'm not really familiar with
>>>>> this codebase, but, assuming the window size is long enough, you could
>>>>> be missing opportunities to scale down? For example, if you get a short
>>>>> burst of 1's followed by a long burst of zeroes, you're not scaling down
>>>>> until you go through the entire buffer and overwrite all of the values.
>>>>> I guess that's the point of oscillation prevention, but maybe you could
>>>>> improve the "scale-up" part by only checking a few recent values, rather
>>>>> than the entire buffer?
>>>> This patch is deferred to 19.11.
>>> Any news for this patch?
>>>
>> The algorithm was intended to be biased (strongly) towards the scale-up,
>> for performance reasons. If there is a single "scale-up" in the entire
>> array, then we stay up until the entire array agrees that we can scale
>> down. If the user wants to relax this, then simply reduce the size of
>> the array, which will have the same effect. But I had tested it with an
>> array size of 32, and that gave the best results for my use cases.
> 
> I'm not sure I understand. Is the patch rejected?
> 

I believe he was responding to my question about the algorithm's bias. 
Now that the matter is resolved,

Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
  
Thomas Monjalon Nov. 12, 2019, 7:23 a.m. UTC | #7
24/07/2019 15:18, David Hunt:
> The branch ratio algorithm in the vm_power_manager sample application
> can be very sensitive at particular loads in a workload, causing
> oscillations between min and max frequency. For example, if a
> workload is at 50%, scaling up may change the ratio
> enough that it immediately thinks it needs to scale down again.
> 
> This patch introduces a sliding window recording the scale up/down
> direction for the last 32 samples, and scales up if any sample indicates
> we should scale up, otherwise scales down. Each core has its own window.
> 
> Fixes: 4b1a631b8a8a ("examples/vm_power: add oob monitoring functions")
> Cc: stable@dpdk.org
> Signed-off-by: David Hunt <david.hunt@intel.com>

Applied, thanks
  

Patch

diff --git a/examples/vm_power_manager/oob_monitor_x86.c b/examples/vm_power_manager/oob_monitor_x86.c
index ebd96b205..aecfcb2eb 100644
--- a/examples/vm_power_manager/oob_monitor_x86.c
+++ b/examples/vm_power_manager/oob_monitor_x86.c
@@ -39,6 +39,7 @@  apply_policy(int core)
 	int64_t hits_diff, miss_diff;
 	float ratio;
 	int ret;
+	int freq_window_idx, up_count = 0, i;
 
 	g_active = 0;
 	ci = get_core_info();
@@ -101,10 +102,37 @@  apply_policy(int core)
 
 	ratio = (float)miss_diff * (float)100 / (float)hits_diff;
 
-	if (ratio < ci->branch_ratio_threshold)
-		power_manager_scale_core_min(core);
+	/*
+	 * Store the last few directions that the ratio indicates
+	 * we should take. If there's one 'up', then we scale up
+	 * quickly. If all indicate 'down', only then do we scale
+	 * down. Each core_details struct has its own array.
+	 */
+	freq_window_idx = ci->cd[core].freq_window_idx;
+	if (ratio > ci->branch_ratio_threshold)
+		ci->cd[core].freq_directions[freq_window_idx] = 1;
 	else
-		power_manager_scale_core_max(core);
+		ci->cd[core].freq_directions[freq_window_idx] = 0;
+
+	freq_window_idx++;
+	freq_window_idx = freq_window_idx & (FREQ_WINDOW_SIZE-1);
+	ci->cd[core].freq_window_idx = freq_window_idx;
+
+	up_count = 0;
+	for (i = 0; i < FREQ_WINDOW_SIZE; i++)
+		up_count +=  ci->cd[core].freq_directions[i];
+
+	if (up_count == 0) {
+		if (ci->cd[core].freq_state != FREQ_MIN) {
+			power_manager_scale_core_min(core);
+			ci->cd[core].freq_state = FREQ_MIN;
+		}
+	} else {
+		if (ci->cd[core].freq_state != FREQ_MAX) {
+			power_manager_scale_core_max(core);
+			ci->cd[core].freq_state = FREQ_MAX;
+		}
+	}
 
 	g_active = 1;
 	return ratio;
diff --git a/examples/vm_power_manager/power_manager.c b/examples/vm_power_manager/power_manager.c
index 9d4e587b0..7b4f4b3c4 100644
--- a/examples/vm_power_manager/power_manager.c
+++ b/examples/vm_power_manager/power_manager.c
@@ -62,14 +62,13 @@  core_info_init(void)
 	ci->core_count = get_nprocs_conf();
 	ci->branch_ratio_threshold = BRANCH_RATIO_THRESHOLD;
 	ci->cd = malloc(ci->core_count * sizeof(struct core_details));
 	if (!ci->cd) {
 		RTE_LOG(ERR, POWER_MANAGER, "Failed to allocate memory for core info.");
 		return -1;
 	}
+	memset(ci->cd, 0, ci->core_count * sizeof(struct core_details));
 	for (i = 0; i < ci->core_count; i++) {
 		ci->cd[i].global_enabled_cpus = 1;
-		ci->cd[i].oob_enabled = 0;
-		ci->cd[i].msr_fd = 0;
 	}
 	printf("%d cores in system\n", ci->core_count);
 	return 0;
diff --git a/examples/vm_power_manager/power_manager.h b/examples/vm_power_manager/power_manager.h
index e81a60ae5..e324766b6 100644
--- a/examples/vm_power_manager/power_manager.h
+++ b/examples/vm_power_manager/power_manager.h
@@ -8,12 +8,24 @@ 
 #ifdef __cplusplus
 extern "C" {
 #endif
+
+#define FREQ_WINDOW_SIZE 32
+
+enum {
+	FREQ_UNKNOWN,
+	FREQ_MIN,
+	FREQ_MAX
+};
+
 struct core_details {
 	uint64_t last_branches;
 	uint64_t last_branch_misses;
 	uint16_t global_enabled_cpus;
 	uint16_t oob_enabled;
 	int msr_fd;
+	uint16_t freq_directions[FREQ_WINDOW_SIZE];
+	uint16_t freq_window_idx;
+	uint16_t freq_state;
 };
 
 struct core_info {
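
Taken together, the hunks give each core a small state machine: freq_state
starts at FREQ_UNKNOWN (0) thanks to the memset in core_info_init(), so the
first decision always issues exactly one scaling call. An illustrative trace
with a threshold of 50 (the ratio values are assumed) shows both the quick
scale-up and the full-window scale-down latency discussed in the review:

  sample   ratio   vote   up_count   action
  ------   -----   ----   --------   --------------------------------
       1      60      1          1   power_manager_scale_core_max()
    2-32      40      0          1   none: slot 0 still holds the 'up'
      33      40      0          0   power_manager_scale_core_min()

The single 'up' at sample 1 is only overwritten when the index wraps back to
slot 0 at sample 33, so a lone above-threshold sample keeps the core at
maximum frequency for a further 32 samples. That is exactly the strong
scale-up bias described in the thread above.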