[dpdk-dev,v1,1/2] lib/librte_power: traffic pattern aware power control

Message ID 1528451833-3617-1-git-send-email-liang.j.ma@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [dpdk-dev,v1,1/2] lib/librte_power: traffic pattern aware power control |

Checks

Context Check Description
ci/Intel-compilation fail Compilation issues
ci/checkpatch warning coding style issues

Commit Message

Liang, Ma June 8, 2018, 9:57 a.m. UTC
  1. Abstract

For packet processing workloads such as DPDK polling is continuous.
This means CPU cores always show 100% busy independent of how much work
those cores are doing. It is critical to accurately determine how busy
a core is hugely important for the following reasons:

   * No indication of overload conditions

   * User do not know how much real load is on a system meaning resulted in
     wasted energy as no power management is utilized

Tried and failed schemes include calculating the cycles required from
the load on the core, in other words the business. For example,
how many cycles it costs to handle each packet and determining the frequency
cost per core. Due to the varying nature of traffic, types of frames and cost
in cycles to process, this mechanism becomes complex quickly where
a simple scheme is required to solve the problems.

2. Proposed solution

For all polling mechanism, the proposed solution focus on how many times
empty poll executed instead of calculating how many cycles it cost to handle
each packet. The less empty poll number means current core is busy with
processing workload, therefore,  the higher frequency is needed. The high
empty poll number indicate current core has lots spare time, therefore,
we can lower the frequency.

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. There is no traffic pass-through,
	  the system will test average empty poll numbers  with LOW/MED/HIGH
	  power state. Those average empty poll numbers  will be the baseline
	  for the normal phase. The system will collect all core's counter
	  every 100ms. The Training phase will take 5 seconds.

	b.Normal phase. When the real traffic pass-though, the system will
	  compare run-time empty poll moving average value with base line
	  then make decision to move to HIGH power state of MED  power state.
	  The system will collect all core's counter every 100ms.

3. Proposed  API

1.  rte_empty_poll_stat_init(void);
which is used to initialize the power management system.
 
2.  rte_empty_poll_stat_free(void);
which is used to free the resource hold by power management system.
 
3.  rte_empty_poll_stat_update(unsigned int lcore_id);
which is used to update specific core empty poll counter, not thread safe
 
4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update specific core valid poll counter, not thread safe
 
5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get specific core empty poll counter.
 
6.  rte_poll_stat_fetch(unsigned int lcore_id);
which is used to get specific core valid poll counter.

7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allow user customize the frequency of power state.

8.  rte_empty_poll_setup_timer(void);
which is used to setup the timer/callback to process all above counter.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 lib/librte_power/Makefile         |   3 +-
 lib/librte_power/meson.build      |   5 +-
 lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
 lib/librte_power/rte_empty_poll.h | 135 ++++++++++
 4 files changed, 669 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_empty_poll.c
 create mode 100644 lib/librte_power/rte_empty_poll.h
  

Comments

Hunt, David June 14, 2018, 10:59 a.m. UTC | #1
Hi Liang



On 8/6/2018 10:57 AM, Liang Ma wrote:
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
>     * No indication of overload conditions
>
>     * User do not know how much real load is on a system meaning resulted in
>       wasted energy as no power management is utilized
>
> Tried and failed schemes include calculating the cycles required from
> the load on the core, in other words the business. For example,

Typo, I think this should be busyness. :)

> how many cycles it costs to handle each packet and determining the frequency
> cost per core. Due to the varying nature of traffic, types of frames and cost
> in cycles to process, this mechanism becomes complex quickly where
> a simple scheme is required to solve the problems.
>
> 2. Proposed solution
>
> For all polling mechanism, the proposed solution focus on how many times
> empty poll executed instead of calculating how many cycles it cost to handle
> each packet. The less empty poll number means current core is busy with
> processing workload, therefore,  the higher frequency is needed. The high
> empty poll number indicate current core has lots spare time, therefore,
> we can lower the frequency.
>
> 2.1 Power state definition:
>
> 	LOW:  the frequency is used for purge mode.
>
> 	MED:  the frequency is used to process modest traffic workload.
>
> 	HIGH: the frequency is used to process busy traffic workload.
>
> 2.2 There are two phases to establish the power management system:
>
> 	a.Initialization/Training phase. There is no traffic pass-through,
> 	  the system will test average empty poll numbers  with LOW/MED/HIGH
> 	  power state. Those average empty poll numbers  will be the baseline
> 	  for the normal phase. The system will collect all core's counter
> 	  every 100ms. The Training phase will take 5 seconds.

I suggest that you add in that all Rx packets are blocked during this 
phase, and we measure the number of empty polls possible for each of the 
frequencies (low, medium and high).

> 	b.Normal phase. When the real traffic pass-though, the system will
> 	  compare run-time empty poll moving average value with base line
> 	  then make decision to move to HIGH power state of MED  power state.
> 	  The system will collect all core's counter every 100ms.

I think this may need to be reduced from 100ms. The reaction to an 
increase in traffic would need
to be quicker than this to avoid buffer overflow.

>
> 3. Proposed  API
>
> 1.  rte_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>   
> 2.  rte_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>   
> 3.  rte_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>   
> 4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>   
> 5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>   
> 6.  rte_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allow user customize the frequency of power state.
>
> 8.  rte_empty_poll_setup_timer(void);
> which is used to setup the timer/callback to process all above counter.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>   lib/librte_power/Makefile         |   3 +-
>   lib/librte_power/meson.build      |   5 +-
>   lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
>   lib/librte_power/rte_empty_poll.h | 135 ++++++++++
>   4 files changed, 669 insertions(+), 3 deletions(-)
>   create mode 100644 lib/librte_power/rte_empty_poll.c
>   create mode 100644 lib/librte_power/rte_empty_poll.h
>
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..dbc175a 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -16,8 +16,9 @@ LIBABIVER := 1
>   # all source are stored in SRCS-y
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
>   
>   # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
>   

I would propose re-naming the .h and .c file to 
rte_*power_*empty_poll.[h/c], so we can
associate it with the power library.


>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..5270fa3 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>   	build = false
>   endif
>   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_empty_poll.c')
> +headers = files('rte_power.h','rte_empty_poll.h')
> diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
> new file mode 100644
> index 0000000..57bd63b
> --- /dev/null
> +++ b/lib/librte_power/rte_empty_poll.c
> @@ -0,0 +1,529 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +
> +
> +
> +#include "rte_power.h"
> +#include "rte_empty_poll.h"
> +
> +
> +
> +
> +#define INTERVALS_PER_SECOND 10     /* (100ms) */
> +#define SECONDS_TO_TRAIN_FOR  5
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;

Please use #define for magic numbers

> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
> +		return 1000UL; /* Value to make us fail */
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent > poll_stats->thresh[poll_stats->cur_freq].
> +				threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, HGH_BUSY);
> +
> +		} else {
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent < poll_stats->thresh[poll_stats->cur_freq].
> +				threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
> +		} else
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
> +
> +static int
> +empty_poll_trainning(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter) {
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_trainning(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int
> +rte_empty_poll_stat_init(void)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	freq_index[LOW] = 14;
> +	freq_index[MED] = 9;
> +	freq_index[HGH] = 1;
> +

Are these default frequency indexes? Can they be changed? Maybe say so 
in a comment.

> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 5 seconds worth of training */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int
> +rte_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int
> +rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t
> +rte_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t
> +rte_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> +
> +void
> +rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
> +{
> +	switch (index) {
> +
> +	case LOW:
> +		freq_index[LOW] = limit;
> +		break;
> +
> +	case MED:
> +		freq_index[MED] = limit;
> +		break;
> +
> +	case HGH:
> +		freq_index[HGH] = limit;
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +void
> +rte_empty_poll_setup_timer(void)
> +{
> +	int lcore_id = rte_lcore_id();
> +	uint64_t hz = rte_get_timer_hz();
> +
> +	struct  ep_params *ep_ptr = ep_params;
> +
> +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> +
> +	rte_timer_reset_sync(&ep_ptr->timer0,
> +			ep_ptr->interval_ticks,
> +			PERIODICAL,
> +			lcore_id,
> +			empty_poll_detection,
> +			(void *)ep_ptr);
> +
> +}
> diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
> new file mode 100644
> index 0000000..7e036ee
> --- /dev/null
> +++ b/lib/librte_power/rte_empty_poll.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <sys/queue.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20

Should probably be RTE_MAX_LCORE_FREQS

> +
> +#define BINS_AV 4 /* has to be ^2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         31 /*any reseanable prime number should work*/
> +
> +
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,

HGH should be HIGH. Where we see this on it's own in the code, it's not 
obvious that it's related to MEDIUM and LOW.  Also, maybe MED should be 
MEDIUM.

Would suggest prefixing these with FREQ_. Makes the code more readable.

> +	NUM_FREQ = NUM_FREQS

This should probably be the number of elements in freq_val rather than 
NUM_FREQS

I think there is some confusion in the code about the number of 
frequencies. Any array where it's holding all the possible frequencies 
should use RTE_MAX_LCORE_FREQS.
But it looks ti me that the freq_index array only holds three values, 
the indexes for Low, Medium and High.


> +};
> +
> +
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};

I'm not sure that the NORMAL, BUSY and PURGE words add any value here. 
How about MEDIUM, HIGH and LOW?

> +
> +/* queue stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;

Suggest a comment per line explaining what the names of these variables 
mean. edpi, ppi, etc.


> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +struct ep_params {
> +
> +	/* timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +	struct rte_timer timer0;
> +
> +	struct stats_data wrk_data;
> +};
> +
> +
> +int rte_empty_poll_stat_init(void);
> +
> +void rte_empty_poll_stat_free(void);
> +
> +int rte_empty_poll_stat_update(unsigned int lcore_id);
> +
> +int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
> +
> +void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> +
> +void rte_empty_poll_setup_timer(void);

All the function prototypes need documentation. Please see rte_power.h.
And please makes sure that Doxygen documentation generates correctly 
from the comments.
Suggest prefixing the functions with rte_power_

> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif

A few more comments throughout the code would be good.
  
Liang, Ma June 18, 2018, 4:11 p.m. UTC | #2
On 14 Jun 11:59, Hunt, David wrote:
> Hi Liang
> 
> 
> 
> On 8/6/2018 10:57 AM, Liang Ma wrote:
> > 1. Abstract
> > 
> > For packet processing workloads such as DPDK polling is continuous.
> > This means CPU cores always show 100% busy independent of how much work
> > those cores are doing. It is critical to accurately determine how busy
> > a core is hugely important for the following reasons:
> > 
> >     * No indication of overload conditions
> > 
> >     * User do not know how much real load is on a system meaning resulted in
> >       wasted energy as no power management is utilized
> > 
> > Tried and failed schemes include calculating the cycles required from
> > the load on the core, in other words the business. For example,
> 
> Typo, I think this should be busyness. :)
agree, I will update in next patch.
> 
> > how many cycles it costs to handle each packet and determining the frequency
> > cost per core. Due to the varying nature of traffic, types of frames and cost
> > in cycles to process, this mechanism becomes complex quickly where
> > a simple scheme is required to solve the problems.
> > 
> > 2. Proposed solution
> > 
> > For all polling mechanism, the proposed solution focus on how many times
> > empty poll executed instead of calculating how many cycles it cost to handle
> > each packet. The less empty poll number means current core is busy with
> > processing workload, therefore,  the higher frequency is needed. The high
> > empty poll number indicate current core has lots spare time, therefore,
> > we can lower the frequency.
> > 
> > 2.1 Power state definition:
> > 
> > 	LOW:  the frequency is used for purge mode.
> > 
> > 	MED:  the frequency is used to process modest traffic workload.
> > 
> > 	HIGH: the frequency is used to process busy traffic workload.
> > 
> > 2.2 There are two phases to establish the power management system:
> > 
> > 	a.Initialization/Training phase. There is no traffic pass-through,
> > 	  the system will test average empty poll numbers  with LOW/MED/HIGH
> > 	  power state. Those average empty poll numbers  will be the baseline
> > 	  for the normal phase. The system will collect all core's counter
> > 	  every 100ms. The Training phase will take 5 seconds.
> 
> I suggest that you add in that all Rx packets are blocked during this phase,
> and we measure the number of empty polls possible for each of the
> frequencies (low, medium and high).
I don't think the Rx packet is blocked,  currently, there is no traffic pass-throughduring training phase.
> 
> > 	b.Normal phase. When the real traffic pass-though, the system will
> > 	  compare run-time empty poll moving average value with base line
> > 	  then make decision to move to HIGH power state of MED  power state.
> > 	  The system will collect all core's counter every 100ms.
> 
> I think this may need to be reduced from 100ms. The reaction to an increase
> in traffic would need
> to be quicker than this to avoid buffer overflow.
> 
agree, I will reduce to 10ms in next patch.
> > 
> > 3. Proposed  API
> > 
> > 1.  rte_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> > 2.  rte_empty_poll_stat_free(void);
> > which is used to free the resource hold by power management system.
> > 3.  rte_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update specific core empty poll counter, not thread safe
> > 4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update specific core valid poll counter, not thread safe
> > 5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core empty poll counter.
> > 6.  rte_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core valid poll counter.
> > 
> > 7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > which allow user customize the frequency of power state.
> > 
> > 8.  rte_empty_poll_setup_timer(void);
> > which is used to setup the timer/callback to process all above counter.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > ---
> >   lib/librte_power/Makefile         |   3 +-
> >   lib/librte_power/meson.build      |   5 +-
> >   lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
> >   lib/librte_power/rte_empty_poll.h | 135 ++++++++++
> >   4 files changed, 669 insertions(+), 3 deletions(-)
> >   create mode 100644 lib/librte_power/rte_empty_poll.c
> >   create mode 100644 lib/librte_power/rte_empty_poll.h
> > 
> > diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> > index 6f85e88..dbc175a 100644
> > --- a/lib/librte_power/Makefile
> > +++ b/lib/librte_power/Makefile
> > @@ -16,8 +16,9 @@ LIBABIVER := 1
> >   # all source are stored in SRCS-y
> >   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
> >   # install this header file
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
> 
> I would propose re-naming the .h and .c file to
> rte_*power_*empty_poll.[h/c], so we can
> associate it with the power library.
> 
> 
> >   include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> > index 253173f..5270fa3 100644
> > --- a/lib/librte_power/meson.build
> > +++ b/lib/librte_power/meson.build
> > @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >   	build = false
> >   endif
> >   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> > -		'power_kvm_vm.c', 'guest_channel.c')
> > -headers = files('rte_power.h')
> > +		'power_kvm_vm.c', 'guest_channel.c',
> > +		'rte_empty_poll.c')
> > +headers = files('rte_power.h','rte_empty_poll.h')
> > diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
> > new file mode 100644
> > index 0000000..57bd63b
> > --- /dev/null
> > +++ b/lib/librte_power/rte_empty_poll.c
> > @@ -0,0 +1,529 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#include <string.h>
> > +
> > +#include <rte_lcore.h>
> > +#include <rte_cycles.h>
> > +#include <rte_atomic.h>
> > +#include <rte_malloc.h>
> > +
> > +
> > +
> > +
> > +#include "rte_power.h"
> > +#include "rte_empty_poll.h"
> > +
> > +
> > +
> > +
> > +#define INTERVALS_PER_SECOND 10     /* (100ms) */
> > +#define SECONDS_TO_TRAIN_FOR  5
> > +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> > +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> > +#define DEFAULT_CYCLES_PER_PACKET 800
> > +
> > +
> > +static struct ep_params *ep_params;
> > +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> > +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> > +
> > +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> > +
> > +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> > +
> > +static uint32_t freq_index[NUM_FREQ];
> > +
> > +
> > +
> > +static uint32_t
> > +get_freq_index(enum freq_val index)
> > +{
> > +	return freq_index[index];
> > +}
> > +
> > +
> > +static int
> > +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> > +{
> > +	int err = 0;
> > +	uint32_t power_freq_index;
> > +	if (!specific_freq)
> > +		power_freq_index = get_freq_index(freq);
> > +	else
> > +		power_freq_index = freq;
> > +
> > +	err = rte_power_set_freq(lcore_id, power_freq_index);
> > +
> > +	return err;
> > +}
> > +
> > +
> > +static inline void __attribute__((always_inline))
> > +exit_training_state(struct priority_worker *poll_stats)
> > +{
> > +	RTE_SET_USED(poll_stats);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_training_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->cur_freq = LOW;
> > +	poll_stats->queue_state = TRAINING;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_normal_state(struct priority_worker *poll_stats)
> > +{
> > +	/* Clear the averages arrays and strs */
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = MED;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = MED_NORMAL;
> > +	set_power_freq(poll_stats->lcore_id, MED, false);
> > +
> > +	/* Try here */
> > +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> > +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_busy_state(struct priority_worker *poll_stats)
> > +{
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = HGH;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = HGH_BUSY;
> > +	set_power_freq(poll_stats->lcore_id, HGH, false);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_purge_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->queue_state = LOW_PURGE;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +set_state(struct priority_worker *poll_stats,
> > +		enum queue_state new_state)
> > +{
> > +	enum queue_state old_state = poll_stats->queue_state;
> > +	if (old_state != new_state) {
> > +
> > +		/* Call any old state exit functions */
> > +		if (old_state == TRAINING)
> > +			exit_training_state(poll_stats);
> > +
> > +		/* Call any new state entry functions */
> > +		if (new_state == TRAINING)
> > +			enter_training_state(poll_stats);
> > +		if (new_state == MED_NORMAL)
> > +			enter_normal_state(poll_stats);
> > +		if (new_state == HGH_BUSY)
> > +			enter_busy_state(poll_stats);
> > +		if (new_state == LOW_PURGE)
> > +			enter_purge_state(poll_stats);
> > +	}
> > +}
> > +
> > +
> > +static void
> > +update_training_stats(struct priority_worker *poll_stats,
> > +		uint32_t freq,
> > +		bool specific_freq,
> > +		uint32_t max_train_iter)
> > +{
> > +	RTE_SET_USED(specific_freq);
> > +
> > +	char pfi_str[32];
> > +	uint64_t p0_empty_deq;
> > +
> > +	sprintf(pfi_str, "%02d", freq);
> > +
> > +	if (poll_stats->cur_freq == freq &&
> > +			poll_stats->thresh[freq].trained == false) {
> > +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> > +
> > +			set_power_freq(poll_stats->lcore_id,
> > +					freq, specific_freq);
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +			return;
> > +		} else if (poll_stats->thresh[freq].cur_train_iter
> > +				<= max_train_iter) {
> > +
> > +			p0_empty_deq = poll_stats->empty_dequeues -
> > +				poll_stats->empty_dequeues_prev;
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +		} else {
> > +			if (poll_stats->thresh[freq].trained == false) {
> > +				poll_stats->thresh[freq].base_edpi =
> > +					poll_stats->thresh[freq].base_edpi /
> > +					max_train_iter;
> > +
> > +				/* Add on a factor of 0.05%, this should remove any */
> > +				/* false negatives when the system is 0% busy */
> > +				poll_stats->thresh[freq].base_edpi +=
> > +					poll_stats->thresh[freq].base_edpi / 2000;
> 
> Please use #define for magic numbers
> 
> > +
> > +				poll_stats->thresh[freq].trained = true;
> > +				poll_stats->cur_freq++;
> > +
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static inline uint32_t __attribute__((always_inline))
> > +update_stats(struct priority_worker *poll_stats)
> > +{
> > +	uint64_t tot_edpi = 0, tot_ppi = 0;
> > +	uint32_t j, percent;
> > +
> > +	struct priority_worker *s = poll_stats;
> > +
> > +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> > +
> > +	s->empty_dequeues_prev = s->empty_dequeues;
> > +
> > +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> > +
> > +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> > +
> > +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
> > +		return 1000UL; /* Value to make us fail */
> > +
> > +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> > +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> > +
> > +	for (j = 0; j < BINS_AV; j++) {
> > +		tot_edpi += s->edpi_av[j];
> > +		tot_ppi += s->ppi_av[j];
> > +	}
> > +
> > +	tot_edpi = tot_edpi / BINS_AV;
> > +
> > +	percent = 100 - (uint32_t)((float)tot_edpi /
> > +			(float)s->thresh[s->cur_freq].base_edpi * 100);
> > +
> > +	return (uint32_t)percent;
> > +}
> > +
> > +
> > +static inline void  __attribute__((always_inline))
> > +update_stats_normal(struct priority_worker *poll_stats)
> > +{
> > +	uint32_t percent;
> > +
> > +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> > +		return;
> > +
> > +	percent = update_stats(poll_stats);
> > +
> > +	if (percent > 100)
> > +		return;
> > +
> > +	if (poll_stats->cur_freq == LOW)
> > +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> > +	else if (poll_stats->cur_freq == MED) {
> > +
> > +		if (percent > poll_stats->thresh[poll_stats->cur_freq].
> > +				threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else
> > +				set_state(poll_stats, HGH_BUSY);
> > +
> > +		} else {
> > +			/* reset */
> > +			poll_stats->threshold_ctr = 0;
> > +		}
> > +
> > +	} else if (poll_stats->cur_freq == HGH) {
> > +
> > +		if (percent < poll_stats->thresh[poll_stats->cur_freq].
> > +				threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else
> > +				set_state(poll_stats, MED_NORMAL);
> > +		} else
> > +			/* reset */
> > +			poll_stats->threshold_ctr = 0;
> > +
> > +
> > +	}
> > +}
> > +
> > +static int
> > +empty_poll_trainning(struct priority_worker *poll_stats,
> > +		uint32_t max_train_iter) {
> > +
> > +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> > +		poll_stats->iter_counter++;
> > +		return 0;
> > +	}
> > +
> > +
> > +	update_training_stats(poll_stats,
> > +			LOW,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			MED,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			HGH,
> > +			false,
> > +			max_train_iter);
> > +
> > +
> > +	if (poll_stats->thresh[LOW].trained == true
> > +			&& poll_stats->thresh[MED].trained == true
> > +			&& poll_stats->thresh[HGH].trained == true) {
> > +
> > +		set_state(poll_stats, MED_NORMAL);
> > +
> > +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> > +				poll_stats->lcore_id);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void
> > +empty_poll_detection(struct rte_timer *tim,
> > +		void *arg)
> > +{
> > +
> > +	uint32_t i;
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	RTE_SET_USED(tim);
> > +
> > +	RTE_SET_USED(arg);
> > +
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> > +
> > +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> > +			continue;
> > +
> > +		switch (poll_stats->queue_state) {
> > +		case(TRAINING):
> > +			empty_poll_trainning(poll_stats,
> > +					ep_params->max_train_iter);
> > +			break;
> > +
> > +		case(HGH_BUSY):
> > +		case(MED_NORMAL):
> > +			update_stats_normal(poll_stats);
> > +
> > +			break;
> > +
> > +		case(LOW_PURGE):
> > +			break;
> > +		default:
> > +			break;
> > +
> > +		}
> > +
> > +	}
> > +
> > +}
> > +
> > +int
> > +rte_empty_poll_stat_init(void)
> > +{
> > +	uint32_t i;
> > +	/* Allocate the ep_params structure */
> > +	ep_params = rte_zmalloc_socket(NULL,
> > +			sizeof(struct ep_params),
> > +			0,
> > +			rte_socket_id());
> > +
> > +	if (!ep_params)
> > +		rte_panic("Cannot allocate heap memory for ep_params "
> > +				"for socket %d\n", rte_socket_id());
> > +
> > +	freq_index[LOW] = 14;
> > +	freq_index[MED] = 9;
> > +	freq_index[HGH] = 1;
> > +
> 
> Are these default frequency indexes? Can they be changed? Maybe say so in a
> comment.
> 
> > +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> > +
> > +	/* 5 seconds worth of training */
> > +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> > +
> > +	struct stats_data *w = &ep_params->wrk_data;
> > +
> > +	/* initialize all wrk_stats state */
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		if (rte_lcore_is_enabled(i) == 0)
> > +			continue;
> > +
> > +		set_state(&w->wrk_stats[i], TRAINING);
> > +		/*init the freqs table */
> > +		total_avail_freqs[i] = rte_power_freqs(i,
> > +				avail_freqs[i],
> > +				NUM_FREQS);
> > +
> > +		if (get_freq_index(LOW) > total_avail_freqs[i])
> > +			return -1;
> > +
> > +	}
> > +
> > +
> > +	return 0;
> > +}
> > +
> > +void
> > +rte_empty_poll_stat_free(void)
> > +{
> > +
> > +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> > +
> > +	if (ep_params != NULL)
> > +		rte_free(ep_params);
> > +}
> > +
> > +int
> > +rte_empty_poll_stat_update(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->empty_dequeues++;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> > +{
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->num_dequeue_pkts += nb_pkt;
> > +
> > +	return 0;
> > +}
> > +
> > +
> > +uint64_t
> > +rte_empty_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->empty_dequeues;
> > +}
> > +
> > +uint64_t
> > +rte_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->num_dequeue_pkts;
> > +}
> > +
> > +void
> > +rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
> > +{
> > +	switch (index) {
> > +
> > +	case LOW:
> > +		freq_index[LOW] = limit;
> > +		break;
> > +
> > +	case MED:
> > +		freq_index[MED] = limit;
> > +		break;
> > +
> > +	case HGH:
> > +		freq_index[HGH] = limit;
> > +		break;
> > +	default:
> > +		break;
> > +	}
> > +}
> > +
> > +void
> > +rte_empty_poll_setup_timer(void)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	uint64_t hz = rte_get_timer_hz();
> > +
> > +	struct  ep_params *ep_ptr = ep_params;
> > +
> > +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> > +
> > +	rte_timer_reset_sync(&ep_ptr->timer0,
> > +			ep_ptr->interval_ticks,
> > +			PERIODICAL,
> > +			lcore_id,
> > +			empty_poll_detection,
> > +			(void *)ep_ptr);
> > +
> > +}
> > diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
> > new file mode 100644
> > index 0000000..7e036ee
> > --- /dev/null
> > +++ b/lib/librte_power/rte_empty_poll.h
> > @@ -0,0 +1,135 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#ifndef _RTE_EMPTY_POLL_H
> > +#define _RTE_EMPTY_POLL_H
> > +
> > +/**
> > + * @file
> > + * RTE Power Management
> > + */
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <sys/queue.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_byteorder.h>
> > +#include <rte_log.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_power.h>
> > +#include <rte_timer.h>
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define NUM_FREQS 20
> 
> Should probably be RTE_MAX_LCORE_FREQS
> 
> > +
> > +#define BINS_AV 4 /* has to be ^2 */
> > +
> > +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> > +
> > +#define NUM_PRIORITIES          2
> > +
> > +#define NUM_NODES         31 /*any reseanable prime number should work*/
> > +
> > +
> > +enum freq_val {
> > +	LOW,
> > +	MED,
> > +	HGH,
> 
> HGH should be HIGH. Where we see this on it's own in the code, it's not
> obvious that it's related to MEDIUM and LOW.  Also, maybe MED should be
> MEDIUM.
> 
> Would suggest prefixing these with FREQ_. Makes the code more readable.
> 
> > +	NUM_FREQ = NUM_FREQS
> 
> This should probably be the number of elements in freq_val rather than
> NUM_FREQS
> 
> I think there is some confusion in the code about the number of frequencies.
> Any array where it's holding all the possible frequencies should use
> RTE_MAX_LCORE_FREQS.
> But it looks ti me that the freq_index array only holds three values, the
> indexes for Low, Medium and High.
> 
> 
> > +};
> > +
> > +
> > +enum queue_state {
> > +	TRAINING, /* NO TRAFFIC */
> > +	MED_NORMAL,   /* MED */
> > +	HGH_BUSY,     /* HIGH */
> > +	LOW_PURGE,    /* LOW */
> > +};
> 
> I'm not sure that the NORMAL, BUSY and PURGE words add any value here. How
> about MEDIUM, HIGH and LOW?
> 
> > +
> > +/* queue stats */
> > +struct freq_threshold {
> > +
> > +	uint64_t base_edpi;
> > +	bool trained;
> > +	uint32_t threshold_percent;
> > +	uint32_t cur_train_iter;
> > +};
> > +
> > +
> > +struct priority_worker {
> > +
> > +	/* Current dequeue and throughput counts */
> > +	/* These 2 are written to by the worker threads */
> > +	/* So keep them on their own cache line */
> > +	uint64_t empty_dequeues;
> > +	uint64_t num_dequeue_pkts;
> > +
> > +	enum queue_state queue_state;
> > +
> > +	uint64_t empty_dequeues_prev;
> > +	uint64_t num_dequeue_pkts_prev;
> > +
> > +	/* Used for training only */
> > +	struct freq_threshold thresh[NUM_FREQ];
> > +	enum freq_val cur_freq;
> > +
> > +	/* bucket arrays to calculate the averages */
> > +	uint64_t edpi_av[BINS_AV];
> > +	uint32_t  ec;
> > +	uint64_t ppi_av[BINS_AV];
> > +	uint32_t  pc;
> 
> Suggest a comment per line explaining what the names of these variables
> mean. edpi, ppi, etc.
> 
> 
> > +
> > +	uint32_t lcore_id;
> > +	uint32_t iter_counter;
> > +	uint32_t threshold_ctr;
> > +	uint32_t display_ctr;
> > +	uint8_t  dev_id;
> > +
> > +} __rte_cache_aligned;
> > +
> > +
> > +struct stats_data {
> > +
> > +	struct priority_worker wrk_stats[NUM_NODES];
> > +
> > +	/* flag to stop rx threads processing packets until training over */
> > +	bool start_rx;
> > +
> > +};
> > +
> > +struct ep_params {
> > +
> > +	/* timer related stuff */
> > +	uint64_t interval_ticks;
> > +	uint32_t max_train_iter;
> > +	struct rte_timer timer0;
> > +
> > +	struct stats_data wrk_data;
> > +};
> > +
> > +
> > +int rte_empty_poll_stat_init(void);
> > +
> > +void rte_empty_poll_stat_free(void);
> > +
> > +int rte_empty_poll_stat_update(unsigned int lcore_id);
> > +
> > +int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > +
> > +uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > +
> > +void rte_empty_poll_setup_timer(void);
> 
> All the function prototypes need documentation. Please see rte_power.h.
> And please makes sure that Doxygen documentation generates correctly from
> the comments.
> Suggest prefixing the functions with rte_power_
> 
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> 
> A few more comments throughout the code would be good.
> 
> 
>
  

Patch

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..dbc175a 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -16,8 +16,9 @@  LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..5270fa3 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@  if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_empty_poll.c')
+headers = files('rte_power.h','rte_empty_poll.h')
diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
new file mode 100644
index 0000000..57bd63b
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.c
@@ -0,0 +1,529 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+
+
+
+#include "rte_power.h"
+#include "rte_empty_poll.h"
+
+
+
+
+#define INTERVALS_PER_SECOND 10     /* (100ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000UL; /* Value to make us fail */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent > poll_stats->thresh[poll_stats->cur_freq].
+				threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent < poll_stats->thresh[poll_stats->cur_freq].
+				threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter) {
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
new file mode 100644
index 0000000..7e036ee
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.h
@@ -0,0 +1,135 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* has to be ^2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /*any reseanable prime number should work*/
+
+
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* queue stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+struct ep_params {
+
+	/* timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+	struct rte_timer timer0;
+
+	struct stats_data wrk_data;
+};
+
+
+int rte_empty_poll_stat_init(void);
+
+void rte_empty_poll_stat_free(void);
+
+int rte_empty_poll_stat_update(unsigned int lcore_id);
+
+int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
+
+uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
+
+void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+void rte_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif