From patchwork Fri Oct 19 11:07:18 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Liang, Ma" X-Patchwork-Id: 47077 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 5D6601B53F; Fri, 19 Oct 2018 13:07:30 +0200 (CEST) Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by dpdk.org (Postfix) with ESMTP id 7DC6A1B4E8 for ; Fri, 19 Oct 2018 13:07:28 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Oct 2018 04:07:27 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,399,1534834800"; d="scan'208";a="83906810" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga006.jf.intel.com with ESMTP; 19 Oct 2018 04:07:25 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w9JB7O6H014335; Fri, 19 Oct 2018 12:07:25 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w9JB7OHW016845; Fri, 19 Oct 2018 12:07:24 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w9JB7OOI016841; Fri, 19 Oct 2018 12:07:24 +0100 From: Liang Ma To: david.hunt@intel.com Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com, marko.kovacevic@intel.com, Liang Ma Date: Fri, 19 Oct 2018 12:07:18 +0100 Message-Id: <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> X-Mailer: git-send-email 1.7.7.4 In-Reply-To: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> References: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v12 1/5] 
lib/librte_power: traffic pattern aware power control X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev"

1. Abstract

For packet processing workloads such as DPDK, polling is continuous. This means CPU cores always show 100% busy, independent of how much work those cores are doing. Accurately determining how busy a core is matters, because without it:

* There is no indication of overload conditions.
* The user does not know how much real load is on a system, resulting in wasted energy as no power management is utilized.

Compared to the original l3fwd-power design, instead of going to sleep after detecting an empty poll, the new mechanism just lowers the core frequency. As a result, the application does not stop polling the device, which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the core frequency (including turbo) to make a best effort for intensive traffic. This gives more flexible and balanced traffic awareness than the standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many empty polls are executed. A low number of empty polls means the current core is busy processing its workload, so a higher frequency is needed. A high number of empty polls indicates the current core is not doing any real work, so the frequency can be lowered to save power.

In the current implementation, each core has one empty-poll counter, which assumes one core is dedicated to one queue. This will need to be expanded in the future to support multiple queues per core.

2.1 Power state definition:

LOW: Not currently used, reserved for future use.
MED: the frequency used to process a modest traffic workload.
HIGH: the frequency used to process a busy traffic workload.
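
The relationship between the empty-poll count and the MED/HIGH decision can be sketched as follows. This is a minimal illustration, not the code from this patch: the sketch_* names are hypothetical, the 70%/30% values mirror the default thresholds defined later in this series, and the sketch leaves out both the averaging over several intervals and the requirement that a threshold stay crossed for a sustained number of intervals before the state changes.

```c
#include <stdint.h>

/* Hypothetical names for illustration only; not part of the patch. */
enum sketch_state { SKETCH_MED, SKETCH_HGH };

/* Assumed to match the default thresholds used by this series. */
#define SKETCH_MED_TO_HGH_PERCENT 70
#define SKETCH_HGH_TO_MED_PERCENT 30

/*
 * Busyness in percent for one interval: 0 when the core saw as many
 * empty polls as the idle baseline, 100 when it saw none.
 * Simplification: a missing baseline, or a count above the baseline,
 * is reported as 0 here.
 */
static uint32_t
sketch_busy_percent(uint64_t empty_polls, uint64_t base_empty_polls)
{
	if (base_empty_polls == 0 || empty_polls > base_empty_polls)
		return 0;
	return (uint32_t)(100 - (empty_polls * 100) / base_empty_polls);
}

/* Hysteresis between MED and HIGH; LOW is reserved and not modelled. */
static enum sketch_state
sketch_next_state(enum sketch_state cur, uint32_t busy_percent)
{
	if (cur == SKETCH_MED && busy_percent > SKETCH_MED_TO_HGH_PERCENT)
		return SKETCH_HGH;
	if (cur == SKETCH_HGH && busy_percent < SKETCH_HGH_TO_MED_PERCENT)
		return SKETCH_MED;
	return cur;
}
```

Using two separate thresholds keeps a core whose load hovers around a single value from flipping between frequencies on every interval.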
2.2 There are two phases to establish the power management system:

a. Initialization/Training phase. The training phase is necessary in order to figure out the system polling baseline numbers from idle to busy. The highest poll count will be during idle, where all polls are empty. These poll counts will differ between systems due to the many possible processor micro-architectures, cache and device configurations, hence the training phase. In the training phase, traffic is blocked so the training algorithm can average the empty-poll numbers for the LOW, MED and HIGH power states in order to create a baseline. Each core's counters are collected every 10ms, and the training phase takes 2 seconds. Training is disabled in the default configuration, in which case the default parameters are applied. The sample app can still trigger training if needed. Once the training phase has been executed once on a system, the application can then be started with the relevant thresholds provided on the command line, allowing the application to start passing traffic immediately.

b. Normal phase. Traffic starts immediately based on the default thresholds, or based on the user supplied thresholds via the command line parameters. The run-time poll counts are compared with the baseline and a decision is taken to move to the MED power state or the HIGH power state. The counters are evaluated every 10ms.

3. Proposed API

1. rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, struct ep_policy *policy); which is used to initialize the power management system.

2. rte_power_empty_poll_stat_free(void); which is used to free the resources held by the power management system.

3. rte_power_empty_poll_stat_update(unsigned int lcore_id); which is used to update the empty poll counter of a specific core; not thread safe.

4. rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt); which is used to update the valid poll counter of a specific core; not thread safe.

5.
rte_power_empty_poll_stat_fetch(unsigned int lcore_id); which is used to get the empty poll counter of a specific core.

6. rte_power_poll_stat_fetch(unsigned int lcore_id); which is used to get the valid poll counter of a specific core.

7. rte_empty_poll_detection(struct rte_timer *tim, void *arg); which is used to detect empty poll state changes and then take action.

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.
v9: minor git log update.
v10: update due to the code review comments.
v12: remove rte_panic

Signed-off-by: Liang Ma
Reviewed-by: Lei Yao
Acked-by: David Hunt
---
lib/librte_power/Makefile | 6 +-
lib/librte_power/meson.build | 5 +-
lib/librte_power/rte_power_empty_poll.c | 544 ++++++++++++++++++++++++++++++++
lib/librte_power/rte_power_empty_poll.h | 223 +++++++++++++
lib/librte_power/rte_power_version.map | 13 +
5 files changed, 787 insertions(+), 4 deletions(-)
create mode 100644 lib/librte_power/rte_power_empty_poll.c
create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile index 6f85e88..a8f1301 100644 --- a/lib/librte_power/Makefile +++ b/lib/librte_power/Makefile @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk # library name LIB = librte_power.a +CFLAGS += -DALLOW_EXPERIMENTAL_API CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing -LDLIBS += -lrte_eal +LDLIBS += -lrte_eal -lrte_timer EXPORT_MAP := rte_power_version.map @@ -16,8 +17,9 @@ LIBABIVER := 1 # all source are stored in SRCS-y SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c # install this header file -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include
:= rte_power.h rte_power_empty_poll.h include $(RTE_SDK)/mk/rte.lib.mk diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build index 253173f..63957eb 100644 --- a/lib/librte_power/meson.build +++ b/lib/librte_power/meson.build @@ -5,5 +5,6 @@ if host_machine.system() != 'linux' build = false endif sources = files('rte_power.c', 'power_acpi_cpufreq.c', - 'power_kvm_vm.c', 'guest_channel.c') -headers = files('rte_power.h') + 'power_kvm_vm.c', 'guest_channel.c', + 'rte_power_empty_poll.c') +headers = files('rte_power.h','rte_power_empty_poll.h') diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c new file mode 100644 index 0000000..c1e10e0 --- /dev/null +++ b/lib/librte_power/rte_power_empty_poll.c @@ -0,0 +1,544 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2018 Intel Corporation + */ + +#include + +#include +#include +#include +#include + +#include "rte_power.h" +#include "rte_power_empty_poll.h" + +#define INTERVALS_PER_SECOND 100 /* (10ms) */ +#define SECONDS_TO_TRAIN_FOR 2 +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70 +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30 +#define DEFAULT_CYCLES_PER_PACKET 800 + +static struct ep_params *ep_params; +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD; +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD; + +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS]; + +static uint32_t total_avail_freqs[RTE_MAX_LCORE]; + +static uint32_t freq_index[NUM_FREQ]; + +static uint32_t +get_freq_index(enum freq_val index) +{ + return freq_index[index]; +} + + +static int +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq) +{ + int err = 0; + uint32_t power_freq_index; + if (!specific_freq) + power_freq_index = get_freq_index(freq); + else + power_freq_index = freq; + + err = rte_power_set_freq(lcore_id, power_freq_index); + + return err; +} + + +static inline void 
__attribute__((always_inline)) +exit_training_state(struct priority_worker *poll_stats) +{ + RTE_SET_USED(poll_stats); +} + +static inline void __attribute__((always_inline)) +enter_training_state(struct priority_worker *poll_stats) +{ + poll_stats->iter_counter = 0; + poll_stats->cur_freq = LOW; + poll_stats->queue_state = TRAINING; +} + +static inline void __attribute__((always_inline)) +enter_normal_state(struct priority_worker *poll_stats) +{ + /* Clear the averages arrays and strs */ + memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av)); + poll_stats->ec = 0; + memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av)); + poll_stats->pc = 0; + + poll_stats->cur_freq = MED; + poll_stats->iter_counter = 0; + poll_stats->threshold_ctr = 0; + poll_stats->queue_state = MED_NORMAL; + RTE_LOG(INFO, POWER, "Set the power freq to MED\n"); + set_power_freq(poll_stats->lcore_id, MED, false); + + poll_stats->thresh[MED].threshold_percent = med_to_high_threshold; + poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold; +} + +static inline void __attribute__((always_inline)) +enter_busy_state(struct priority_worker *poll_stats) +{ + memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av)); + poll_stats->ec = 0; + memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av)); + poll_stats->pc = 0; + + poll_stats->cur_freq = HGH; + poll_stats->iter_counter = 0; + poll_stats->threshold_ctr = 0; + poll_stats->queue_state = HGH_BUSY; + set_power_freq(poll_stats->lcore_id, HGH, false); +} + +static inline void __attribute__((always_inline)) +enter_purge_state(struct priority_worker *poll_stats) +{ + poll_stats->iter_counter = 0; + poll_stats->queue_state = LOW_PURGE; +} + +static inline void __attribute__((always_inline)) +set_state(struct priority_worker *poll_stats, + enum queue_state new_state) +{ + enum queue_state old_state = poll_stats->queue_state; + if (old_state != new_state) { + + /* Call any old state exit functions */ + if (old_state == TRAINING) + 
exit_training_state(poll_stats); + + /* Call any new state entry functions */ + if (new_state == TRAINING) + enter_training_state(poll_stats); + if (new_state == MED_NORMAL) + enter_normal_state(poll_stats); + if (new_state == HGH_BUSY) + enter_busy_state(poll_stats); + if (new_state == LOW_PURGE) + enter_purge_state(poll_stats); + } +} + +static inline void __attribute__((always_inline)) +set_policy(struct priority_worker *poll_stats, + struct ep_policy *policy) +{ + set_state(poll_stats, policy->state); + + if (policy->state == TRAINING) + return; + + poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi; + poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi; + + poll_stats->thresh[MED_NORMAL].trained = true; + poll_stats->thresh[HGH_BUSY].trained = true; + +} + +static void +update_training_stats(struct priority_worker *poll_stats, + uint32_t freq, + bool specific_freq, + uint32_t max_train_iter) +{ + RTE_SET_USED(specific_freq); + + char pfi_str[32]; + uint64_t p0_empty_deq; + + sprintf(pfi_str, "%02d", freq); + + if (poll_stats->cur_freq == freq && + poll_stats->thresh[freq].trained == false) { + if (poll_stats->thresh[freq].cur_train_iter == 0) { + + set_power_freq(poll_stats->lcore_id, + freq, specific_freq); + + poll_stats->empty_dequeues_prev = + poll_stats->empty_dequeues; + + poll_stats->thresh[freq].cur_train_iter++; + + return; + } else if (poll_stats->thresh[freq].cur_train_iter + <= max_train_iter) { + + p0_empty_deq = poll_stats->empty_dequeues - + poll_stats->empty_dequeues_prev; + + poll_stats->empty_dequeues_prev = + poll_stats->empty_dequeues; + + poll_stats->thresh[freq].base_edpi += p0_empty_deq; + poll_stats->thresh[freq].cur_train_iter++; + + } else { + if (poll_stats->thresh[freq].trained == false) { + poll_stats->thresh[freq].base_edpi = + poll_stats->thresh[freq].base_edpi / + max_train_iter; + + /* Add on a factor of 0.05% + * this should remove any + * false negatives when the system is 0% busy + */ + 
poll_stats->thresh[freq].base_edpi += + poll_stats->thresh[freq].base_edpi / 2000; + + poll_stats->thresh[freq].trained = true; + poll_stats->cur_freq++; + + } + } + } +} + +static inline uint32_t __attribute__((always_inline)) +update_stats(struct priority_worker *poll_stats) +{ + uint64_t tot_edpi = 0, tot_ppi = 0; + uint32_t j, percent; + + struct priority_worker *s = poll_stats; + + uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev; + + s->empty_dequeues_prev = s->empty_dequeues; + + uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev; + + s->num_dequeue_pkts_prev = s->num_dequeue_pkts; + + if (s->thresh[s->cur_freq].base_edpi < cur_edpi) { + + /* edpi mean empty poll counter difference per interval */ + RTE_LOG(DEBUG, POWER, "cur_edpi is too large " + "cur edpi %ld " + "base edpi %ld\n", + cur_edpi, + s->thresh[s->cur_freq].base_edpi); + /* Value to make us fail need debug log*/ + return 1000UL; + } + + s->edpi_av[s->ec++ % BINS_AV] = cur_edpi; + s->ppi_av[s->pc++ % BINS_AV] = ppi; + + for (j = 0; j < BINS_AV; j++) { + tot_edpi += s->edpi_av[j]; + tot_ppi += s->ppi_av[j]; + } + + tot_edpi = tot_edpi / BINS_AV; + + percent = 100 - (uint32_t)(((float)tot_edpi / + (float)s->thresh[s->cur_freq].base_edpi) * 100); + + return (uint32_t)percent; +} + + +static inline void __attribute__((always_inline)) +update_stats_normal(struct priority_worker *poll_stats) +{ + uint32_t percent; + + if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) { + + enum freq_val cur_freq = poll_stats->cur_freq; + + /* edpi mean empty poll counter difference per interval */ + RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n", + cur_freq, + poll_stats->thresh[cur_freq].base_edpi); + return; + } + + percent = update_stats(poll_stats); + + if (percent > 100) { + /* edpi mean empty poll counter difference per interval */ + RTE_LOG(DEBUG, POWER, "Edpi is bigger than threshold\n"); + return; + } + + if (poll_stats->cur_freq == LOW) + RTE_LOG(INFO, POWER, "Purge 
Mode is not currently supported\n"); + else if (poll_stats->cur_freq == MED) { + + if (percent > + poll_stats->thresh[MED].threshold_percent) { + + if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND) + poll_stats->threshold_ctr++; + else { + set_state(poll_stats, HGH_BUSY); + RTE_LOG(INFO, POWER, "MOVE to HGH\n"); + } + + } else { + /* reset */ + poll_stats->threshold_ctr = 0; + } + + } else if (poll_stats->cur_freq == HGH) { + + if (percent < + poll_stats->thresh[HGH].threshold_percent) { + + if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND) + poll_stats->threshold_ctr++; + else { + set_state(poll_stats, MED_NORMAL); + RTE_LOG(INFO, POWER, "MOVE to MED\n"); + } + } else { + /* reset */ + poll_stats->threshold_ctr = 0; + } + + } +} + +static int +empty_poll_training(struct priority_worker *poll_stats, + uint32_t max_train_iter) +{ + + if (poll_stats->iter_counter < INTERVALS_PER_SECOND) { + poll_stats->iter_counter++; + return 0; + } + + + update_training_stats(poll_stats, + LOW, + false, + max_train_iter); + + update_training_stats(poll_stats, + MED, + false, + max_train_iter); + + update_training_stats(poll_stats, + HGH, + false, + max_train_iter); + + + if (poll_stats->thresh[LOW].trained == true + && poll_stats->thresh[MED].trained == true + && poll_stats->thresh[HGH].trained == true) { + + set_state(poll_stats, MED_NORMAL); + + RTE_LOG(INFO, POWER, "LOW threshold is %lu\n", + poll_stats->thresh[LOW].base_edpi); + + RTE_LOG(INFO, POWER, "MED threshold is %lu\n", + poll_stats->thresh[MED].base_edpi); + + + RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n", + poll_stats->thresh[HGH].base_edpi); + + RTE_LOG(INFO, POWER, "Training is Complete for %d\n", + poll_stats->lcore_id); + } + + return 0; +} + +void __rte_experimental +rte_empty_poll_detection(struct rte_timer *tim, void *arg) +{ + + uint32_t i; + + struct priority_worker *poll_stats; + + RTE_SET_USED(tim); + + RTE_SET_USED(arg); + + for (i = 0; i < NUM_NODES; i++) { + + poll_stats = 
&(ep_params->wrk_data.wrk_stats[i]); + + if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0) + continue; + + switch (poll_stats->queue_state) { + case(TRAINING): + empty_poll_training(poll_stats, + ep_params->max_train_iter); + break; + + case(HGH_BUSY): + case(MED_NORMAL): + update_stats_normal(poll_stats); + break; + + case(LOW_PURGE): + break; + default: + break; + + } + + } + +} + +int __rte_experimental +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, + struct ep_policy *policy) +{ + uint32_t i; + /* Allocate the ep_params structure */ + ep_params = rte_zmalloc_socket(NULL, + sizeof(struct ep_params), + 0, + rte_socket_id()); + + if (!ep_params) + return -1; + + if (freq_tlb == NULL) { + freq_index[LOW] = 14; + freq_index[MED] = 9; + freq_index[HGH] = 1; + } else { + freq_index[LOW] = freq_tlb[LOW]; + freq_index[MED] = freq_tlb[MED]; + freq_index[HGH] = freq_tlb[HGH]; + } + + RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n"); + + /* Train for pre-defined period */ + ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR; + + struct stats_data *w = &ep_params->wrk_data; + + *eptr = ep_params; + + /* initialize all wrk_stats state */ + for (i = 0; i < NUM_NODES; i++) { + + if (rte_lcore_is_enabled(i) == 0) + continue; + /*init the freqs table */ + total_avail_freqs[i] = rte_power_freqs(i, + avail_freqs[i], + NUM_FREQS); + + RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n", + total_avail_freqs[i], + i); + + if (get_freq_index(LOW) > total_avail_freqs[i]) + return -1; + + if (rte_get_master_lcore() != i) { + w->wrk_stats[i].lcore_id = i; + set_policy(&w->wrk_stats[i], policy); + } + } + + return 0; +} + +void __rte_experimental +rte_power_empty_poll_stat_free(void) +{ + + RTE_LOG(INFO, POWER, "Close the Empty Poll\n"); + + if (ep_params != NULL) + rte_free(ep_params); +} + +int __rte_experimental +rte_power_empty_poll_stat_update(unsigned int lcore_id) +{ + struct priority_worker *poll_stats; + + if 
(lcore_id >= NUM_NODES) + return -1; + + poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); + + if (poll_stats->lcore_id == 0) + poll_stats->lcore_id = lcore_id; + + poll_stats->empty_dequeues++; + + return 0; +} + +int __rte_experimental +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt) +{ + + struct priority_worker *poll_stats; + + if (lcore_id >= NUM_NODES) + return -1; + + poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); + + if (poll_stats->lcore_id == 0) + poll_stats->lcore_id = lcore_id; + + poll_stats->num_dequeue_pkts += nb_pkt; + + return 0; +} + + +uint64_t __rte_experimental +rte_power_empty_poll_stat_fetch(unsigned int lcore_id) +{ + struct priority_worker *poll_stats; + + if (lcore_id >= NUM_NODES) + return -1; + + poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); + + if (poll_stats->lcore_id == 0) + poll_stats->lcore_id = lcore_id; + + return poll_stats->empty_dequeues; +} + +uint64_t __rte_experimental +rte_power_poll_stat_fetch(unsigned int lcore_id) +{ + struct priority_worker *poll_stats; + + if (lcore_id >= NUM_NODES) + return -1; + + poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]); + + if (poll_stats->lcore_id == 0) + poll_stats->lcore_id = lcore_id; + + return poll_stats->num_dequeue_pkts; +} diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h new file mode 100644 index 0000000..d8cbb17 --- /dev/null +++ b/lib/librte_power/rte_power_empty_poll.h @@ -0,0 +1,223 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2018 Intel Corporation + */ + +#ifndef _RTE_EMPTY_POLL_H +#define _RTE_EMPTY_POLL_H + +/** + * @file + * RTE Power Management + */ +#include +#include + +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +extern "C" { +#endif + +#define NUM_FREQS RTE_MAX_LCORE_FREQS + +#define BINS_AV 4 /* Has to be ^2 */ + +#define DROP (NUM_DIRECTIONS * NUM_DEVICES) + +#define NUM_PRIORITIES 2 + +#define NUM_NODES 256 
/* Max core number*/ + +/* Processor Power State */ +enum freq_val { + LOW, + MED, + HGH, + NUM_FREQ = NUM_FREQS +}; + + +/* Queue Polling State */ +enum queue_state { + TRAINING, /* NO TRAFFIC */ + MED_NORMAL, /* MED */ + HGH_BUSY, /* HIGH */ + LOW_PURGE, /* LOW */ +}; + +/* Queue Stats */ +struct freq_threshold { + + uint64_t base_edpi; + bool trained; + uint32_t threshold_percent; + uint32_t cur_train_iter; +}; + +/* Each Worder Thread Empty Poll Stats */ +struct priority_worker { + + /* Current dequeue and throughput counts */ + /* These 2 are written to by the worker threads */ + /* So keep them on their own cache line */ + uint64_t empty_dequeues; + uint64_t num_dequeue_pkts; + + enum queue_state queue_state; + + uint64_t empty_dequeues_prev; + uint64_t num_dequeue_pkts_prev; + + /* Used for training only */ + struct freq_threshold thresh[NUM_FREQ]; + enum freq_val cur_freq; + + /* bucket arrays to calculate the averages */ + /* edpi mean empty poll counter difference per interval */ + uint64_t edpi_av[BINS_AV]; + /* empty poll counter */ + uint32_t ec; + /* ppi mean valid poll counter per interval */ + uint64_t ppi_av[BINS_AV]; + /* valid poll counter */ + uint32_t pc; + + uint32_t lcore_id; + uint32_t iter_counter; + uint32_t threshold_ctr; + uint32_t display_ctr; + uint8_t dev_id; + +} __rte_cache_aligned; + + +struct stats_data { + + struct priority_worker wrk_stats[NUM_NODES]; + + /* flag to stop rx threads processing packets until training over */ + bool start_rx; + +}; + +/* Empty Poll Parameters */ +struct ep_params { + + /* Timer related stuff */ + uint64_t interval_ticks; + uint32_t max_train_iter; + + struct rte_timer timer0; + struct stats_data wrk_data; +}; + + +/* Sample App Init information */ +struct ep_policy { + + uint64_t med_base_edpi; + uint64_t hgh_base_edpi; + + enum queue_state state; +}; + + + +/** + * Initialize the power management system. 
+ * + * @param eptr + * the structure of empty poll configuration + * @param freq_tlb + * the power state/frequency mapping table + * @param policy + * the initialization policy from sample app + * + * @return + * - 0 on success. + * - Negative on error. + */ +int __rte_experimental +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, + struct ep_policy *policy); + +/** + * Free the resources held by power management system. + */ +void __rte_experimental +rte_power_empty_poll_stat_free(void); + +/** + * Update specific core empty poll counter + * It's not thread safe. + * + * @param lcore_id + * lcore id + * + * @return + * - 0 on success. + * - Negative on error. + */ +int __rte_experimental +rte_power_empty_poll_stat_update(unsigned int lcore_id); + +/** + * Update specific core valid poll counter, not thread safe. + * + * @param lcore_id + * lcore id. + * @param nb_pkt + * The packet number of one valid poll. + * + * @return + * - 0 on success. + * - Negative on error. + */ +int __rte_experimental +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt); + +/** + * Fetch specific core empty poll counter. + * + * @param lcore_id + * lcore id + * + * @return + * Current lcore empty poll counter value. + */ +uint64_t __rte_experimental +rte_power_empty_poll_stat_fetch(unsigned int lcore_id); + +/** + * Fetch specific core valid poll counter. + * + * @param lcore_id + * lcore id + * + * @return + * Current lcore valid poll counter value.
+ */ +uint64_t __rte_experimental +rte_power_poll_stat_fetch(unsigned int lcore_id); + +/** + * Empty poll state change detection function + * + * @param tim + * The timer structure + * @param arg + * The customized parameter + */ +void __rte_experimental +rte_empty_poll_detection(struct rte_timer *tim, void *arg); + +#ifdef __cplusplus +} +#endif + +#endif diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map index dd587df..17a083b 100644 --- a/lib/librte_power/rte_power_version.map +++ b/lib/librte_power/rte_power_version.map @@ -33,3 +33,16 @@ DPDK_18.08 { rte_power_get_capabilities; } DPDK_17.11; + +EXPERIMENTAL { + global: + + rte_empty_poll_detection; + rte_power_empty_poll_stat_fetch; + rte_power_empty_poll_stat_free; + rte_power_empty_poll_stat_init; + rte_power_empty_poll_stat_update; + rte_power_poll_stat_fetch; + rte_power_poll_stat_update; + +}; From patchwork Fri Oct 19 11:07:19 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Liang, Ma" X-Patchwork-Id: 47078 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 29CBF1B54D; Fri, 19 Oct 2018 13:07:38 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id 1FEEC1B53C for ; Fri, 19 Oct 2018 13:07:28 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Oct 2018 04:07:28 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,399,1534834800"; d="scan'208";a="79993866" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga008.fm.intel.com with ESMTP; 19 Oct 2018 04:07:26 -0700 Received: from sivswdev01.ir.intel.com 
(sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w9JB7Qf2014340; Fri, 19 Oct 2018 12:07:26 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w9JB7PkO016861; Fri, 19 Oct 2018 12:07:25 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w9JB7Pgg016857; Fri, 19 Oct 2018 12:07:25 +0100 From: Liang Ma To: david.hunt@intel.com Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com, marko.kovacevic@intel.com, Liang Ma Date: Fri, 19 Oct 2018 12:07:19 +0100 Message-Id: <1539947242-16729-2-git-send-email-liang.j.ma@intel.com> X-Mailer: git-send-email 1.7.7.4 In-Reply-To: <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> References: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> Subject: [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev"

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx -n 4 -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for full parameter usage.

The options "l", "m" and "h" set the power index for the LOW, MED and HIGH power states. They only take effect when empty-poll is enabled.

--empty-poll="training_flag, med_threshold, high_threshold"

The option training_flag is used to enable/disable training mode.

The option med_threshold is a user-supplied empty poll threshold for the modest state.

The option high_threshold is a user-supplied empty poll threshold for the busy state.
All three options default to 0.

Once empty-poll is enabled, the system applies the default parameters if no other command line options are provided.

If training mode is enabled, the user should ensure that no traffic is allowed to pass through the system. When the training phase completes, the application transfers to normal operation.

The system starts running in the modest power state. If the traffic goes above 70%, the system moves to the High power state. If the traffic drops below 30%, the system falls back to the modest power state.

The example code uses the master thread to monitor worker thread busyness. The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.
v10 update due to review comments.
v11 add checking for empty poll init function return value.

Signed-off-by: Liang Ma
Reviewed-by: Lei Yao
Acked-by: David Hunt
---
examples/l3fwd-power/Makefile | 3 +
examples/l3fwd-power/main.c | 346 +++++++++++++++++++++++++++++++++++++--
examples/l3fwd-power/meson.build | 1 +
3 files changed, 333 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile index d7e39a3..772ec7b 100644 --- a/examples/l3fwd-power/Makefile +++ b/examples/l3fwd-power/Makefile @@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk) LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk) LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk) +CFLAGS += -DALLOW_EXPERIMENTAL_API + build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) @@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable) all: else +CFLAGS += -DALLOW_EXPERIMENTAL_API CFLAGS += -O3 CFLAGS += $(WERROR_FLAGS) diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index 68527d2..c07eeff 100644 --- a/examples/l3fwd-power/main.c +++
b/examples/l3fwd-power/main.c @@ -43,6 +43,7 @@ #include #include #include +#include #include "perf_core.h" #include "main.h" @@ -55,6 +56,8 @@ /* 100 ms interval */ #define TIMER_NUMBER_PER_SECOND 10 +/* (10ms) */ +#define INTERVALS_PER_SECOND 100 /* 100000 us */ #define SCALING_PERIOD (1000000/TIMER_NUMBER_PER_SECOND) #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25 @@ -117,6 +120,17 @@ */ #define RTE_TEST_RX_DESC_DEFAULT 1024 #define RTE_TEST_TX_DESC_DEFAULT 1024 + +/* + * These two thresholds were decided on by running the training algorithm on + * a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero values + * for the med_threshold and high_threshold parameters on the command line. + */ +#define EMPTY_POLL_MED_THRESHOLD 350000UL +#define EMPTY_POLL_HGH_THRESHOLD 580000UL + + + static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT; static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT; @@ -132,6 +146,14 @@ static uint32_t enabled_port_mask = 0; static int promiscuous_on = 0; /* NUMA is enabled by default. */ static int numa_on = 1; +/* emptypoll is disabled by default. */ +static bool empty_poll_on; +static bool empty_poll_train; +volatile bool empty_poll_stop; +static struct ep_params *ep_params; +static struct ep_policy policy; +static long ep_med_edpi, ep_hgh_edpi; + static int parse_ptype; /**< Parse packet type using rx callback, and */ /**< disabled by default */ @@ -330,6 +352,19 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count); static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \ unsigned int lcore_id, uint16_t port_id, uint16_t queue_id); + +/* + * These defaults are using the max frequency index (1), a medium index (9) + * and a typical low frequency index (14). These can be adjusted to use + * different indexes using the relevant command line parameters. 
+ */ +static uint8_t freq_tlb[] = {14, 9, 1}; + +static int is_done(void) +{ + return empty_poll_stop; +} + /* exit signal handler */ static void signal_exit_now(int sigtype) @@ -338,7 +373,15 @@ signal_exit_now(int sigtype) unsigned int portid; int ret; + RTE_SET_USED(lcore_id); + RTE_SET_USED(portid); + RTE_SET_USED(ret); + if (sigtype == SIGINT) { + if (empty_poll_on) + empty_poll_stop = true; + + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { if (rte_lcore_is_enabled(lcore_id) == 0) continue; @@ -351,16 +394,19 @@ signal_exit_now(int sigtype) "core%u\n", lcore_id); } - RTE_ETH_FOREACH_DEV(portid) { - if ((enabled_port_mask & (1 << portid)) == 0) - continue; + if (!empty_poll_on) { + RTE_ETH_FOREACH_DEV(portid) { + if ((enabled_port_mask & (1 << portid)) == 0) + continue; - rte_eth_dev_stop(portid); - rte_eth_dev_close(portid); + rte_eth_dev_stop(portid); + rte_eth_dev_close(portid); + } } } - rte_exit(EXIT_SUCCESS, "User forced exit\n"); + if (!empty_poll_on) + rte_exit(EXIT_SUCCESS, "User forced exit\n"); } /* Freqency scale down timer callback */ @@ -825,7 +871,110 @@ static int event_register(struct lcore_conf *qconf) return 0; } +/* main processing loop */ +static int +main_empty_poll_loop(__attribute__((unused)) void *dummy) +{ + struct rte_mbuf *pkts_burst[MAX_PKT_BURST]; + unsigned int lcore_id; + uint64_t prev_tsc, diff_tsc, cur_tsc; + int i, j, nb_rx; + uint8_t queueid; + uint16_t portid; + struct lcore_conf *qconf; + struct lcore_rx_queue *rx_queue; + + const uint64_t drain_tsc = + (rte_get_tsc_hz() + US_PER_S - 1) / + US_PER_S * BURST_TX_DRAIN_US; + + prev_tsc = 0; + + lcore_id = rte_lcore_id(); + qconf = &lcore_conf[lcore_id]; + + if (qconf->n_rx_queue == 0) { + RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", + lcore_id); + return 0; + } + + for (i = 0; i < qconf->n_rx_queue; i++) { + portid = qconf->rx_queue_list[i].port_id; + queueid = qconf->rx_queue_list[i].queue_id; + RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u " 
+ "rxqueueid=%hhu\n", lcore_id, portid, queueid); + } + + while (!is_done()) { + stats[lcore_id].nb_iteration_looped++; + + cur_tsc = rte_rdtsc(); + /* + * TX burst queue drain + */ + diff_tsc = cur_tsc - prev_tsc; + if (unlikely(diff_tsc > drain_tsc)) { + for (i = 0; i < qconf->n_tx_port; ++i) { + portid = qconf->tx_port_id[i]; + rte_eth_tx_buffer_flush(portid, + qconf->tx_queue_id[portid], + qconf->tx_buffer[portid]); + } + prev_tsc = cur_tsc; + } + + /* + * Read packet from RX queues + */ + for (i = 0; i < qconf->n_rx_queue; ++i) { + rx_queue = &(qconf->rx_queue_list[i]); + rx_queue->idle_hint = 0; + portid = rx_queue->port_id; + queueid = rx_queue->queue_id; + + nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, + MAX_PKT_BURST); + + stats[lcore_id].nb_rx_processed += nb_rx; + + if (nb_rx == 0) { + + rte_power_empty_poll_stat_update(lcore_id); + + continue; + } else { + rte_power_poll_stat_update(lcore_id, nb_rx); + } + + + /* Prefetch first packets */ + for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) { + rte_prefetch0(rte_pktmbuf_mtod( + pkts_burst[j], void *)); + } + + /* Prefetch and forward already prefetched packets */ + for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) { + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[ + j + PREFETCH_OFFSET], + void *)); + l3fwd_simple_forward(pkts_burst[j], portid, + qconf); + } + + /* Forward remaining prefetched packets */ + for (; j < nb_rx; j++) { + l3fwd_simple_forward(pkts_burst[j], portid, + qconf); + } + + } + } + + return 0; +} /* main processing loop */ static int main_loop(__attribute__((unused)) void *dummy) @@ -1127,7 +1276,9 @@ print_usage(const char *prgname) " --no-numa: optional, disable numa awareness\n" " --enable-jumbo: enable jumbo frame" " which max packet len is PKTLEN in decimal (64-9600)\n" - " --parse-ptype: parse packet type by software\n", + " --parse-ptype: parse packet type by software\n" + " --empty-poll: enable empty poll detection" + " follow (training_flag, med_threshold,
high_threshold)\n", prgname); } @@ -1220,7 +1371,55 @@ parse_config(const char *q_arg) return 0; } +static int +parse_ep_config(const char *q_arg) +{ + char s[256]; + const char *p = q_arg; + char *end; + int num_arg; + + char *str_fld[3]; + + int training_flag; + int med_edpi; + int hgh_edpi; + + ep_med_edpi = EMPTY_POLL_MED_THRESHOLD; + ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD; + + snprintf(s, sizeof(s), "%s", p); + + num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ','); + + empty_poll_train = false; + if (num_arg == 0) + return 0; + + if (num_arg == 3) { + + training_flag = strtoul(str_fld[0], &end, 0); + med_edpi = strtoul(str_fld[1], &end, 0); + hgh_edpi = strtoul(str_fld[2], &end, 0); + + if (training_flag == 1) + empty_poll_train = true; + + if (med_edpi > 0) + ep_med_edpi = med_edpi; + + if (hgh_edpi > 0) + ep_hgh_edpi = hgh_edpi; + + } else { + + return -1; + } + + return 0; + +} #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype" /* Parse the argument given in the command line of the application */ @@ -1230,6 +1429,7 @@ parse_args(int argc, char **argv) int opt, ret; char **argvopt; int option_index; + uint32_t limit; char *prgname = argv[0]; static struct option lgopts[] = { {"config", 1, 0, 0}, @@ -1237,13 +1437,14 @@ parse_args(int argc, char **argv) {"high-perf-cores", 1, 0, 0}, {"no-numa", 0, 0, 0}, {"enable-jumbo", 0, 0, 0}, + {"empty-poll", 1, 0, 0}, {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0}, {NULL, 0, 0, 0} }; argvopt = argv; - while ((opt = getopt_long(argc, argvopt, "p:P", + while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P", lgopts, &option_index)) != EOF) { switch (opt) { @@ -1260,7 +1461,18 @@ parse_args(int argc, char **argv) printf("Promiscuous mode selected\n"); promiscuous_on = 1; break; - + case 'l': + limit = parse_max_pkt_len(optarg); + freq_tlb[LOW] = limit; + break; + case 'm': + limit = parse_max_pkt_len(optarg); + freq_tlb[MED] = limit; + break; + case 'h': + limit = parse_max_pkt_len(optarg); + freq_tlb[HGH] = limit; + break; /* long options
*/ case 0: if (!strncmp(lgopts[option_index].name, "config", 6)) { @@ -1299,6 +1511,20 @@ parse_args(int argc, char **argv) } if (!strncmp(lgopts[option_index].name, + "empty-poll", 10)) { + printf("empty-poll is enabled\n"); + empty_poll_on = true; + ret = parse_ep_config(optarg); + + if (ret) { + printf("invalid empty poll config\n"); + print_usage(prgname); + return -1; + } + + } + + if (!strncmp(lgopts[option_index].name, "enable-jumbo", 12)) { struct option lenopts = {"max-pkt-len", required_argument, \ @@ -1646,6 +1872,59 @@ init_power_library(void) } return ret; } +static void +empty_poll_setup_timer(void) +{ + int lcore_id = rte_lcore_id(); + uint64_t hz = rte_get_timer_hz(); + + struct ep_params *ep_ptr = ep_params; + + ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND; + + rte_timer_reset_sync(&ep_ptr->timer0, + ep_ptr->interval_ticks, + PERIODICAL, + lcore_id, + rte_empty_poll_detection, + (void *)ep_ptr); + +} +static int +launch_timer(unsigned int lcore_id) +{ + int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms; + + RTE_SET_USED(lcore_id); + + + if (rte_get_master_lcore() != lcore_id) { + rte_panic("timer on lcore:%d which is not master core:%d\n", + lcore_id, + rte_get_master_lcore()); + } + + RTE_LOG(INFO, POWER, "Bring up the Timer\n"); + + empty_poll_setup_timer(); + + cycles_10ms = rte_get_timer_hz() / 100; + + while (!is_done()) { + cur_tsc = rte_rdtsc(); + diff_tsc = cur_tsc - prev_tsc; + if (diff_tsc > cycles_10ms) { + rte_timer_manage(); + prev_tsc = cur_tsc; + cycles_10ms = rte_get_timer_hz() / 100; + } + } + + RTE_LOG(INFO, POWER, "Timer_subsystem is done\n"); + + return 0; +} + int main(int argc, char **argv) @@ -1828,13 +2107,15 @@ main(int argc, char **argv) if (rte_lcore_is_enabled(lcore_id) == 0) continue; - /* init timer structures for each enabled lcore */ - rte_timer_init(&power_timers[lcore_id]); - hz = rte_get_timer_hz(); - rte_timer_reset(&power_timers[lcore_id], - hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id, - 
power_timer_cb, NULL); - + if (empty_poll_on == false) { + /* init timer structures for each enabled lcore */ + rte_timer_init(&power_timers[lcore_id]); + hz = rte_get_timer_hz(); + rte_timer_reset(&power_timers[lcore_id], + hz/TIMER_NUMBER_PER_SECOND, + SINGLE, lcore_id, + power_timer_cb, NULL); + } qconf = &lcore_conf[lcore_id]; printf("\nInitializing rx queues on lcore %u ... ", lcore_id ); fflush(stdout); @@ -1905,12 +2186,43 @@ main(int argc, char **argv) check_all_ports_link_status(enabled_port_mask); + if (empty_poll_on == true) { + + if (empty_poll_train) { + policy.state = TRAINING; + } else { + policy.state = MED_NORMAL; + policy.med_base_edpi = ep_med_edpi; + policy.hgh_base_edpi = ep_hgh_edpi; + } + + ret = rte_power_empty_poll_stat_init(&ep_params, + freq_tlb, + &policy); + if (ret < 0) + rte_exit(EXIT_FAILURE, "empty poll init failed"); + } + + /* launch per-lcore init on every lcore */ - rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER); + if (empty_poll_on == false) { + rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER); + } else { + empty_poll_stop = false; + rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, + SKIP_MASTER); + } + + if (empty_poll_on == true) + launch_timer(rte_lcore_id()); + RTE_LCORE_FOREACH_SLAVE(lcore_id) { if (rte_eal_wait_lcore(lcore_id) < 0) return -1; } + if (empty_poll_on) + rte_power_empty_poll_stat_free(); + return 0; } diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build index 20c8054..a3c5c2f 100644 --- a/examples/l3fwd-power/meson.build +++ b/examples/l3fwd-power/meson.build @@ -9,6 +9,7 @@ if host_machine.system() != 'linux' build = false endif +allow_experimental_apis = true deps += ['power', 'timer', 'lpm', 'hash'] sources = files( 'main.c', 'perf_core.c' From patchwork Fri Oct 19 11:07:20 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Liang, Ma" X-Patchwork-Id: 47079 X-Patchwork-Delegate: 
thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 3246E1B55B; Fri, 19 Oct 2018 13:07:40 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id 1B5581B4E8 for ; Fri, 19 Oct 2018 13:07:29 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Oct 2018 04:07:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,399,1534834800"; d="scan'208";a="79993869" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga008.fm.intel.com with ESMTP; 19 Oct 2018 04:07:28 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w9JB7RWN014344; Fri, 19 Oct 2018 12:07:27 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w9JB7R1Q016873; Fri, 19 Oct 2018 12:07:27 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w9JB7RwV016869; Fri, 19 Oct 2018 12:07:27 +0100 From: Liang Ma To: david.hunt@intel.com Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com, marko.kovacevic@intel.com, Liang Ma Date: Fri, 19 Oct 2018 12:07:20 +0100 Message-Id: <1539947242-16729-3-git-send-email-liang.j.ma@intel.com> X-Mailer: git-send-email 1.7.7.4 In-Reply-To: <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> References: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> Subject: [dpdk-dev] [PATCH v12 3/5] doc/guides/pro_guide/power-man: update the power API X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Update the document for empty poll API. Change Logs: v9: minor changes for syntax. Update document. Signed-off-by: Liang Ma Acked-by: David Hunt --- doc/guides/prog_guide/power_man.rst | 86 +++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index eba1cc6..68b7e8b 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -106,6 +106,92 @@ User Cases The power management mechanism is used to save power when performing L3 forwarding. + +Empty Poll API +-------------- + +Abstract +~~~~~~~~ + +For packet processing workloads such as DPDK, polling is continuous. +This means CPU cores always show 100% busy independent of how much work +those cores are doing. Accurately determining how busy a core is matters +because, without that information: + + * There is no indication of overload conditions + * The user does not know how much real load is on a system, resulting + in wasted energy as no power management is utilized + +Compared to the original l3fwd-power design, instead of going to sleep +after detecting an empty poll, the new mechanism just lowers the core frequency. +As a result, the application does not stop polling the device, which leads +to improved handling of bursts of traffic. + +When the system becomes busy, the empty poll mechanism can also increase the core +frequency (including turbo) to make a best effort for intensive traffic. This gives +us more flexible and balanced traffic awareness than the standard l3fwd-power +application. + + +Proposed Solution +~~~~~~~~~~~~~~~~~ +The proposed solution focuses on how many times empty polls are executed. +A low number of empty polls means the current core is busy processing its +workload, so a higher frequency is needed.
A high empty poll number +indicates the current core is not doing any real work, so we can lower the +frequency to save power. + +In the current implementation, each core has 1 empty-poll counter, which assumes +1 core is dedicated to 1 queue. This will need to be expanded in the future to +support multiple queues per core. + +Power state definition: +^^^^^^^^^^^^^^^^^^^^^^^ + +* LOW: Not currently used, reserved for future use. + +* MED: the frequency used to process a modest traffic workload. + +* HIGH: the frequency used to process a busy traffic workload. + +There are two phases to establish the power management system: +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +* Training phase. This phase is used to measure the optimal frequency + change thresholds for a given system. The thresholds will differ from + system to system due to differences in processor micro-architecture, + cache and device configurations. + In this phase, the user must ensure that no traffic can enter the + system so that counts can be measured for empty polls at low, medium + and high frequencies. Each frequency is measured for two seconds. + Once the training phase is complete, the threshold numbers are + displayed, normal mode begins, and traffic can be allowed into + the system. These threshold numbers can be used on the command line + when starting the application in normal mode to avoid re-training + every time. + +* Normal phase. Every 10ms the run-time counters are compared + to the supplied threshold values, and a decision is made + whether to move to a different power state (by adjusting the + frequency). + +API Overview for Empty Poll Power Management +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +* **State Init**: initialize the power management system. + +* **State Free**: free the resources held by the power management system. + +* **Update Empty Poll Counter**: update the empty poll counter.
+ +* **Update Valid Poll Counter**: update the valid poll counter. + +* **Set the Frequency Index**: update the power state/frequency mapping. + +* **Detect empty poll state change**: run the empty poll state change detection algorithm, then take action. + +User Cases +---------- +The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA. + References ---------- From patchwork Fri Oct 19 11:07:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Liang, Ma" X-Patchwork-Id: 47080 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id B0CEF1B55C; Fri, 19 Oct 2018 13:07:48 +0200 (CEST) Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id C03551B564 for ; Fri, 19 Oct 2018 13:07:45 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Oct 2018 04:07:44 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,399,1534834800"; d="scan'208";a="89578841" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by FMSMGA003.fm.intel.com with ESMTP; 19 Oct 2018 04:07:43 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w9JB7hZR014352; Fri, 19 Oct 2018 12:07:43 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w9JB7gIi017070; Fri, 19 Oct 2018 12:07:42 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w9JB7gDE017064; Fri, 19 Oct 2018 12:07:42 +0100 From: Liang Ma To: david.hunt@intel.com Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com,
marko.kovacevic@intel.com, Liang Ma Date: Fri, 19 Oct 2018 12:07:21 +0100 Message-Id: <1539947242-16729-4-git-send-email-liang.j.ma@intel.com> X-Mailer: git-send-email 1.7.7.4 In-Reply-To: <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> References: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v12 4/5] doc/guides/sample_app_ug/l3_forward_power_man.rst: update X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Add empty poll mode command line example ChangeLogs: v9: update the document Signed-off-by: Liang Ma Acked-by: David Hunt --- doc/guides/sample_app_ug/l3_forward_power_man.rst | 69 +++++++++++++++++++++++ 1 file changed, 69 insertions(+) diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst index 795a570..e44a11b 100644 --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst @@ -105,6 +105,8 @@ where, * --no-numa: optional, disables numa awareness +* --empty-poll: Traffic Aware power management. See below for details + See :doc:`l3_forward` for details. The L3fwd-power example reuses the L3fwd command line options. @@ -362,3 +364,70 @@ The algorithm has the following sleeping behavior depending on the idle counter: If a thread polls multiple Rx queues and different queue returns different sleep duration values, the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time in order to avoid a potential performance impact.
+ +Empty Poll Mode +------------------------- +Additionally, there is a traffic aware mode of operation called "Empty +Poll" where the number of empty polls can be monitored to keep track +of how busy the application is. Empty poll mode can be enabled by the +command line option --empty-poll. + +See the :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK Programmer's Guide for empty poll mode details. + +.. code-block:: console + + ./l3fwd-power -l xxx -n 4 -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1 + +Where, + +--empty-poll: Enable the empty poll mode instead of the original algorithm + +--empty-poll="training_flag, med_threshold, high_threshold" + +* ``training_flag`` : optional, enable/disable training mode. Default value is 0. If the training_flag is set to 1 (true), the application will start in training mode and print out the trained threshold values. If the training_flag is set to 0 (false), the application will start in normal mode, and will use either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user’s system and may give a better power profile than the default threshold values. + +* ``med_threshold`` : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000. + +* ``high_threshold`` : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000. + +* -l : optional, set up the LOW power state frequency index + +* -m : optional, set up the MED power state frequency index + +* -h : optional, set up the HIGH power state frequency index + +Empty Poll Mode Example Usage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To initially obtain the ideal thresholds for the system, the training +mode should be run first.
This is achieved by running the l3fwd-power +app with the training flag set to “1”, and the other parameters set to +0. + +.. code-block:: console + + ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P + +This will run the training algorithm for x seconds on each core (cores 2 +and 3), and then print out the recommended threshold values for those +cores. The thresholds should be very similar for each core. + +.. code-block:: console + + POWER: Bring up the Timer + POWER: set the power freq to MED + POWER: Low threshold is 230277 + POWER: MED threshold is 335071 + POWER: HIGH threshold is 523769 + POWER: Training is Complete for 2 + POWER: set the power freq to MED + POWER: Low threshold is 236814 + POWER: MED threshold is 344567 + POWER: HIGH threshold is 538580 + POWER: Training is Complete for 3 + +Once the values have been measured for a particular system, the app can +then be started without the training mode so traffic can start immediately. + +..
code-block:: console + + ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P From patchwork Fri Oct 19 11:07:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Liang, Ma" X-Patchwork-Id: 47081 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 446C41B56E; Fri, 19 Oct 2018 13:07:55 +0200 (CEST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 19D6B1B569 for ; Fri, 19 Oct 2018 13:07:49 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Oct 2018 04:07:49 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.54,399,1534834800"; d="scan'208";a="82755941" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga008.jf.intel.com with ESMTP; 19 Oct 2018 04:07:47 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w9JB7kBg014355; Fri, 19 Oct 2018 12:07:46 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w9JB7kLi017114; Fri, 19 Oct 2018 12:07:46 +0100 Received: (from lma25@localhost) by sivswdev01.ir.intel.com with LOCAL id w9JB7kK4017110; Fri, 19 Oct 2018 12:07:46 +0100 From: Liang Ma To: david.hunt@intel.com Cc: dev@dpdk.org, lei.a.yao@intel.com, ktraynor@redhat.com, marko.kovacevic@intel.com, Liang Ma Date: Fri, 19 Oct 2018 12:07:22 +0100 Message-Id: <1539947242-16729-5-git-send-email-liang.j.ma@intel.com> X-Mailer: git-send-email 1.7.7.4 In-Reply-To:
<1539947242-16729-1-git-send-email-liang.j.ma@intel.com> References: <1539944630-21625-1-git-send-email-liang.j.ma@intel.com> <1539947242-16729-1-git-send-email-liang.j.ma@intel.com> Subject: [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Update the release notes for the Traffic Pattern Aware Control Library (empty poll). Signed-off-by: Liang Ma Acked-by: Marko Kovacevic --- doc/guides/rel_notes/release_18_11.rst | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst index a8327ea..5efc74e 100644 --- a/doc/guides/rel_notes/release_18_11.rst +++ b/doc/guides/rel_notes/release_18_11.rst @@ -97,6 +97,16 @@ New Features the SW eventdev PMD, sacrifices load balancing performance to gain better event scheduling throughput and scalability. +* **Added Traffic Pattern Aware Power Control Library** + + Added an experimental library. This extends the Power Library and provides + empty_poll APIs. This feature measures how many times empty polls are + executed per core and uses the number of empty polls as a hint for system + power management. + + See the :doc:`../prog_guide/power_man` section of the DPDK Programmer's + Guide document for more information. + * **Added ability to switch queue deferred start flag on testpmd app.** Added a console command to testpmd app, giving ability to switch @@ -104,7 +114,6 @@ New Features the specified port. The port must be stopped before the command call in order to reconfigure queues. - API Changes ----------- @@ -118,6 +127,16 @@ API Changes Also, make sure to start the actual text at the margin.
========================================================= +* power: Traffic Pattern Aware Control APIs (marked as experimental): + + - ``rte_power_empty_poll_stat_init`` + - ``rte_power_empty_poll_stat_free`` + - ``rte_power_empty_poll_stat_update`` + - ``rte_power_empty_poll_stat_fetch`` + - ``rte_power_poll_stat_update`` + - ``rte_power_poll_stat_fetch`` + - ``rte_empty_poll_detection`` + * mbuf: The ``__rte_mbuf_raw_free()`` and ``__rte_pktmbuf_prefree_seg()`` functions were deprecated since 17.05 and are replaced by ``rte_mbuf_raw_free()`` and ``rte_pktmbuf_prefree_seg()``.
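As a rough illustration of the MED/HIGH switching rule described in the 1/5 commit message (move up when traffic goes above 70%, fall back when it drops below 30%), the decision can be sketched in standalone C. This is not the library's implementation: `ep_state`, `ep_busy_percent()` and `ep_next_state()` are hypothetical names, and deriving busyness as the shortfall of empty polls against a trained baseline is an assumption made only for this example.

```c
#include <stdint.h>

/* Power states named after the patch; LOW is reserved and unused here. */
enum ep_state { EP_LOW, EP_MED, EP_HIGH };

/* Switching points quoted in the 1/5 commit message. */
#define EP_BUSY_UP_PCT   70u
#define EP_BUSY_DOWN_PCT 30u

/*
 * Hypothetical busyness estimate (an assumption, not the library formula):
 * the fewer empty polls seen in an interval relative to the baseline
 * measured with no traffic, the busier the core is taken to be.
 */
static uint32_t
ep_busy_percent(uint64_t empty_polls, uint64_t base_empty_polls)
{
	if (base_empty_polls == 0 || empty_polls >= base_empty_polls)
		return 0;
	return (uint32_t)(100u - (empty_polls * 100u) / base_empty_polls);
}

/* Pick the next state; anything between the two thresholds keeps the state. */
static enum ep_state
ep_next_state(enum ep_state cur, uint32_t busy_pct)
{
	if (cur == EP_MED && busy_pct > EP_BUSY_UP_PCT)
		return EP_HIGH;
	if (cur == EP_HIGH && busy_pct < EP_BUSY_DOWN_PCT)
		return EP_MED;
	return cur;
}
```

The gap between the two thresholds gives hysteresis, so a load that hovers around a single value does not change the frequency on every 10ms timer tick.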