[v6,05/10] power: add PMD power management API and callback

Message ID efeaf2e7e5577eaf1879749b638ced8eaa21a5c1.1602682206.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v6,01/10] eal: add new x86 cpuid support for WAITPKG

Checks

Context        Check     Description
ci/checkpatch  success   coding style OK

Commit Message

Anatoly Burakov Oct. 14, 2020, 1:30 p.m. UTC
  From: Liang Ma <liang.j.ma@intel.com>

Add a simple on/off switch that will enable saving power when no
packets are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering an
architecture-defined optimized power state that will wait either
until a TSC timestamp expires or until packets arrive.

This API mandates a core-to-single-queue mapping (that is, multiple
queues per device are supported, but they have to be polled on different
cores).

This design uses PMD RX callbacks.

1. UMWAIT/UMONITOR:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting on the address of the next
   RX descriptor to be written to.

2. Pause instruction

   Instead of moving the core into a deeper C-state, this method uses
   the pause instruction to avoid busy polling.

3. Frequency scaling
   Reuse the existing DPDK power library to scale core frequency up/down
   depending on traffic volume.
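
As an illustration, a minimal usage sketch from the application's point
of view (port_id/queue_id are placeholders; the enabled queue must then
be polled from that same lcore):

	int ret;

	/* after the port and RX queue are set up and started */
	ret = rte_power_pmd_mgmt_queue_enable(rte_lcore_id(), port_id,
			queue_id, RTE_POWER_MGMT_TYPE_WAIT);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "Cannot enable PMD power management\n");

	/* ... rte_eth_rx_burst(port_id, queue_id, ...) from this lcore ... */

	/* not thread-safe - stop polling the queue before disabling */
	rte_power_pmd_mgmt_queue_disable(rte_lcore_id(), port_id, queue_id);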

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v6:
    - Added wakeup mechanism for UMWAIT
    - Removed memory allocation (everything is now allocated statically)
    - Fixed various typos and comments
    - Check for invalid queue ID
    - Moved release notes to this patch
    
    v5:
    - Make error checking more robust
      - Prevent initializing scaling if ACPI or PSTATE env wasn't set
      - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
    - Add some debug logging
    - Replace x86-specific code path to generic path using the intrinsic check

 doc/guides/rel_notes/release_20_11.rst |  11 +
 lib/librte_power/meson.build           |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c  | 300 +++++++++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h  |  92 ++++++++
 lib/librte_power/rte_power_version.map |   4 +
 5 files changed, 410 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
  

Comments

Hunt, David Oct. 14, 2020, 2:19 p.m. UTC | #1
On 14/10/2020 2:30 PM, Anatoly Burakov wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will wait either
> until a TSC timestamp expires or until packets arrive.
>
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
>
> This design uses PMD RX callbacks.
>
> 1. UMWAIT/UMONITOR:
>
>     When a certain threshold of empty polls is reached, the core will go
>     into a power-optimized sleep while waiting on the address of the next
>     RX descriptor to be written to.
>
> 2. Pause instruction
>
>     Instead of moving the core into a deeper C-state, this method uses
>     the pause instruction to avoid busy polling.
>
> 3. Frequency scaling
>     Reuse the existing DPDK power library to scale core frequency up/down
>     depending on traffic volume.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---


Hi Liang, Anatoly, great work on the patch set.


Acked-by: David Hunt <david.hunt@intel.com>
  
Ananyev, Konstantin Oct. 14, 2020, 6:41 p.m. UTC | #2
> From: Liang Ma <liang.j.ma@intel.com>
> 
> Add a simple on/off switch that will enable saving power when no
> packets are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering an
> architecture-defined optimized power state that will wait either
> until a TSC timestamp expires or until packets arrive.
> 
> This API mandates a core-to-single-queue mapping (that is, multiple
> queues per device are supported, but they have to be polled on different
> cores).
> 
> This design uses PMD RX callbacks.
> 
> 1. UMWAIT/UMONITOR:
> 
>    When a certain threshold of empty polls is reached, the core will go
>    into a power-optimized sleep while waiting on the address of the next
>    RX descriptor to be written to.
> 
> 2. Pause instruction
> 
>    Instead of moving the core into a deeper C-state, this method uses
>    the pause instruction to avoid busy polling.
> 
> 3. Frequency scaling
>    Reuse the existing DPDK power library to scale core frequency up/down
>    depending on traffic volume.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v6:
>     - Added wakeup mechanism for UMWAIT
>     - Removed memory allocation (everything is now allocated statically)
>     - Fixed various typos and comments
>     - Check for invalid queue ID
>     - Moved release notes to this patch
> 
>     v5:
>     - Make error checking more robust
>       - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>       - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>     - Add some debug logging
>     - Replace x86-specific code path to generic path using the intrinsic check
> 
>  doc/guides/rel_notes/release_20_11.rst |  11 +
>  lib/librte_power/meson.build           |   5 +-
>  lib/librte_power/rte_power_pmd_mgmt.c  | 300 +++++++++++++++++++++++++
>  lib/librte_power/rte_power_pmd_mgmt.h  |  92 ++++++++
>  lib/librte_power/rte_power_version.map |   4 +
>  5 files changed, 410 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
>  create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
> 
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index ca4f43f7f9..06b822aa36 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -197,6 +197,17 @@ New Features
>    * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
>      which can process up to 32 flows in parallel. Requires AVX512 support.
> 
> +* **Add PMD power management mechanism**
> +
> +  Three new Ethernet PMD power management mechanisms are added through the
> +  existing RX callback infrastructure.
> +
> +  * Add power saving scheme based on UMWAIT instruction (x86 only)
> +  * Add power saving scheme based on ``rte_pause()``
> +  * Add power saving scheme based on frequency scaling through the power library
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
> +  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 78c031c943..cc3c7a8646 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -9,6 +9,7 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>  		'power_kvm_vm.c', 'guest_channel.c',
>  		'rte_power_empty_poll.c',
>  		'power_pstate_cpufreq.c',
> +		'rte_power_pmd_mgmt.c',
>  		'power_common.c')
> -headers = files('rte_power.h','rte_power_empty_poll.h')
> -deps += ['timer']
> +headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
> +deps += ['timer' ,'ethdev']
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
> new file mode 100644
> index 0000000000..2b7d2a1a46
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.c
> @@ -0,0 +1,300 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_cpuflags.h>
> +#include <rte_malloc.h>
> +#include <rte_ethdev.h>
> +#include <rte_power_intrinsics.h>
> +
> +#include "rte_power_pmd_mgmt.h"
> +
> +#define EMPTYPOLL_MAX  512
> +
> +/**
> + * Possible power management states of an ethdev port.
> + */
> +enum pmd_mgmt_state {
> +	/** Device power management is disabled. */
> +	PMD_MGMT_DISABLED = 0,
> +	/** Device power management is enabled. */
> +	PMD_MGMT_ENABLED,
> +};
> +
> +struct pmd_queue_cfg {
> +	enum pmd_mgmt_state pwr_mgmt_state;
> +	/**< State of power management for this queue */
> +	enum rte_power_pmd_mgmt_type cb_mode;
> +	/**< Callback mode for this queue */
> +	const struct rte_eth_rxtx_callback *cur_cb;
> +	/**< Callback instance */
> +	rte_spinlock_t umwait_lock;
> +	/**< Per-queue status lock - used only for UMWAIT mode */
> +	volatile void *wait_addr;
> +	/**< UMWAIT wakeup address */
> +	uint64_t empty_poll_stats;
> +	/**< Number of empty polls */
> +} __rte_cache_aligned;
> +
> +static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +
> +/* trigger a write to the cache line we're waiting on */
> +static void umwait_wakeup(volatile void *addr)
> +{
> +	uint64_t val;
> +
> +	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
> +	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
> +			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
> +}
> +
> +static uint16_t
> +clb_umwait(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> +{
> +
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +			volatile void *target_addr;
> +			uint64_t expected, mask;
> +			uint16_t ret;
> +			uint8_t data_sz;
> +
> +			/*
> +			 * get address of next descriptor in the RX
> +			 * ring for this queue, as well as expected
> +			 * value and a mask.
> +			 */
> +			ret = rte_eth_get_wake_addr(port_id, qidx,
> +					&target_addr, &expected, &mask,
> +					&data_sz);
> +			if (ret == 0) {
> +				/*
> +				 * we need to ensure we can wake up by another
> +				 * thread triggering a write, so we need the
> +				 * address to always be up to date.
> +				 */
> +				rte_spinlock_lock(&q_conf->umwait_lock);


I think you need to check the state here, and _disable() has to set the state with the lock grabbed.
Otherwise this lock wouldn't protect you from race conditions.
As an example:

CP@T0:
	rte_spinlock_lock(&queue_cfg->umwait_lock);
	if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
	rte_spinlock_unlock(&queue_cfg->umwait_lock);

DP@T1:
	rte_spinlock_lock(&queue_cfg->umwait_lock);
	queue_cfg->wait_addr = target_addr;
	monitor_sync(...);  // DP was put to sleep

CP@T2:
	queue_cfg->cur_cb = NULL;
	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
	ret = 0;

rte_power_pmd_mgmt_queue_disable() finished with success,
but the DP core wasn't woken up.

To be more specific:
clb_umwait(...) {
	...
	lock(&qcfg->lck);
	if (qcfg->state == ENABLED)  {
		qcfg->wake_addr = addr;
		monitor_sync(addr, ...,&qcfg->lck);
	}
	unlock(&qcfg->lck); 
	...
}

_disable(...) {
	...
	lock(&qcfg->lck);
	qcfg->state = DISABLED;
	if (qcfg->wake_addr != NULL)
		monitor_wakeup(qcfg->wake_addr);
	unlock(&qcfg->lck);
	...
}
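
In terms of this patch's names, the _disable() path would then look
roughly like this (untested):

	rte_spinlock_lock(&queue_cfg->umwait_lock);
	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
	/* wake up the core from UMWAIT sleep, if any */
	if (queue_cfg->wait_addr != NULL)
		umwait_wakeup(queue_cfg->wait_addr);
	rte_spinlock_unlock(&queue_cfg->umwait_lock);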

> +				q_conf->wait_addr = target_addr;
> +				/* -1ULL is maximum value for TSC */
> +				rte_power_monitor_sync(target_addr, expected,
> +						mask, -1ULL, data_sz,
> +						&q_conf->umwait_lock);
> +				/* erase the address */
> +				q_conf->wait_addr = NULL;
> +				rte_spinlock_unlock(&q_conf->umwait_lock);
> +			}
> +		}
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +clb_pause(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		/* sleep for 1 microsecond */
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> +			rte_delay_us(1);
> +	} else
> +		q_conf->empty_poll_stats = 0;
> +
> +	return nb_rx;
> +}
> +
> +static uint16_t
> +clb_scale_freq(uint16_t port_id, uint16_t qidx,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +{
> +	struct pmd_queue_cfg *q_conf;
> +
> +	q_conf = &port_cfg[port_id][qidx];
> +
> +	if (unlikely(nb_rx == 0)) {
> +		q_conf->empty_poll_stats++;
> +		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> +			/* scale down freq */
> +			rte_power_freq_min(rte_lcore_id());
> +	} else {
> +		q_conf->empty_poll_stats = 0;
> +		/* scale up freq */
> +		rte_power_freq_max(rte_lcore_id());
> +	}
> +
> +	return nb_rx;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id,
> +		enum rte_power_pmd_mgmt_type mode)
> +{
> +	struct rte_eth_dev *dev;
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> +	dev = &rte_eth_devices[port_id];
> +
> +	/* check if queue id is valid */
> +	if (queue_id >= dev->data->nb_rx_queues ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		return -EINVAL;
> +	}
> +
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +	{
> +		/* check if rte_power_monitor is supported */
> +		uint64_t dummy_expected, dummy_mask;
> +		struct rte_cpu_intrinsics i;
> +		volatile void *dummy_addr;
> +		uint8_t dummy_sz;
> +
> +		rte_cpu_get_intrinsics_support(&i);
> +
> +		if (!i.power_monitor) {
> +			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +
> +		/* check if the device supports the necessary PMD API */
> +		if (rte_eth_get_wake_addr(port_id, queue_id,
> +				&dummy_addr, &dummy_expected,
> +				&dummy_mask, &dummy_sz) == -ENOTSUP) {
> +			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr()\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize UMWAIT spinlock */
> +		rte_spinlock_init(&queue_cfg->umwait_lock);

I don't think you need to do that.
It is supposed to be in a valid state (otherwise you are probably in trouble anyway).

> +
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_umwait, NULL);

Would be a bit cleaner/nicer to move add_rx_callback out of switch() {}
As you have to do it always anyway.
Same thought for disable() and remove_rx_callback().

> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +	{
> +		enum power_management_env env;
> +		/* only PSTATE and ACPI modes are supported */
> +		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> +				!rte_power_check_env_supported(
> +					PM_ENV_PSTATE_CPUFREQ)) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* ensure we could initialize the power library */
> +		if (rte_power_init(lcore_id)) {
> +			ret = -EINVAL;
> +			goto end;
> +		}
> +		/* ensure we initialized the correct env */
> +		env = rte_power_get_env();
> +		if (env != PM_ENV_ACPI_CPUFREQ &&
> +				env != PM_ENV_PSTATE_CPUFREQ) {
> +			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> +			ret = -ENOTSUP;
> +			goto end;
> +		}
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
> +				queue_id, clb_scale_freq, NULL);
> +		break;
> +	}
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		/* initialize data before enabling the callback */
> +		queue_cfg->empty_poll_stats = 0;
> +		queue_cfg->cb_mode = mode;
> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +
> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> +				clb_pause, NULL);
> +		break;
> +	}
> +	ret = 0;
> +
> +end:
> +	return ret;
> +}
> +
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +		uint16_t port_id, uint16_t queue_id)
> +{
> +	struct pmd_queue_cfg *queue_cfg;
> +	int ret;
> +
> +	queue_cfg = &port_cfg[port_id][queue_id];
> +
> +	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
> +		ret = -EINVAL;
> +		goto end;
> +	}
> +
> +	switch (queue_cfg->cb_mode) {
> +	case RTE_POWER_MGMT_TYPE_WAIT:
> +		rte_spinlock_lock(&queue_cfg->umwait_lock);
> +
> +		/* wake up the core from UMWAIT sleep, if any */
> +		if (queue_cfg->wait_addr != NULL)
> +			umwait_wakeup(queue_cfg->wait_addr);
> +
> +		rte_spinlock_unlock(&queue_cfg->umwait_lock);
> +		/* fall-through */
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		break;
> +	case RTE_POWER_MGMT_TYPE_SCALE:
> +		rte_power_freq_max(lcore_id);
> +		rte_eth_remove_rx_callback(port_id, queue_id,
> +				queue_cfg->cur_cb);
> +		rte_power_exit(lcore_id);
> +		break;
> +	}
> +	/*
> +	 * we don't free the RX callback here because it is unsafe to do so
> +	 * unless we know for a fact that all data plane threads have stopped.
> +	 */
> +	queue_cfg->cur_cb = NULL;
> +	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	ret = 0;
> +end:
> +	return ret;
> +}
> diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
> new file mode 100644
> index 0000000000..a7a3f98268
> --- /dev/null
> +++ b/lib/librte_power/rte_power_pmd_mgmt.h
> @@ -0,0 +1,92 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_PMD_MGMT_H
> +#define _RTE_POWER_PMD_MGMT_H
> +
> +/**
> + * @file
> + * RTE PMD Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_power.h>
> +#include <rte_atomic.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * PMD Power Management Type
> + */
> +enum rte_power_pmd_mgmt_type {
> +	/** WAIT callback mode. */
> +	RTE_POWER_MGMT_TYPE_WAIT = 1,
> +	/** PAUSE callback mode. */
> +	RTE_POWER_MGMT_TYPE_PAUSE,
> +	/** Freq Scaling callback mode. */
> +	RTE_POWER_MGMT_TYPE_SCALE,
> +};
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Setup per-queue power management callback.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @param mode
> + *   The power management callback function type.
> + *
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id,
> +				enum rte_power_pmd_mgmt_type mode);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Remove per-queue power management callback.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @param lcore_id
> + *   lcore_id.
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The queue identifier of the Ethernet device.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> +				uint16_t port_id,
> +				uint16_t queue_id);
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 69ca9af616..3f2f6cd6f6 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -34,4 +34,8 @@ EXPERIMENTAL {
>  	rte_power_guest_channel_receive_msg;
>  	rte_power_poll_stat_fetch;
>  	rte_power_poll_stat_update;
> +	# added in 20.11
> +	rte_power_pmd_mgmt_queue_enable;
> +	rte_power_pmd_mgmt_queue_disable;
> +
>  };
> --
> 2.17.1
  
Anatoly Burakov Oct. 15, 2020, 10:31 a.m. UTC | #3
On 14-Oct-20 7:41 PM, Ananyev, Konstantin wrote:
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> Add a simple on/off switch that will enable saving power when no
>> packets are arriving. It is based on counting the number of empty
>> polls and, when the number reaches a certain threshold, entering an
>> architecture-defined optimized power state that will wait either
>> until a TSC timestamp expires or until packets arrive.
>>
>> This API mandates a core-to-single-queue mapping (that is, multiple
>> queues per device are supported, but they have to be polled on different
>> cores).
>>
>> This design uses PMD RX callbacks.
>>
>> 1. UMWAIT/UMONITOR:
>>
>>     When a certain threshold of empty polls is reached, the core will go
>>     into a power-optimized sleep while waiting on the address of the next
>>     RX descriptor to be written to.
>>
>> 2. Pause instruction
>>
>>     Instead of moving the core into a deeper C-state, this method uses
>>     the pause instruction to avoid busy polling.
>>
>> 3. Frequency scaling
>>     Reuse the existing DPDK power library to scale core frequency up/down
>>     depending on traffic volume.
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v6:
>>      - Added wakeup mechanism for UMWAIT
>>      - Removed memory allocation (everything is now allocated statically)
>>      - Fixed various typos and comments
>>      - Check for invalid queue ID
>>      - Moved release notes to this patch
>>
>>      v5:
>>      - Make error checking more robust
>>        - Prevent initializing scaling if ACPI or PSTATE env wasn't set
>>        - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
>>      - Add some debug logging
>>      - Replace x86-specific code path to generic path using the intrinsic check
>>


<snip>

> 
> 
> I think you need to check the state here, and _disable() has to set the state with the lock grabbed.
> Otherwise this lock wouldn't protect you from race conditions.
> As an example:
> 
> CP@T0:
> 	rte_spinlock_lock(&queue_cfg->umwait_lock);
> 	if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
> 	rte_spinlock_unlock(&queue_cfg->umwait_lock);
> 
> DP@T1:
> 	rte_spinlock_lock(&queue_cfg->umwait_lock);
> 	queue_cfg->wait_addr = target_addr;
> 	monitor_sync(...);  // DP was put to sleep
> 
> CP@T2:
> 	queue_cfg->cur_cb = NULL;
> 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> 	ret = 0;
> 
> rte_power_pmd_mgmt_queue_disable() finished with success,
> but the DP core wasn't woken up.
> 
> To be more specific:
> clb_umwait(...) {
> 	...
> 	lock(&qcfg->lck);
> 	if (qcfg->state == ENABLED)  {
> 		qcfg->wake_addr = addr;
> 		monitor_sync(addr, ...,&qcfg->lck);
> 	}
> 	unlock(&qcfg->lck);
> 	...
> }
> 
> _disable(...) {
> 	...
> 	lock(&qcfg->lck);
> 	qcfg->state = DISABLED;
> 	if (qcfg->wake_addr != NULL)
> 		monitor_wakeup(qcfg->wake_addr);
> 	unlock(&qcfg->lck);
> 	...
> }

True, didn't think of that. Will fix.

>> +
>> +		if (!i.power_monitor) {
>> +			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> +			ret = -ENOTSUP;
>> +			goto end;
>> +		}
>> +
>> +		/* check if the device supports the necessary PMD API */
>> +		if (rte_eth_get_wake_addr(port_id, queue_id,
>> +				&dummy_addr, &dummy_expected,
>> +				&dummy_mask, &dummy_sz) == -ENOTSUP) {
>> +			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr()\n");
>> +			ret = -ENOTSUP;
>> +			goto end;
>> +		}
>> +		/* initialize UMWAIT spinlock */
>> +		rte_spinlock_init(&queue_cfg->umwait_lock);
> 
> I don't think you need to do that.
> It is supposed to be in a valid state (otherwise you are probably in trouble anyway).

This is mostly for initialization, for when we first run the callback. I 
suppose we could do it in an RTE_INIT() function or just leave it be 
since the spinlocks are part of a statically allocated structure and 
will default to 0 anyway (although it wouldn't be proper usage of the 
API, as that would be relying on an implementation detail).
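
E.g. a rough (hypothetical) sketch of the RTE_INIT() variant:

RTE_INIT(rte_power_pmd_mgmt_init_locks)
{
	uint16_t i, j;

	/* put every per-queue lock into a known unlocked state */
	for (i = 0; i < RTE_MAX_ETHPORTS; i++)
		for (j = 0; j < RTE_MAX_QUEUES_PER_PORT; j++)
			rte_spinlock_init(&port_cfg[i][j].umwait_lock);
}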

> 
>> +
>> +		/* initialize data before enabling the callback */
>> +		queue_cfg->empty_poll_stats = 0;
>> +		queue_cfg->cb_mode = mode;
>> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +
>> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +				clb_umwait, NULL);
> 
> Would be a bit cleaner/nicer to move add_rx_callback out of switch() {}
> As you have to do it always anyway.
> Same thought for disable() and remove_rx_callback().

The functions are different for each, so we can't move them out of the
switch (unless you're suggesting to unify the callback to handle all
three modes?).
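
Or do you mean selecting the callback in the switch and calling
rte_eth_add_rx_callback() once at the end? Something like this rough,
untested sketch:

	rte_rx_callback_fn clb = NULL;

	switch (mode) {
	case RTE_POWER_MGMT_TYPE_WAIT:
		/* ... monitor intrinsic and wake address checks as before ... */
		clb = clb_umwait;
		break;
	case RTE_POWER_MGMT_TYPE_SCALE:
		/* ... power env checks as before ... */
		clb = clb_scale_freq;
		break;
	case RTE_POWER_MGMT_TYPE_PAUSE:
		clb = clb_pause;
		break;
	default:
		ret = -EINVAL;
		goto end;
	}

	/* common initialization, done once for all modes */
	queue_cfg->empty_poll_stats = 0;
	queue_cfg->cb_mode = mode;
	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
			clb, NULL);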
  
Ananyev, Konstantin Oct. 15, 2020, 4:02 p.m. UTC | #4
> On 14-Oct-20 7:41 PM, Ananyev, Konstantin wrote:
> >> From: Liang Ma <liang.j.ma@intel.com>
> >>
> >> Add a simple on/off switch that will enable saving power when no
> >> packets are arriving. It is based on counting the number of empty
> >> polls and, when the number reaches a certain threshold, entering an
> >> architecture-defined optimized power state that will wait either
> >> until a TSC timestamp expires or until packets arrive.
> >>
> >> This API mandates a core-to-single-queue mapping (that is, multiple
> >> queues per device are supported, but they have to be polled on different
> >> cores).
> >>
> >> This design uses PMD RX callbacks.
> >>
> >> 1. UMWAIT/UMONITOR:
> >>
> >>     When a certain threshold of empty polls is reached, the core will go
> >>     into a power-optimized sleep while waiting on the address of the next
> >>     RX descriptor to be written to.
> >>
> >> 2. Pause instruction
> >>
> >>     Instead of moving the core into a deeper C-state, this method uses
> >>     the pause instruction to avoid busy polling.
> >>
> >> 3. Frequency scaling
> >>     Reuse the existing DPDK power library to scale core frequency up/down
> >>     depending on traffic volume.
> >>
> >> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>
> >> Notes:
> >>      v6:
> >>      - Added wakeup mechanism for UMWAIT
> >>      - Removed memory allocation (everything is now allocated statically)
> >>      - Fixed various typos and comments
> >>      - Check for invalid queue ID
> >>      - Moved release notes to this patch
> >>
> >>      v5:
> >>      - Make error checking more robust
> >>        - Prevent initializing scaling if ACPI or PSTATE env wasn't set
> >>        - Prevent initializing UMWAIT path if PMD doesn't support get_wake_addr
> >>      - Add some debug logging
> >>      - Replace x86-specific code path to generic path using the intrinsic check
> >>
> 
> 
> <snip>
> 
> >
> >
> > I think you need to check the state here, and _disable() has to set the state with the lock grabbed.
> > Otherwise this lock wouldn't protect you from race conditions.
> > As an example:
> >
> > CP@T0:
> > 	rte_spinlock_lock(&queue_cfg->umwait_lock);
> > 	if (queue_cfg->wait_addr != NULL) //wait_addr == NULL, fallthrough
> > 	rte_spinlock_unlock(&queue_cfg->umwait_lock);
> >
> > DP@T1:
> > 	rte_spinlock_lock(&queue_cfg->umwait_lock);
> > 	queue_cfg->wait_addr = target_addr;
> > 	monitor_sync(...);  // DP was put to sleep
> >
> > CP@T2:
> > 	queue_cfg->cur_cb = NULL;
> > 	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> > 	ret = 0;
> >
> > rte_power_pmd_mgmt_queue_disable() finished with success,
> > but the DP core wasn't woken up.
> >
> > To be more specific:
> > clb_umwait(...) {
> > 	...
> > 	lock(&qcfg->lck);
> > 	if (qcfg->state == ENABLED)  {
> > 		qcfg->wake_addr = addr;
> > 		monitor_sync(addr, ...,&qcfg->lck);
> > 	}
> > 	unlock(&qcfg->lck);
> > 	...
> > }
> >
> > _disable(...) {
> > 	...
> > 	lock(&qcfg->lck);
> > 	qcfg->state = DISABLED;
> > 	if (qcfg->wake_addr != NULL)
> > 		monitor_wakeup(qcfg->wake_addr);
> > 	unlock(&qcfg->lck);
> > 	...
> > }
> 
> True, didn't think of that. Will fix.
> 
> >> +
> >> +		if (!i.power_monitor) {
> >> +			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> >> +			ret = -ENOTSUP;
> >> +			goto end;
> >> +		}
> >> +
> >> +		/* check if the device supports the necessary PMD API */
> >> +		if (rte_eth_get_wake_addr(port_id, queue_id,
> >> +				&dummy_addr, &dummy_expected,
> >> +				&dummy_mask, &dummy_sz) == -ENOTSUP) {
> >> +			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr()\n");
> >> +			ret = -ENOTSUP;
> >> +			goto end;
> >> +		}
> >> +		/* initialize UMWAIT spinlock */
> >> +		rte_spinlock_init(&queue_cfg->umwait_lock);
> >
> > I don't think you need to do that.
> > It is supposed to be in a valid state (otherwise you are probably in trouble anyway).
> 
> This is mostly for initialization, for when we first run the callback. I
> suppose we could do it in an RTE_INIT() function or just leave it be
> since the spinlocks are part of a statically allocated structure and
> will default to 0 anyway

Yes static initialization should be sufficient here.

> (although it wouldn't be proper usage of the
> API, as that would be relying on an implementation detail).

What I am trying to say is that at that moment the spinlock value should be zero.
If it is not, then your DP callback is still running and holding the lock.
I understand that it should be a *very* rare situation, but it seems
strange to me to silently overwrite the locked value from another thread.
It shouldn't cause any major trouble, but still...

> 
> >
> >> +
> >> +		/* initialize data before enabling the callback */
> >> +		queue_cfg->empty_poll_stats = 0;
> >> +		queue_cfg->cb_mode = mode;
> >> +		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> >> +
> >> +		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> >> +				clb_umwait, NULL);
> >
> > Would be a bit cleaner/nicer to move add_rx_callback out of switch() {}
> > As you have to do it always anyway.
> > Same thought for disable() and remove_rx_callback().
> 
> The functions are different for each, so we can't move them out of the
> switch (unless you're suggesting to unify the callback to handle all
> three modes?).
> 
> --
> Thanks,
> Anatoly
  

Patch

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index ca4f43f7f9..06b822aa36 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -197,6 +197,17 @@  New Features
   * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
     which can process up to 32 flows in parallel. Requires AVX512 support.
 
+* **Add PMD power management mechanism**
+
+  Three new Ethernet PMD power management mechanisms are added through the
+  existing RX callback infrastructure.
+
+  * Add power saving scheme based on UMWAIT instruction (x86 only)
+  * Add power saving scheme based on ``rte_pause()``
+  * Add power saving scheme based on frequency scaling through the power library
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_enable()``
+  * Add new EXPERIMENTAL API ``rte_power_pmd_mgmt_queue_disable()``
+
 
 Removed Items
 -------------
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 78c031c943..cc3c7a8646 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -9,6 +9,7 @@  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
 		'power_kvm_vm.c', 'guest_channel.c',
 		'rte_power_empty_poll.c',
 		'power_pstate_cpufreq.c',
+		'rte_power_pmd_mgmt.c',
 		'power_common.c')
-headers = files('rte_power.h','rte_power_empty_poll.h')
-deps += ['timer']
+headers = files('rte_power.h','rte_power_empty_poll.h','rte_power_pmd_mgmt.h')
+deps += ['timer' ,'ethdev']
diff --git a/lib/librte_power/rte_power_pmd_mgmt.c b/lib/librte_power/rte_power_pmd_mgmt.c
new file mode 100644
index 0000000000..2b7d2a1a46
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.c
@@ -0,0 +1,300 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_cpuflags.h>
+#include <rte_malloc.h>
+#include <rte_ethdev.h>
+#include <rte_power_intrinsics.h>
+
+#include "rte_power_pmd_mgmt.h"
+
+#define EMPTYPOLL_MAX  512
+
+/**
+ * Possible power management states of an ethdev port.
+ */
+enum pmd_mgmt_state {
+	/** Device power management is disabled. */
+	PMD_MGMT_DISABLED = 0,
+	/** Device power management is enabled. */
+	PMD_MGMT_ENABLED,
+};
+
+struct pmd_queue_cfg {
+	enum pmd_mgmt_state pwr_mgmt_state;
+	/**< State of power management for this queue */
+	enum rte_power_pmd_mgmt_type cb_mode;
+	/**< Callback mode for this queue */
+	const struct rte_eth_rxtx_callback *cur_cb;
+	/**< Callback instance */
+	rte_spinlock_t umwait_lock;
+	/**< Per-queue status lock - used only for UMWAIT mode */
+	volatile void *wait_addr;
+	/**< UMWAIT wakeup address */
+	uint64_t empty_poll_stats;
+	/**< Number of empty polls */
+} __rte_cache_aligned;
+
+static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+
+/* trigger a write to the cache line we're waiting on */
+static void umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
+static uint16_t
+clb_umwait(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+			volatile void *target_addr;
+			uint64_t expected, mask;
+			uint16_t ret;
+			uint8_t data_sz;
+
+			/*
+			 * get address of next descriptor in the RX
+			 * ring for this queue, as well as expected
+			 * value and a mask.
+			 */
+			ret = rte_eth_get_wake_addr(port_id, qidx,
+					&target_addr, &expected, &mask,
+					&data_sz);
+			if (ret == 0) {
+				/*
+				 * we need to ensure we can wake up by another
+				 * thread triggering a write, so we need the
+				 * address to always be up to date.
+				 */
+				rte_spinlock_lock(&q_conf->umwait_lock);
+				q_conf->wait_addr = target_addr;
+				/* -1ULL is maximum value for TSC */
+				rte_power_monitor_sync(target_addr, expected,
+						mask, -1ULL, data_sz,
+						&q_conf->umwait_lock);
+				/* erase the address */
+				q_conf->wait_addr = NULL;
+				rte_spinlock_unlock(&q_conf->umwait_lock);
+			}
+		}
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_pause(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *addr __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		/* sleep for 1 microsecond */
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			rte_delay_us(1);
+	} else
+		q_conf->empty_poll_stats = 0;
+
+	return nb_rx;
+}
+
+static uint16_t
+clb_scale_freq(uint16_t port_id, uint16_t qidx,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+{
+	struct pmd_queue_cfg *q_conf;
+
+	q_conf = &port_cfg[port_id][qidx];
+
+	if (unlikely(nb_rx == 0)) {
+		q_conf->empty_poll_stats++;
+		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
+			/* scale down freq */
+			rte_power_freq_min(rte_lcore_id());
+	} else {
+		q_conf->empty_poll_stats = 0;
+		/* scale up freq */
+		rte_power_freq_max(rte_lcore_id());
+	}
+
+	return nb_rx;
+}
+
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id,
+		enum rte_power_pmd_mgmt_type mode)
+{
+	struct rte_eth_dev *dev;
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+	dev = &rte_eth_devices[port_id];
+
+	/* check if queue id is valid */
+	if (queue_id >= dev->data->nb_rx_queues ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		return -EINVAL;
+	}
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_ENABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+	{
+		/* check if rte_power_monitor is supported */
+		uint64_t dummy_expected, dummy_mask;
+		struct rte_cpu_intrinsics i;
+		volatile void *dummy_addr;
+		uint8_t dummy_sz;
+
+		rte_cpu_get_intrinsics_support(&i);
+
+		if (!i.power_monitor) {
+			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+
+		/* check if the device supports the necessary PMD API */
+		if (rte_eth_get_wake_addr(port_id, queue_id,
+				&dummy_addr, &dummy_expected,
+				&dummy_mask, &dummy_sz) == -ENOTSUP) {
+			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_wake_addr()\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize UMWAIT spinlock */
+		rte_spinlock_init(&queue_cfg->umwait_lock);
+
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_umwait, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	{
+		enum power_management_env env;
+		/* only PSTATE and ACPI modes are supported */
+		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+				!rte_power_check_env_supported(
+					PM_ENV_PSTATE_CPUFREQ)) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* ensure we could initialize the power library */
+		if (rte_power_init(lcore_id)) {
+			ret = -EINVAL;
+			goto end;
+		}
+		/* ensure we initialized the correct env */
+		env = rte_power_get_env();
+		if (env != PM_ENV_ACPI_CPUFREQ &&
+				env != PM_ENV_PSTATE_CPUFREQ) {
+			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+			ret = -ENOTSUP;
+			goto end;
+		}
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id,
+				queue_id, clb_scale_freq, NULL);
+		break;
+	}
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* initialize data before enabling the callback */
+		queue_cfg->empty_poll_stats = 0;
+		queue_cfg->cb_mode = mode;
+		queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+
+		queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
+				clb_pause, NULL);
+		break;
+	}
+	ret = 0;
+
+end:
+	return ret;
+}
+
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+		uint16_t port_id, uint16_t queue_id)
+{
+	struct pmd_queue_cfg *queue_cfg;
+	int ret;
+
+	queue_cfg = &port_cfg[port_id][queue_id];
+
+	if (queue_cfg->pwr_mgmt_state == PMD_MGMT_DISABLED) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	switch (queue_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_WAIT:
+		rte_spinlock_lock(&queue_cfg->umwait_lock);
+
+		/* wake up the core from UMWAIT sleep, if any */
+		if (queue_cfg->wait_addr != NULL)
+			umwait_wakeup(queue_cfg->wait_addr);
+
+		rte_spinlock_unlock(&queue_cfg->umwait_lock);
+		/* fall-through */
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		break;
+	case RTE_POWER_MGMT_TYPE_SCALE:
+		rte_power_freq_max(lcore_id);
+		rte_eth_remove_rx_callback(port_id, queue_id,
+				queue_cfg->cur_cb);
+		rte_power_exit(lcore_id);
+		break;
+	}
+	/*
+	 * we don't free the RX callback here because it is unsafe to do so
+	 * unless we know for a fact that all data plane threads have stopped.
+	 */
+	queue_cfg->cur_cb = NULL;
+	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	ret = 0;
+end:
+	return ret;
+}
diff --git a/lib/librte_power/rte_power_pmd_mgmt.h b/lib/librte_power/rte_power_pmd_mgmt.h
new file mode 100644
index 0000000000..a7a3f98268
--- /dev/null
+++ b/lib/librte_power/rte_power_pmd_mgmt.h
@@ -0,0 +1,92 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_PMD_MGMT_H
+#define _RTE_POWER_PMD_MGMT_H
+
+/**
+ * @file
+ * RTE PMD Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_power.h>
+#include <rte_atomic.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * PMD Power Management Type
+ */
+enum rte_power_pmd_mgmt_type {
+	/** WAIT callback mode. */
+	RTE_POWER_MGMT_TYPE_WAIT = 1,
+	/** PAUSE callback mode. */
+	RTE_POWER_MGMT_TYPE_PAUSE,
+	/** Freq Scaling callback mode. */
+	RTE_POWER_MGMT_TYPE_SCALE,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Setup per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @param mode
+ *   The power management callback function type.
+ *
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_enable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id,
+				enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Remove per-queue power management callback.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @param lcore_id
+ *   lcore_id.
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The queue identifier of the Ethernet device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
+				uint16_t port_id,
+				uint16_t queue_id);
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 69ca9af616..3f2f6cd6f6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -34,4 +34,8 @@  EXPERIMENTAL {
 	rte_power_guest_channel_receive_msg;
 	rte_power_poll_stat_fetch;
 	rte_power_poll_stat_update;
+	# added in 20.11
+	rte_power_pmd_mgmt_queue_enable;
+	rte_power_pmd_mgmt_queue_disable;
+
 };