diff mbox series

[v5,5/7] power: support callbacks for multiple Rx queues

Message ID 8f5d030a77aa2f0e95e9680cb911b4e8db30c879.1624981670.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers show
Series Enhancements for PMD power management | expand

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Anatoly Burakov June 29, 2021, 3:48 p.m. UTC
Currently, there is a hard limitation on the PMD power management
support that only allows it to support a single queue per lcore. This is
not ideal as most DPDK use cases will poll multiple queues per core.

The PMD power management mechanism relies on ethdev Rx callbacks, so it
is very difficult to implement such support because callbacks are
effectively stateless and have no visibility into what the other ethdev
devices are doing. This places limitations on what we can do within the
framework of Rx callbacks, but the basics of this implementation are as
follows:

- Replace per-queue structures with per-lcore ones, so that any device
  polled from the same lcore can share data
- Any queue that is going to be polled from a specific lcore has to be
  added to the list of queues to poll, so that the callback is aware of
  other queues being polled by the same lcore
- Both the empty poll counter and the actual power saving mechanism is
  shared between all queues polled on a particular lcore, and is only
  activated when all queues in the list were polled and were determined
  to have no traffic.
- The limitation on UMWAIT-based polling is not removed because UMWAIT
  is incapable of monitoring more than one address.

Also, while we're at it, update and improve the docs.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    v5:
    - Remove the "power save queue" API and replace it with mechanism suggested by
      Konstantin
    
    v3:
    - Move the list of supported NICs to NIC feature table
    
    v2:
    - Use a TAILQ for queues instead of a static array
    - Address feedback from Konstantin
    - Add additional checks for stopped queues

 doc/guides/nics/features.rst           |  10 +
 doc/guides/prog_guide/power_man.rst    |  65 ++--
 doc/guides/rel_notes/release_21_08.rst |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
 4 files changed, 373 insertions(+), 136 deletions(-)

Comments

David Hunt June 30, 2021, 9:52 a.m. UTC | #1
Hi Anatoly,

On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
>
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
>
> - Replace per-queue structures with per-lcore ones, so that any device
>    polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>    added to the list of queues to poll, so that the callback is aware of
>    other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
>    shared between all queues polled on a particular lcore, and is only
>    activated when all queues in the list were polled and were determined
>    to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>    is incapable of monitoring more than one address.
>
> Also, while we're at it, update and improve the docs.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>
> Notes:
>      v5:
>      - Remove the "power save queue" API and replace it with mechanism suggested by
>        Konstantin
>      
>      v3:
>      - Move the list of supported NICs to NIC feature table
>      
>      v2:
>      - Use a TAILQ for queues instead of a static array
>      - Address feedback from Konstantin
>      - Add additional checks for stopped queues
>
>   doc/guides/nics/features.rst           |  10 +
>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>   doc/guides/rel_notes/release_21_08.rst |   3 +
>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>   4 files changed, 373 insertions(+), 136 deletions(-)
>

--snip--

>   int
>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>   {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>   	struct rte_eth_dev_info info;
>   	rte_rx_callback_fn clb;
>   	int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		goto end;
>   	}
>   
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
>   
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		ret = ret < 0 ? -EINVAL : -EBUSY;
> +		goto end;
> +	}
> +
> +	/* if callback was already enabled, check current callback type */
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> +			lcore_cfg->cb_mode != mode) {
>   		ret = -EINVAL;
>   		goto end;
>   	}
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   
>   	switch (mode) {
>   	case RTE_POWER_MGMT_TYPE_MONITOR:
> -	{
> -		struct rte_power_monitor_cond dummy;
> -
> -		/* check if rte_power_monitor is supported */
> -		if (!global_data.intrinsics_support.power_monitor) {
> -			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_monitor(lcore_cfg, &qdata);
> +		if (ret < 0)
>   			goto end;
> -		}
>   
> -		/* check if the device supports the necessary PMD API */
> -		if (rte_eth_get_monitor_addr(port_id, queue_id,
> -				&dummy) == -ENOTSUP) {
> -			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>   		clb = clb_umwait;
>   		break;
> -	}
>   	case RTE_POWER_MGMT_TYPE_SCALE:
> -	{
> -		enum power_management_env env;
> -		/* only PSTATE and ACPI modes are supported */
> -		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> -				!rte_power_check_env_supported(
> -					PM_ENV_PSTATE_CPUFREQ)) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_scale(lcore_id);
> +		if (ret < 0)
>   			goto end;
> -		}
> -		/* ensure we could initialize the power library */
> -		if (rte_power_init(lcore_id)) {
> -			ret = -EINVAL;
> -			goto end;
> -		}
> -		/* ensure we initialized the correct env */
> -		env = rte_power_get_env();
> -		if (env != PM_ENV_ACPI_CPUFREQ &&
> -				env != PM_ENV_PSTATE_CPUFREQ) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>   		clb = clb_scale_freq;
>   		break;
> -	}
>   	case RTE_POWER_MGMT_TYPE_PAUSE:
>   		/* figure out various time-to-tsc conversions */
>   		if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>   		ret = -EINVAL;
>   		goto end;
>   	}
> +	/* add this queue to the list */
> +	ret = queue_list_add(lcore_cfg, &qdata);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> +				strerror(-ret));
> +		goto end;
> +	}
> +	/* new queue is always added last */
> +	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);


Need to ensure that queue_cfg gets set here, otherwise we'll get a 
segfault below.



>   
>   	/* initialize data before enabling the callback */
> -	queue_cfg->empty_poll_stats = 0;
> -	queue_cfg->cb_mode = mode;
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -			clb, NULL);
> +	if (lcore_cfg->n_queues == 1) {
> +		lcore_cfg->cb_mode = mode;
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	}
> +	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> +			clb, queue_cfg);
--snip--
Ananyev, Konstantin June 30, 2021, 11:04 a.m. UTC | #2
> Currently, there is a hard limitation on the PMD power management
> support that only allows it to support a single queue per lcore. This is
> not ideal as most DPDK use cases will poll multiple queues per core.
> 
> The PMD power management mechanism relies on ethdev Rx callbacks, so it
> is very difficult to implement such support because callbacks are
> effectively stateless and have no visibility into what the other ethdev
> devices are doing. This places limitations on what we can do within the
> framework of Rx callbacks, but the basics of this implementation are as
> follows:
> 
> - Replace per-queue structures with per-lcore ones, so that any device
>   polled from the same lcore can share data
> - Any queue that is going to be polled from a specific lcore has to be
>   added to the list of queues to poll, so that the callback is aware of
>   other queues being polled by the same lcore
> - Both the empty poll counter and the actual power saving mechanism is
>   shared between all queues polled on a particular lcore, and is only
>   activated when all queues in the list were polled and were determined
>   to have no traffic.
> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>   is incapable of monitoring more than one address.
> 
> Also, while we're at it, update and improve the docs.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     v5:
>     - Remove the "power save queue" API and replace it with mechanism suggested by
>       Konstantin
> 
>     v3:
>     - Move the list of supported NICs to NIC feature table
> 
>     v2:
>     - Use a TAILQ for queues instead of a static array
>     - Address feedback from Konstantin
>     - Add additional checks for stopped queues
> 
>  doc/guides/nics/features.rst           |  10 +
>  doc/guides/prog_guide/power_man.rst    |  65 ++--
>  doc/guides/rel_notes/release_21_08.rst |   3 +
>  lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>  4 files changed, 373 insertions(+), 136 deletions(-)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 403c2b03a3..a96e12d155 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>  * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>  * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
> 
> +.. _nic_features_get_monitor_addr:
> +
> +PMD power management using monitor addresses
> +--------------------------------------------
> +
> +Supports getting a monitoring condition to use together with Ethernet PMD power
> +management (see :doc:`../prog_guide/power_man` for more details).
> +
> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
> +
>  .. _nic_features_other:
> 
>  Other dev ops not represented by a Feature
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index c70ae128ac..ec04a72108 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>  Abstract
>  ~~~~~~~~
> 
> -Existing power management mechanisms require developers
> -to change application design or change code to make use of it.
> -The PMD power management API provides a convenient alternative
> -by utilizing Ethernet PMD RX callbacks,
> -and triggering power saving whenever empty poll count reaches a certain number.
> -
> -Monitor
> -   This power saving scheme will put the CPU into optimized power state
> -   and use the ``rte_power_monitor()`` function
> -   to monitor the Ethernet PMD RX descriptor address,
> -   and wake the CPU up whenever there's new traffic.
> -
> -Pause
> -   This power saving scheme will avoid busy polling
> -   by either entering power-optimized sleep state
> -   with ``rte_power_pause()`` function,
> -   or, if it's not available, use ``rte_pause()``.
> -
> -Frequency scaling
> -   This power saving scheme will use ``librte_power`` library
> -   functionality to scale the core frequency up/down
> -   depending on traffic volume.
> -
> -.. note::
> -
> -   Currently, this power management API is limited to mandatory mapping
> -   of 1 queue to 1 core (multiple queues are supported,
> -   but they must be polled from different cores).
> +Existing power management mechanisms require developers to change application
> +design or change code to make use of it. The PMD power management API provides a
> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
> +power saving whenever empty poll count reaches a certain number.
> +
> +* Monitor
> +   This power saving scheme will put the CPU into optimized power state and
> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
> +   there's new traffic. Support for this scheme may not be available on all
> +   platforms, and further limitations may apply (see below).
> +
> +* Pause
> +   This power saving scheme will avoid busy polling by either entering
> +   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
> +   not supported by the underlying platform, use ``rte_pause()``.
> +
> +* Frequency scaling
> +   This power saving scheme will use ``librte_power`` library functionality to
> +   scale the core frequency up/down depending on traffic volume.
> +
> +The "monitor" mode is only supported in the following configurations and scenarios:
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
> +  limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
> +  monitored from a different lcore).
> +
> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
> +  ``rte_power_monitor()`` function is not supported, then monitor mode will not
> +  be supported.
> +
> +* Not all Ethernet drivers support monitoring, even if the underlying
> +  platform may support the necessary CPU instructions. Please refer to
> +  :doc:`../nics/overview` for more information.
> +
> 
>  API Overview for Ethernet PMD Power Management
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -242,3 +249,5 @@ References
> 
>  *   The :doc:`../sample_app_ug/vm_power_management`
>      chapter in the :doc:`../sample_app_ug/index` section.
> +
> +*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
> index f015c509fc..3926d45ef8 100644
> --- a/doc/guides/rel_notes/release_21_08.rst
> +++ b/doc/guides/rel_notes/release_21_08.rst
> @@ -57,6 +57,9 @@ New Features
> 
>  * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
> 
> +* rte_power: The experimental PMD power management API now supports managing
> +  multiple Ethernet Rx queues per lcore.
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
> index 9b95cf1794..fccfd236c2 100644
> --- a/lib/power/rte_power_pmd_mgmt.c
> +++ b/lib/power/rte_power_pmd_mgmt.c
> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
>  	PMD_MGMT_ENABLED
>  };
> 
> -struct pmd_queue_cfg {
> +union queue {
> +	uint32_t val;
> +	struct {
> +		uint16_t portid;
> +		uint16_t qid;
> +	};
> +};
> +
> +struct queue_list_entry {
> +	TAILQ_ENTRY(queue_list_entry) next;
> +	union queue queue;
> +	uint64_t n_empty_polls;
> +	const struct rte_eth_rxtx_callback *cb;
> +};
> +
> +struct pmd_core_cfg {
> +	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
> +	/**< List of queues associated with this lcore */
> +	size_t n_queues;
> +	/**< How many queues are in the list? */
>  	volatile enum pmd_mgmt_state pwr_mgmt_state;
>  	/**< State of power management for this queue */
>  	enum rte_power_pmd_mgmt_type cb_mode;
>  	/**< Callback mode for this queue */
> -	const struct rte_eth_rxtx_callback *cur_cb;
> -	/**< Callback instance */
> -	uint64_t empty_poll_stats;
> -	/**< Number of empty polls */
> +	uint64_t n_queues_ready_to_sleep;
> +	/**< Number of queues ready to enter power optimized state */
>  } __rte_cache_aligned;
> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
> 
> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
> +static inline bool
> +queue_equal(const union queue *l, const union queue *r)
> +{
> +	return l->val == r->val;
> +}
> +
> +static inline void
> +queue_copy(union queue *dst, const union queue *src)
> +{
> +	dst->val = src->val;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *cur;
> +
> +	TAILQ_FOREACH(cur, &cfg->head, next) {
> +		if (queue_equal(&cur->queue, q))
> +			return cur;
> +	}
> +	return NULL;
> +}
> +
> +static int
> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *qle;
> +
> +	/* is it already in the list? */
> +	if (queue_list_find(cfg, q) != NULL)
> +		return -EEXIST;
> +
> +	qle = malloc(sizeof(*qle));
> +	if (qle == NULL)
> +		return -ENOMEM;
> +	memset(qle, 0, sizeof(*qle));
> +
> +	queue_copy(&qle->queue, q);
> +	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
> +	cfg->n_queues++;
> +	qle->n_empty_polls = 0;
> +
> +	return 0;
> +}
> +
> +static struct queue_list_entry *
> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
> +{
> +	struct queue_list_entry *found;
> +
> +	found = queue_list_find(cfg, q);
> +	if (found == NULL)
> +		return NULL;
> +
> +	TAILQ_REMOVE(&cfg->head, found, next);
> +	cfg->n_queues--;
> +
> +	/* freeing is responsibility of the caller */
> +	return found;
> +}
> 
>  static void
>  calc_tsc(void)
> @@ -74,21 +152,56 @@ calc_tsc(void)
>  	}
>  }
> 
> +static inline void
> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	/* reset empty poll counter for this queue */
> +	qcfg->n_empty_polls = 0;
> +	/* reset the sleep counter too */
> +	cfg->n_queues_ready_to_sleep = 0;
> +}
> +
> +static inline bool
> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
> +{
> +	/* this function is called - that means we have an empty poll */
> +	qcfg->n_empty_polls++;
> +
> +	/* if we haven't reached threshold for empty polls, we can't sleep */
> +	if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
> +		return false;
> +
> +	/* we're ready to sleep */
> +	cfg->n_queues_ready_to_sleep++;
> +
> +	return true;
> +}
> +
> +static inline bool
> +lcore_can_sleep(struct pmd_core_cfg *cfg)
> +{
> +	/* are all queues ready to sleep? */
> +	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
> +		return false;
> +
> +	/* we've reached an iteration where we can sleep, reset sleep counter */
> +	cfg->n_queues_ready_to_sleep = 0;
> +
> +	return true;
> +}

As I can see it a slightly modified one from what was discussed.
I understand that it seems simpler, but I think there are some problems with it:
- each queue can be counted more than once at lcore_cfg->n_queues_ready_to_sleep
- queues n_empty_polls are not reset after sleep().

To illustrate the problem, let say we have 2 queues, and at some moment we have:
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1
cfg->n_queues_ready_to_sleep == 2

So lcore_can_sleep() returns 'true' and sets:
cfg->n_queues_ready_to_sleep == 0

Now, after sleep():
q0.n_empty_polls == EMPTYPOLL_MAX + 1
q1.n_empty_polls == EMPTYPOLL_MAX + 1

 So after:
queue_can_sleep(q0);
queue_can_sleep(q1);

will have:
cfg->n_queues_ready_to_sleep == 2 
again, and we'll go to another sleep after just one rx_burst() attempt for each queue.

> +
>  static uint16_t
>  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -		void *addr __rte_unused)
> +		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>  {
> +	struct queue_list_entry *queue_conf = arg;
> 
> -	struct pmd_queue_cfg *q_conf;
> -
> -	q_conf = &port_cfg[port_id][qidx];
> -
> +	/* this callback can't do more than one queue, omit multiqueue logic */
>  	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> +		queue_conf->n_empty_polls++;
> +		if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>  			struct rte_power_monitor_cond pmc;
> -			uint16_t ret;
> +			int ret;
> 
>  			/* use monitoring condition to sleep */
>  			ret = rte_eth_get_monitor_addr(port_id, qidx,
> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>  				rte_power_monitor(&pmc, UINT64_MAX);
>  		}
>  	} else
> -		q_conf->empty_poll_stats = 0;
> +		queue_conf->n_empty_polls = 0;
> 
>  	return nb_rx;
>  }
> 
>  static uint16_t
> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
> -		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
> -		void *addr __rte_unused)
> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
> +		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> +		uint16_t max_pkts __rte_unused, void *arg)
>  {
> -	struct pmd_queue_cfg *q_conf;
> +	const unsigned int lcore = rte_lcore_id();
> +	struct queue_list_entry *queue_conf = arg;
> +	struct pmd_core_cfg *lcore_conf;
> +	const bool empty = nb_rx == 0;
> 
> -	q_conf = &port_cfg[port_id][qidx];
> +	lcore_conf = &lcore_cfgs[lcore];
> 
> -	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		/* sleep for 1 microsecond */
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> -			/* use tpause if we have it */
> -			if (global_data.intrinsics_support.power_pause) {
> -				const uint64_t cur = rte_rdtsc();
> -				const uint64_t wait_tsc =
> -						cur + global_data.tsc_per_us;
> -				rte_power_pause(wait_tsc);
> -			} else {
> -				uint64_t i;
> -				for (i = 0; i < global_data.pause_per_us; i++)
> -					rte_pause();
> -			}
> +	if (likely(!empty))
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> +	else {
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_rx;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_rx;
> +
> +		/* sleep for 1 microsecond, use tpause if we have it */
> +		if (global_data.intrinsics_support.power_pause) {
> +			const uint64_t cur = rte_rdtsc();
> +			const uint64_t wait_tsc =
> +					cur + global_data.tsc_per_us;
> +			rte_power_pause(wait_tsc);
> +		} else {
> +			uint64_t i;
> +			for (i = 0; i < global_data.pause_per_us; i++)
> +				rte_pause();
>  		}
> -	} else
> -		q_conf->empty_poll_stats = 0;
> +	}
> 
>  	return nb_rx;
>  }
> 
>  static uint16_t
> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>  		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> -		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> +		uint16_t max_pkts __rte_unused, void *arg)
>  {
> -	struct pmd_queue_cfg *q_conf;
> +	const unsigned int lcore = rte_lcore_id();
> +	const bool empty = nb_rx == 0;
> +	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
> +	struct queue_list_entry *queue_conf = arg;
> 
> -	q_conf = &port_cfg[port_id][qidx];
> +	if (likely(!empty)) {
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> 
> -	if (unlikely(nb_rx == 0)) {
> -		q_conf->empty_poll_stats++;
> -		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
> -			/* scale down freq */
> -			rte_power_freq_min(rte_lcore_id());
> -	} else {
> -		q_conf->empty_poll_stats = 0;
> -		/* scale up freq */
> +		/* scale up freq immediately */
>  		rte_power_freq_max(rte_lcore_id());
> +	} else {
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_rx;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_rx;
> +
> +		rte_power_freq_min(rte_lcore_id());
>  	}
> 
>  	return nb_rx;
> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
>  	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
>  }
> 
> +static int
> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
> +{
> +	const struct queue_list_entry *entry;
> +
> +	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
> +		const union queue *q = &entry->queue;
> +		int ret = queue_stopped(q->portid, q->qid);
> +		if (ret != 1)
> +			return ret;
> +	}
> +	return 1;
> +}
> +
> +static int
> +check_scale(unsigned int lcore)
> +{
> +	enum power_management_env env;
> +
> +	/* only PSTATE and ACPI modes are supported */
> +	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> +	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
> +		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> +		return -ENOTSUP;
> +	}
> +	/* ensure we could initialize the power library */
> +	if (rte_power_init(lcore))
> +		return -EINVAL;
> +
> +	/* ensure we initialized the correct env */
> +	env = rte_power_get_env();
> +	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
> +		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* we're done */
> +	return 0;
> +}
> +
> +static int
> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
> +{
> +	struct rte_power_monitor_cond dummy;
> +
> +	/* check if rte_power_monitor is supported */
> +	if (!global_data.intrinsics_support.power_monitor) {
> +		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> +		return -ENOTSUP;
> +	}
> +
> +	if (cfg->n_queues > 0) {
> +		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* check if the device supports the necessary PMD API */
> +	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
> +			&dummy) == -ENOTSUP) {
> +		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> +		return -ENOTSUP;
> +	}
> +
> +	/* we're done */
> +	return 0;
> +}
> +
>  int
>  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>  {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>  	struct rte_eth_dev_info info;
>  	rte_rx_callback_fn clb;
>  	int ret;
> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		goto end;
>  	}
> 
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
> 
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		ret = ret < 0 ? -EINVAL : -EBUSY;
> +		goto end;
> +	}
> +
> +	/* if callback was already enabled, check current callback type */
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
> +			lcore_cfg->cb_mode != mode) {
>  		ret = -EINVAL;
>  		goto end;
>  	}
> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
> 
>  	switch (mode) {
>  	case RTE_POWER_MGMT_TYPE_MONITOR:
> -	{
> -		struct rte_power_monitor_cond dummy;
> -
> -		/* check if rte_power_monitor is supported */
> -		if (!global_data.intrinsics_support.power_monitor) {
> -			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_monitor(lcore_cfg, &qdata);
> +		if (ret < 0)
>  			goto end;
> -		}
> 
> -		/* check if the device supports the necessary PMD API */
> -		if (rte_eth_get_monitor_addr(port_id, queue_id,
> -				&dummy) == -ENOTSUP) {
> -			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>  		clb = clb_umwait;
>  		break;
> -	}
>  	case RTE_POWER_MGMT_TYPE_SCALE:
> -	{
> -		enum power_management_env env;
> -		/* only PSTATE and ACPI modes are supported */
> -		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
> -				!rte_power_check_env_supported(
> -					PM_ENV_PSTATE_CPUFREQ)) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
> -			ret = -ENOTSUP;
> +		/* check if we can add a new queue */
> +		ret = check_scale(lcore_id);
> +		if (ret < 0)
>  			goto end;
> -		}
> -		/* ensure we could initialize the power library */
> -		if (rte_power_init(lcore_id)) {
> -			ret = -EINVAL;
> -			goto end;
> -		}
> -		/* ensure we initialized the correct env */
> -		env = rte_power_get_env();
> -		if (env != PM_ENV_ACPI_CPUFREQ &&
> -				env != PM_ENV_PSTATE_CPUFREQ) {
> -			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
> -			ret = -ENOTSUP;
> -			goto end;
> -		}
>  		clb = clb_scale_freq;
>  		break;
> -	}
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
>  		/* figure out various time-to-tsc conversions */
>  		if (global_data.tsc_per_us == 0)
> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>  		ret = -EINVAL;
>  		goto end;
>  	}
> +	/* add this queue to the list */
> +	ret = queue_list_add(lcore_cfg, &qdata);
> +	if (ret < 0) {
> +		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
> +				strerror(-ret));
> +		goto end;
> +	}
> +	/* new queue is always added last */
> +	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
> 
>  	/* initialize data before enabling the callback */
> -	queue_cfg->empty_poll_stats = 0;
> -	queue_cfg->cb_mode = mode;
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> -	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
> -			clb, NULL);
> +	if (lcore_cfg->n_queues == 1) {
> +		lcore_cfg->cb_mode = mode;
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
> +	}
> +	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
> +			clb, queue_cfg);
> 
>  	ret = 0;
>  end:
> @@ -290,7 +476,9 @@ int
>  rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  		uint16_t port_id, uint16_t queue_id)
>  {
> -	struct pmd_queue_cfg *queue_cfg;
> +	const union queue qdata = {.portid = port_id, .qid = queue_id};
> +	struct pmd_core_cfg *lcore_cfg;
> +	struct queue_list_entry *queue_cfg;
>  	int ret;
> 
>  	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  	}
> 
>  	/* no need to check queue id as wrong queue id would not be enabled */
> -	queue_cfg = &port_cfg[port_id][queue_id];
> +	lcore_cfg = &lcore_cfgs[lcore_id];
> 
> -	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
> +	/* check if other queues are stopped as well */
> +	ret = cfg_queues_stopped(lcore_cfg);
> +	if (ret != 1) {
> +		/* error means invalid queue, 0 means queue wasn't stopped */
> +		return ret < 0 ? -EINVAL : -EBUSY;
> +	}
> +
> +	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>  		return -EINVAL;
> 
> -	/* stop any callbacks from progressing */
> -	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +	/*
> +	 * There is no good/easy way to do this without race conditions, so we
> +	 * are just going to throw our hands in the air and hope that the user
> +	 * has read the documentation and has ensured that ports are stopped at
> +	 * the time we enter the API functions.
> +	 */
> +	queue_cfg = queue_list_take(lcore_cfg, &qdata);
> +	if (queue_cfg == NULL)
> +		return -ENOENT;
> 
> -	switch (queue_cfg->cb_mode) {
> +	/* if we've removed all queues from the lists, set state to disabled */
> +	if (lcore_cfg->n_queues == 0)
> +		lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
> +
> +	switch (lcore_cfg->cb_mode) {
>  	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>  	case RTE_POWER_MGMT_TYPE_PAUSE:
> -		rte_eth_remove_rx_callback(port_id, queue_id,
> -				queue_cfg->cur_cb);
> +		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>  		break;
>  	case RTE_POWER_MGMT_TYPE_SCALE:
>  		rte_power_freq_max(lcore_id);
> -		rte_eth_remove_rx_callback(port_id, queue_id,
> -				queue_cfg->cur_cb);
> +		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>  		rte_power_exit(lcore_id);
>  		break;
>  	}
> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>  	 * ports before calling any of these API's, so we can assume that the
>  	 * callbacks can be freed. we're intentionally casting away const-ness.
>  	 */
> -	rte_free((void *)queue_cfg->cur_cb);
> +	rte_free((void *)queue_cfg->cb);
> +	free(queue_cfg);
> 
>  	return 0;
>  }
> +
> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
> +	size_t i;
> +
> +	/* initialize all tailqs */
> +	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
> +		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
> +		TAILQ_INIT(&cfg->head);
> +	}
> +}
> --
> 2.25.1
David Hunt July 1, 2021, 9:01 a.m. UTC | #3
On 30/6/2021 10:52 AM, David Hunt wrote:
> Hi Anatoly,
>
> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>>    polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>>    added to the list of queues to poll, so that the callback is aware of
>>    other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>>    shared between all queues polled on a particular lcore, and is only
>>    activated when all queues in the list were polled and were determined
>>    to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>    is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>>      - Remove the "power save queue" API and replace it with 
>> mechanism suggested by
>>        Konstantin
>>           v3:
>>      - Move the list of supported NICs to NIC feature table
>>           v2:
>>      - Use a TAILQ for queues instead of a static array
>>      - Address feedback from Konstantin
>>      - Add additional checks for stopped queues
>>
>>   doc/guides/nics/features.rst           |  10 +
>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>
>
> --snip--
>
>>   int
>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t 
>> port_id,
>>           uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>   {
>> -    struct pmd_queue_cfg *queue_cfg;
>> +    const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +    struct pmd_core_cfg *lcore_cfg;
>> +    struct queue_list_entry *queue_cfg;
>>       struct rte_eth_dev_info info;
>>       rte_rx_callback_fn clb;
>>       int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int 
>> lcore_id, uint16_t port_id,
>>           goto end;
>>       }
>>   -    queue_cfg = &port_cfg[port_id][queue_id];
>> +    lcore_cfg = &lcore_cfgs[lcore_id];
>>   -    if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> +    /* check if other queues are stopped as well */
>> +    ret = cfg_queues_stopped(lcore_cfg);
>> +    if (ret != 1) {
>> +        /* error means invalid queue, 0 means queue wasn't stopped */
>> +        ret = ret < 0 ? -EINVAL : -EBUSY;
>> +        goto end;
>> +    }
>> +
>> +    /* if callback was already enabled, check current callback type */
>> +    if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> +            lcore_cfg->cb_mode != mode) {
>>           ret = -EINVAL;
>>           goto end;
>>       }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>> int lcore_id, uint16_t port_id,
>>         switch (mode) {
>>       case RTE_POWER_MGMT_TYPE_MONITOR:
>> -    {
>> -        struct rte_power_monitor_cond dummy;
>> -
>> -        /* check if rte_power_monitor is supported */
>> -        if (!global_data.intrinsics_support.power_monitor) {
>> -            RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not 
>> supported\n");
>> -            ret = -ENOTSUP;
>> +        /* check if we can add a new queue */
>> +        ret = check_monitor(lcore_cfg, &qdata);
>> +        if (ret < 0)
>>               goto end;
>> -        }
>>   -        /* check if the device supports the necessary PMD API */
>> -        if (rte_eth_get_monitor_addr(port_id, queue_id,
>> -                &dummy) == -ENOTSUP) {
>> -            RTE_LOG(DEBUG, POWER, "The device does not support 
>> rte_eth_get_monitor_addr\n");
>> -            ret = -ENOTSUP;
>> -            goto end;
>> -        }
>>           clb = clb_umwait;
>>           break;
>> -    }
>>       case RTE_POWER_MGMT_TYPE_SCALE:
>> -    {
>> -        enum power_management_env env;
>> -        /* only PSTATE and ACPI modes are supported */
>> -        if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> -                !rte_power_check_env_supported(
>> -                    PM_ENV_PSTATE_CPUFREQ)) {
>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are 
>> supported\n");
>> -            ret = -ENOTSUP;
>> +        /* check if we can add a new queue */
>> +        ret = check_scale(lcore_id);
>> +        if (ret < 0)
>>               goto end;
>> -        }
>> -        /* ensure we could initialize the power library */
>> -        if (rte_power_init(lcore_id)) {
>> -            ret = -EINVAL;
>> -            goto end;
>> -        }
>> -        /* ensure we initialized the correct env */
>> -        env = rte_power_get_env();
>> -        if (env != PM_ENV_ACPI_CPUFREQ &&
>> -                env != PM_ENV_PSTATE_CPUFREQ) {
>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes 
>> were initialized\n");
>> -            ret = -ENOTSUP;
>> -            goto end;
>> -        }
>>           clb = clb_scale_freq;
>>           break;
>> -    }
>>       case RTE_POWER_MGMT_TYPE_PAUSE:
>>           /* figure out various time-to-tsc conversions */
>>           if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>> int lcore_id, uint16_t port_id,
>>           ret = -EINVAL;
>>           goto end;
>>       }
>> +    /* add this queue to the list */
>> +    ret = queue_list_add(lcore_cfg, &qdata);
>> +    if (ret < 0) {
>> +        RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> +                strerror(-ret));
>> +        goto end;
>> +    }
>> +    /* new queue is always added last */
>> +    queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>
>
> Need to ensure that queue_cfg gets set here, otherwise we'll get a 
> segfault below.
>

Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?


>
>
>>         /* initialize data before enabling the callback */
>> -    queue_cfg->empty_poll_stats = 0;
>> -    queue_cfg->cb_mode = mode;
>> -    queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -    queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -            clb, NULL);
>> +    if (lcore_cfg->n_queues == 1) {
>> +        lcore_cfg->cb_mode = mode;
>> +        lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +    }
>> +    queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +            clb, queue_cfg);
> --snip--
>
Anatoly Burakov July 5, 2021, 10:23 a.m. UTC | #4
On 30-Jun-21 12:04 PM, Ananyev, Konstantin wrote:
> 
> 
> 
>> Currently, there is a hard limitation on the PMD power management
>> support that only allows it to support a single queue per lcore. This is
>> not ideal as most DPDK use cases will poll multiple queues per core.
>>
>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>> is very difficult to implement such support because callbacks are
>> effectively stateless and have no visibility into what the other ethdev
>> devices are doing. This places limitations on what we can do within the
>> framework of Rx callbacks, but the basics of this implementation are as
>> follows:
>>
>> - Replace per-queue structures with per-lcore ones, so that any device
>>    polled from the same lcore can share data
>> - Any queue that is going to be polled from a specific lcore has to be
>>    added to the list of queues to poll, so that the callback is aware of
>>    other queues being polled by the same lcore
>> - Both the empty poll counter and the actual power saving mechanism is
>>    shared between all queues polled on a particular lcore, and is only
>>    activated when all queues in the list were polled and were determined
>>    to have no traffic.
>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>    is incapable of monitoring more than one address.
>>
>> Also, while we're at it, update and improve the docs.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>
>> Notes:
>>      v5:
>>      - Remove the "power save queue" API and replace it with mechanism suggested by
>>        Konstantin
>>
>>      v3:
>>      - Move the list of supported NICs to NIC feature table
>>
>>      v2:
>>      - Use a TAILQ for queues instead of a static array
>>      - Address feedback from Konstantin
>>      - Add additional checks for stopped queues
>>
>>   doc/guides/nics/features.rst           |  10 +
>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>
>> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
>> index 403c2b03a3..a96e12d155 100644
>> --- a/doc/guides/nics/features.rst
>> +++ b/doc/guides/nics/features.rst
>> @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information.
>>   * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
>>   * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
>>
>> +.. _nic_features_get_monitor_addr:
>> +
>> +PMD power management using monitor addresses
>> +--------------------------------------------
>> +
>> +Supports getting a monitoring condition to use together with Ethernet PMD power
>> +management (see :doc:`../prog_guide/power_man` for more details).
>> +
>> +* **[implements] eth_dev_ops**: ``get_monitor_addr``
>> +
>>   .. _nic_features_other:
>>
>>   Other dev ops not represented by a Feature
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index c70ae128ac..ec04a72108 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -198,34 +198,41 @@ Ethernet PMD Power Management API
>>   Abstract
>>   ~~~~~~~~
>>
>> -Existing power management mechanisms require developers
>> -to change application design or change code to make use of it.
>> -The PMD power management API provides a convenient alternative
>> -by utilizing Ethernet PMD RX callbacks,
>> -and triggering power saving whenever empty poll count reaches a certain number.
>> -
>> -Monitor
>> -   This power saving scheme will put the CPU into optimized power state
>> -   and use the ``rte_power_monitor()`` function
>> -   to monitor the Ethernet PMD RX descriptor address,
>> -   and wake the CPU up whenever there's new traffic.
>> -
>> -Pause
>> -   This power saving scheme will avoid busy polling
>> -   by either entering power-optimized sleep state
>> -   with ``rte_power_pause()`` function,
>> -   or, if it's not available, use ``rte_pause()``.
>> -
>> -Frequency scaling
>> -   This power saving scheme will use ``librte_power`` library
>> -   functionality to scale the core frequency up/down
>> -   depending on traffic volume.
>> -
>> -.. note::
>> -
>> -   Currently, this power management API is limited to mandatory mapping
>> -   of 1 queue to 1 core (multiple queues are supported,
>> -   but they must be polled from different cores).
>> +Existing power management mechanisms require developers to change application
>> +design or change code to make use of it. The PMD power management API provides a
>> +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
>> +power saving whenever empty poll count reaches a certain number.
>> +
>> +* Monitor
>> +   This power saving scheme will put the CPU into optimized power state and
>> +   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
>> +   there's new traffic. Support for this scheme may not be available on all
>> +   platforms, and further limitations may apply (see below).
>> +
>> +* Pause
>> +   This power saving scheme will avoid busy polling by either entering
>> +   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
>> +   not supported by the underlying platform, use ``rte_pause()``.
>> +
>> +* Frequency scaling
>> +   This power saving scheme will use ``librte_power`` library functionality to
>> +   scale the core frequency up/down depending on traffic volume.
>> +
>> +The "monitor" mode is only supported in the following configurations and scenarios:
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that
>> +  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>> +  limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>> +  monitored from a different lcore).
>> +
>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>> +  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>> +  be supported.
>> +
>> +* Not all Ethernet drivers support monitoring, even if the underlying
>> +  platform may support the necessary CPU instructions. Please refer to
>> +  :doc:`../nics/overview` for more information.
>> +
>>
>>   API Overview for Ethernet PMD Power Management
>>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> @@ -242,3 +249,5 @@ References
>>
>>   *   The :doc:`../sample_app_ug/vm_power_management`
>>       chapter in the :doc:`../sample_app_ug/index` section.
>> +
>> +*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
>> diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
>> index f015c509fc..3926d45ef8 100644
>> --- a/doc/guides/rel_notes/release_21_08.rst
>> +++ b/doc/guides/rel_notes/release_21_08.rst
>> @@ -57,6 +57,9 @@ New Features
>>
>>   * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
>>
>> +* rte_power: The experimental PMD power management API now supports managing
>> +  multiple Ethernet Rx queues per lcore.
>> +
>>
>>   Removed Items
>>   -------------
>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>> index 9b95cf1794..fccfd236c2 100644
>> --- a/lib/power/rte_power_pmd_mgmt.c
>> +++ b/lib/power/rte_power_pmd_mgmt.c
>> @@ -33,18 +33,96 @@ enum pmd_mgmt_state {
>>        PMD_MGMT_ENABLED
>>   };
>>
>> -struct pmd_queue_cfg {
>> +union queue {
>> +     uint32_t val;
>> +     struct {
>> +             uint16_t portid;
>> +             uint16_t qid;
>> +     };
>> +};
>> +
>> +struct queue_list_entry {
>> +     TAILQ_ENTRY(queue_list_entry) next;
>> +     union queue queue;
>> +     uint64_t n_empty_polls;
>> +     const struct rte_eth_rxtx_callback *cb;
>> +};
>> +
>> +struct pmd_core_cfg {
>> +     TAILQ_HEAD(queue_list_head, queue_list_entry) head;
>> +     /**< List of queues associated with this lcore */
>> +     size_t n_queues;
>> +     /**< How many queues are in the list? */
>>        volatile enum pmd_mgmt_state pwr_mgmt_state;
>>        /**< State of power management for this queue */
>>        enum rte_power_pmd_mgmt_type cb_mode;
>>        /**< Callback mode for this queue */
>> -     const struct rte_eth_rxtx_callback *cur_cb;
>> -     /**< Callback instance */
>> -     uint64_t empty_poll_stats;
>> -     /**< Number of empty polls */
>> +     uint64_t n_queues_ready_to_sleep;
>> +     /**< Number of queues ready to enter power optimized state */
>>   } __rte_cache_aligned;
>> +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
>>
>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>> +static inline bool
>> +queue_equal(const union queue *l, const union queue *r)
>> +{
>> +     return l->val == r->val;
>> +}
>> +
>> +static inline void
>> +queue_copy(union queue *dst, const union queue *src)
>> +{
>> +     dst->val = src->val;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *cur;
>> +
>> +     TAILQ_FOREACH(cur, &cfg->head, next) {
>> +             if (queue_equal(&cur->queue, q))
>> +                     return cur;
>> +     }
>> +     return NULL;
>> +}
>> +
>> +static int
>> +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *qle;
>> +
>> +     /* is it already in the list? */
>> +     if (queue_list_find(cfg, q) != NULL)
>> +             return -EEXIST;
>> +
>> +     qle = malloc(sizeof(*qle));
>> +     if (qle == NULL)
>> +             return -ENOMEM;
>> +     memset(qle, 0, sizeof(*qle));
>> +
>> +     queue_copy(&qle->queue, q);
>> +     TAILQ_INSERT_TAIL(&cfg->head, qle, next);
>> +     cfg->n_queues++;
>> +     qle->n_empty_polls = 0;
>> +
>> +     return 0;
>> +}
>> +
>> +static struct queue_list_entry *
>> +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
>> +{
>> +     struct queue_list_entry *found;
>> +
>> +     found = queue_list_find(cfg, q);
>> +     if (found == NULL)
>> +             return NULL;
>> +
>> +     TAILQ_REMOVE(&cfg->head, found, next);
>> +     cfg->n_queues--;
>> +
>> +     /* freeing is responsibility of the caller */
>> +     return found;
>> +}
>>
>>   static void
>>   calc_tsc(void)
>> @@ -74,21 +152,56 @@ calc_tsc(void)
>>        }
>>   }
>>
>> +static inline void
>> +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> +     /* reset empty poll counter for this queue */
>> +     qcfg->n_empty_polls = 0;
>> +     /* reset the sleep counter too */
>> +     cfg->n_queues_ready_to_sleep = 0;
>> +}
>> +
>> +static inline bool
>> +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
>> +{
>> +     /* this function is called - that means we have an empty poll */
>> +     qcfg->n_empty_polls++;
>> +
>> +     /* if we haven't reached threshold for empty polls, we can't sleep */
>> +     if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
>> +             return false;
>> +
>> +     /* we're ready to sleep */
>> +     cfg->n_queues_ready_to_sleep++;
>> +
>> +     return true;
>> +}
>> +
>> +static inline bool
>> +lcore_can_sleep(struct pmd_core_cfg *cfg)
>> +{
>> +     /* are all queues ready to sleep? */
>> +     if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
>> +             return false;
>> +
>> +     /* we've reached an iteration where we can sleep, reset sleep counter */
>> +     cfg->n_queues_ready_to_sleep = 0;
>> +
>> +     return true;
>> +}
> 
> As I can see it a slightly modified one from what was discussed.
> I understand that it seems simpler, but I think there are some problems with it:
> - each queue can be counted more than once at lcore_cfg->n_queues_ready_to_sleep
> - queues n_empty_polls are not reset after sleep().
> 

The latter is intentional: we *want* to sleep constantly once we pass 
the empty poll counter.

The former shouldn't be a big problem in the conventional case as i 
don't think there are situations where people would poll core-pinned 
queues in different orders, but you're right, this is a potential issue 
and should be fixed. I'll add back the n_sleeps in the next iteration.

> To illustrate the problem, let say we have 2 queues, and at some moment we have:
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
> cfg->n_queues_ready_to_sleep == 2
> 
> So lcore_can_sleep() returns 'true' and sets:
> cfg->n_queues_ready_to_sleep == 0
> 
> Now, after sleep():
> q0.n_empty_polls == EMPTYPOLL_MAX + 1
> q1.n_empty_polls == EMPTYPOLL_MAX + 1
> 
>   So after:
> queue_can_sleep(q0);
> queue_can_sleep(q1);
> 
> will have:
> cfg->n_queues_ready_to_sleep == 2
> again, and we'll go to another sleep after just one rx_burst() attempt for each queue.
> 
>> +
>>   static uint16_t
>>   clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> -             uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> -             void *addr __rte_unused)
>> +             uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> +     struct queue_list_entry *queue_conf = arg;
>>
>> -     struct pmd_queue_cfg *q_conf;
>> -
>> -     q_conf = &port_cfg[port_id][qidx];
>> -
>> +     /* this callback can't do more than one queue, omit multiqueue logic */
>>        if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> +             queue_conf->n_empty_polls++;
>> +             if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
>>                        struct rte_power_monitor_cond pmc;
>> -                     uint16_t ret;
>> +                     int ret;
>>
>>                        /* use monitoring condition to sleep */
>>                        ret = rte_eth_get_monitor_addr(port_id, qidx,
>> @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>>                                rte_power_monitor(&pmc, UINT64_MAX);
>>                }
>>        } else
>> -             q_conf->empty_poll_stats = 0;
>> +             queue_conf->n_empty_polls = 0;
>>
>>        return nb_rx;
>>   }
>>
>>   static uint16_t
>> -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
>> -             uint16_t nb_rx, uint16_t max_pkts __rte_unused,
>> -             void *addr __rte_unused)
>> +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> +             uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> -     struct pmd_queue_cfg *q_conf;
>> +     const unsigned int lcore = rte_lcore_id();
>> +     struct queue_list_entry *queue_conf = arg;
>> +     struct pmd_core_cfg *lcore_conf;
>> +     const bool empty = nb_rx == 0;
>>
>> -     q_conf = &port_cfg[port_id][qidx];
>> +     lcore_conf = &lcore_cfgs[lcore];
>>
>> -     if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             /* sleep for 1 microsecond */
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>> -                     /* use tpause if we have it */
>> -                     if (global_data.intrinsics_support.power_pause) {
>> -                             const uint64_t cur = rte_rdtsc();
>> -                             const uint64_t wait_tsc =
>> -                                             cur + global_data.tsc_per_us;
>> -                             rte_power_pause(wait_tsc);
>> -                     } else {
>> -                             uint64_t i;
>> -                             for (i = 0; i < global_data.pause_per_us; i++)
>> -                                     rte_pause();
>> -                     }
>> +     if (likely(!empty))
>> +             /* early exit */
>> +             queue_reset(lcore_conf, queue_conf);
>> +     else {
>> +             /* can this queue sleep? */
>> +             if (!queue_can_sleep(lcore_conf, queue_conf))
>> +                     return nb_rx;
>> +
>> +             /* can this lcore sleep? */
>> +             if (!lcore_can_sleep(lcore_conf))
>> +                     return nb_rx;
>> +
>> +             /* sleep for 1 microsecond, use tpause if we have it */
>> +             if (global_data.intrinsics_support.power_pause) {
>> +                     const uint64_t cur = rte_rdtsc();
>> +                     const uint64_t wait_tsc =
>> +                                     cur + global_data.tsc_per_us;
>> +                     rte_power_pause(wait_tsc);
>> +             } else {
>> +                     uint64_t i;
>> +                     for (i = 0; i < global_data.pause_per_us; i++)
>> +                             rte_pause();
>>                }
>> -     } else
>> -             q_conf->empty_poll_stats = 0;
>> +     }
>>
>>        return nb_rx;
>>   }
>>
>>   static uint16_t
>> -clb_scale_freq(uint16_t port_id, uint16_t qidx,
>> +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
>>                struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>> -             uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> +             uint16_t max_pkts __rte_unused, void *arg)
>>   {
>> -     struct pmd_queue_cfg *q_conf;
>> +     const unsigned int lcore = rte_lcore_id();
>> +     const bool empty = nb_rx == 0;
>> +     struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
>> +     struct queue_list_entry *queue_conf = arg;
>>
>> -     q_conf = &port_cfg[port_id][qidx];
>> +     if (likely(!empty)) {
>> +             /* early exit */
>> +             queue_reset(lcore_conf, queue_conf);
>>
>> -     if (unlikely(nb_rx == 0)) {
>> -             q_conf->empty_poll_stats++;
>> -             if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
>> -                     /* scale down freq */
>> -                     rte_power_freq_min(rte_lcore_id());
>> -     } else {
>> -             q_conf->empty_poll_stats = 0;
>> -             /* scale up freq */
>> +             /* scale up freq immediately */
>>                rte_power_freq_max(rte_lcore_id());
>> +     } else {
>> +             /* can this queue sleep? */
>> +             if (!queue_can_sleep(lcore_conf, queue_conf))
>> +                     return nb_rx;
>> +
>> +             /* can this lcore sleep? */
>> +             if (!lcore_can_sleep(lcore_conf))
>> +                     return nb_rx;
>> +
>> +             rte_power_freq_min(rte_lcore_id());
>>        }
>>
>>        return nb_rx;
>> @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id)
>>        return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
>>   }
>>
>> +static int
>> +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
>> +{
>> +     const struct queue_list_entry *entry;
>> +
>> +     TAILQ_FOREACH(entry, &queue_cfg->head, next) {
>> +             const union queue *q = &entry->queue;
>> +             int ret = queue_stopped(q->portid, q->qid);
>> +             if (ret != 1)
>> +                     return ret;
>> +     }
>> +     return 1;
>> +}
>> +
>> +static int
>> +check_scale(unsigned int lcore)
>> +{
>> +     enum power_management_env env;
>> +
>> +     /* only PSTATE and ACPI modes are supported */
>> +     if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> +         !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
>> +             RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +     /* ensure we could initialize the power library */
>> +     if (rte_power_init(lcore))
>> +             return -EINVAL;
>> +
>> +     /* ensure we initialized the correct env */
>> +     env = rte_power_get_env();
>> +     if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
>> +             RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* we're done */
>> +     return 0;
>> +}
>> +
>> +static int
>> +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
>> +{
>> +     struct rte_power_monitor_cond dummy;
>> +
>> +     /* check if rte_power_monitor is supported */
>> +     if (!global_data.intrinsics_support.power_monitor) {
>> +             RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     if (cfg->n_queues > 0) {
>> +             RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* check if the device supports the necessary PMD API */
>> +     if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
>> +                     &dummy) == -ENOTSUP) {
>> +             RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> +             return -ENOTSUP;
>> +     }
>> +
>> +     /* we're done */
>> +     return 0;
>> +}
>> +
>>   int
>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>   {
>> -     struct pmd_queue_cfg *queue_cfg;
>> +     const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +     struct pmd_core_cfg *lcore_cfg;
>> +     struct queue_list_entry *queue_cfg;
>>        struct rte_eth_dev_info info;
>>        rte_rx_callback_fn clb;
>>        int ret;
>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                goto end;
>>        }
>>
>> -     queue_cfg = &port_cfg[port_id][queue_id];
>> +     lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> -     if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>> +     /* check if other queues are stopped as well */
>> +     ret = cfg_queues_stopped(lcore_cfg);
>> +     if (ret != 1) {
>> +             /* error means invalid queue, 0 means queue wasn't stopped */
>> +             ret = ret < 0 ? -EINVAL : -EBUSY;
>> +             goto end;
>> +     }
>> +
>> +     /* if callback was already enabled, check current callback type */
>> +     if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>> +                     lcore_cfg->cb_mode != mode) {
>>                ret = -EINVAL;
>>                goto end;
>>        }
>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>
>>        switch (mode) {
>>        case RTE_POWER_MGMT_TYPE_MONITOR:
>> -     {
>> -             struct rte_power_monitor_cond dummy;
>> -
>> -             /* check if rte_power_monitor is supported */
>> -             if (!global_data.intrinsics_support.power_monitor) {
>> -                     RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
>> -                     ret = -ENOTSUP;
>> +             /* check if we can add a new queue */
>> +             ret = check_monitor(lcore_cfg, &qdata);
>> +             if (ret < 0)
>>                        goto end;
>> -             }
>>
>> -             /* check if the device supports the necessary PMD API */
>> -             if (rte_eth_get_monitor_addr(port_id, queue_id,
>> -                             &dummy) == -ENOTSUP) {
>> -                     RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
>> -                     ret = -ENOTSUP;
>> -                     goto end;
>> -             }
>>                clb = clb_umwait;
>>                break;
>> -     }
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>> -     {
>> -             enum power_management_env env;
>> -             /* only PSTATE and ACPI modes are supported */
>> -             if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>> -                             !rte_power_check_env_supported(
>> -                                     PM_ENV_PSTATE_CPUFREQ)) {
>> -                     RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
>> -                     ret = -ENOTSUP;
>> +             /* check if we can add a new queue */
>> +             ret = check_scale(lcore_id);
>> +             if (ret < 0)
>>                        goto end;
>> -             }
>> -             /* ensure we could initialize the power library */
>> -             if (rte_power_init(lcore_id)) {
>> -                     ret = -EINVAL;
>> -                     goto end;
>> -             }
>> -             /* ensure we initialized the correct env */
>> -             env = rte_power_get_env();
>> -             if (env != PM_ENV_ACPI_CPUFREQ &&
>> -                             env != PM_ENV_PSTATE_CPUFREQ) {
>> -                     RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
>> -                     ret = -ENOTSUP;
>> -                     goto end;
>> -             }
>>                clb = clb_scale_freq;
>>                break;
>> -     }
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>>                /* figure out various time-to-tsc conversions */
>>                if (global_data.tsc_per_us == 0)
>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
>>                ret = -EINVAL;
>>                goto end;
>>        }
>> +     /* add this queue to the list */
>> +     ret = queue_list_add(lcore_cfg, &qdata);
>> +     if (ret < 0) {
>> +             RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>> +                             strerror(-ret));
>> +             goto end;
>> +     }
>> +     /* new queue is always added last */
>> +     queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>>        /* initialize data before enabling the callback */
>> -     queue_cfg->empty_poll_stats = 0;
>> -     queue_cfg->cb_mode = mode;
>> -     queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> -     queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>> -                     clb, NULL);
>> +     if (lcore_cfg->n_queues == 1) {
>> +             lcore_cfg->cb_mode = mode;
>> +             lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>> +     }
>> +     queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>> +                     clb, queue_cfg);
>>
>>        ret = 0;
>>   end:
>> @@ -290,7 +476,9 @@ int
>>   rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>                uint16_t port_id, uint16_t queue_id)
>>   {
>> -     struct pmd_queue_cfg *queue_cfg;
>> +     const union queue qdata = {.portid = port_id, .qid = queue_id};
>> +     struct pmd_core_cfg *lcore_cfg;
>> +     struct queue_list_entry *queue_cfg;
>>        int ret;
>>
>>        RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
>> @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>        }
>>
>>        /* no need to check queue id as wrong queue id would not be enabled */
>> -     queue_cfg = &port_cfg[port_id][queue_id];
>> +     lcore_cfg = &lcore_cfgs[lcore_id];
>>
>> -     if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>> +     /* check if other queues are stopped as well */
>> +     ret = cfg_queues_stopped(lcore_cfg);
>> +     if (ret != 1) {
>> +             /* error means invalid queue, 0 means queue wasn't stopped */
>> +             return ret < 0 ? -EINVAL : -EBUSY;
>> +     }
>> +
>> +     if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
>>                return -EINVAL;
>>
>> -     /* stop any callbacks from progressing */
>> -     queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +     /*
>> +      * There is no good/easy way to do this without race conditions, so we
>> +      * are just going to throw our hands in the air and hope that the user
>> +      * has read the documentation and has ensured that ports are stopped at
>> +      * the time we enter the API functions.
>> +      */
>> +     queue_cfg = queue_list_take(lcore_cfg, &qdata);
>> +     if (queue_cfg == NULL)
>> +             return -ENOENT;
>>
>> -     switch (queue_cfg->cb_mode) {
>> +     /* if we've removed all queues from the lists, set state to disabled */
>> +     if (lcore_cfg->n_queues == 0)
>> +             lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
>> +
>> +     switch (lcore_cfg->cb_mode) {
>>        case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
>>        case RTE_POWER_MGMT_TYPE_PAUSE:
>> -             rte_eth_remove_rx_callback(port_id, queue_id,
>> -                             queue_cfg->cur_cb);
>> +             rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>>                break;
>>        case RTE_POWER_MGMT_TYPE_SCALE:
>>                rte_power_freq_max(lcore_id);
>> -             rte_eth_remove_rx_callback(port_id, queue_id,
>> -                             queue_cfg->cur_cb);
>> +             rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
>>                rte_power_exit(lcore_id);
>>                break;
>>        }
>> @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
>>         * ports before calling any of these API's, so we can assume that the
>>         * callbacks can be freed. we're intentionally casting away const-ness.
>>         */
>> -     rte_free((void *)queue_cfg->cur_cb);
>> +     rte_free((void *)queue_cfg->cb);
>> +     free(queue_cfg);
>>
>>        return 0;
>>   }
>> +
>> +RTE_INIT(rte_power_ethdev_pmgmt_init) {
>> +     size_t i;
>> +
>> +     /* initialize all tailqs */
>> +     for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
>> +             struct pmd_core_cfg *cfg = &lcore_cfgs[i];
>> +             TAILQ_INIT(&cfg->head);
>> +     }
>> +}
>> --
>> 2.25.1
>
Anatoly Burakov July 5, 2021, 10:24 a.m. UTC | #5
On 01-Jul-21 10:01 AM, David Hunt wrote:
> 
> On 30/6/2021 10:52 AM, David Hunt wrote:
>> Hi Anatoly,
>>
>> On 29/6/2021 4:48 PM, Anatoly Burakov wrote:
>>> Currently, there is a hard limitation on the PMD power management
>>> support that only allows it to support a single queue per lcore. This is
>>> not ideal as most DPDK use cases will poll multiple queues per core.
>>>
>>> The PMD power management mechanism relies on ethdev Rx callbacks, so it
>>> is very difficult to implement such support because callbacks are
>>> effectively stateless and have no visibility into what the other ethdev
>>> devices are doing. This places limitations on what we can do within the
>>> framework of Rx callbacks, but the basics of this implementation are as
>>> follows:
>>>
>>> - Replace per-queue structures with per-lcore ones, so that any device
>>>    polled from the same lcore can share data
>>> - Any queue that is going to be polled from a specific lcore has to be
>>>    added to the list of queues to poll, so that the callback is aware of
>>>    other queues being polled by the same lcore
>>> - Both the empty poll counter and the actual power saving mechanism is
>>>    shared between all queues polled on a particular lcore, and is only
>>>    activated when all queues in the list were polled and were determined
>>>    to have no traffic.
>>> - The limitation on UMWAIT-based polling is not removed because UMWAIT
>>>    is incapable of monitoring more than one address.
>>>
>>> Also, while we're at it, update and improve the docs.
>>>
>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>> ---
>>>
>>> Notes:
>>>      v5:
>>>      - Remove the "power save queue" API and replace it with 
>>> mechanism suggested by
>>>        Konstantin
>>>           v3:
>>>      - Move the list of supported NICs to NIC feature table
>>>           v2:
>>>      - Use a TAILQ for queues instead of a static array
>>>      - Address feedback from Konstantin
>>>      - Add additional checks for stopped queues
>>>
>>>   doc/guides/nics/features.rst           |  10 +
>>>   doc/guides/prog_guide/power_man.rst    |  65 ++--
>>>   doc/guides/rel_notes/release_21_08.rst |   3 +
>>>   lib/power/rte_power_pmd_mgmt.c         | 431 ++++++++++++++++++-------
>>>   4 files changed, 373 insertions(+), 136 deletions(-)
>>>
>>
>> --snip--
>>
>>>   int
>>>   rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t 
>>> port_id,
>>>           uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
>>>   {
>>> -    struct pmd_queue_cfg *queue_cfg;
>>> +    const union queue qdata = {.portid = port_id, .qid = queue_id};
>>> +    struct pmd_core_cfg *lcore_cfg;
>>> +    struct queue_list_entry *queue_cfg;
>>>       struct rte_eth_dev_info info;
>>>       rte_rx_callback_fn clb;
>>>       int ret;
>>> @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int 
>>> lcore_id, uint16_t port_id,
>>>           goto end;
>>>       }
>>>   -    queue_cfg = &port_cfg[port_id][queue_id];
>>> +    lcore_cfg = &lcore_cfgs[lcore_id];
>>>   -    if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
>>> +    /* check if other queues are stopped as well */
>>> +    ret = cfg_queues_stopped(lcore_cfg);
>>> +    if (ret != 1) {
>>> +        /* error means invalid queue, 0 means queue wasn't stopped */
>>> +        ret = ret < 0 ? -EINVAL : -EBUSY;
>>> +        goto end;
>>> +    }
>>> +
>>> +    /* if callback was already enabled, check current callback type */
>>> +    if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
>>> +            lcore_cfg->cb_mode != mode) {
>>>           ret = -EINVAL;
>>>           goto end;
>>>       }
>>> @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>>> int lcore_id, uint16_t port_id,
>>>         switch (mode) {
>>>       case RTE_POWER_MGMT_TYPE_MONITOR:
>>> -    {
>>> -        struct rte_power_monitor_cond dummy;
>>> -
>>> -        /* check if rte_power_monitor is supported */
>>> -        if (!global_data.intrinsics_support.power_monitor) {
>>> -            RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not 
>>> supported\n");
>>> -            ret = -ENOTSUP;
>>> +        /* check if we can add a new queue */
>>> +        ret = check_monitor(lcore_cfg, &qdata);
>>> +        if (ret < 0)
>>>               goto end;
>>> -        }
>>>   -        /* check if the device supports the necessary PMD API */
>>> -        if (rte_eth_get_monitor_addr(port_id, queue_id,
>>> -                &dummy) == -ENOTSUP) {
>>> -            RTE_LOG(DEBUG, POWER, "The device does not support 
>>> rte_eth_get_monitor_addr\n");
>>> -            ret = -ENOTSUP;
>>> -            goto end;
>>> -        }
>>>           clb = clb_umwait;
>>>           break;
>>> -    }
>>>       case RTE_POWER_MGMT_TYPE_SCALE:
>>> -    {
>>> -        enum power_management_env env;
>>> -        /* only PSTATE and ACPI modes are supported */
>>> -        if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
>>> -                !rte_power_check_env_supported(
>>> -                    PM_ENV_PSTATE_CPUFREQ)) {
>>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are 
>>> supported\n");
>>> -            ret = -ENOTSUP;
>>> +        /* check if we can add a new queue */
>>> +        ret = check_scale(lcore_id);
>>> +        if (ret < 0)
>>>               goto end;
>>> -        }
>>> -        /* ensure we could initialize the power library */
>>> -        if (rte_power_init(lcore_id)) {
>>> -            ret = -EINVAL;
>>> -            goto end;
>>> -        }
>>> -        /* ensure we initialized the correct env */
>>> -        env = rte_power_get_env();
>>> -        if (env != PM_ENV_ACPI_CPUFREQ &&
>>> -                env != PM_ENV_PSTATE_CPUFREQ) {
>>> -            RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes 
>>> were initialized\n");
>>> -            ret = -ENOTSUP;
>>> -            goto end;
>>> -        }
>>>           clb = clb_scale_freq;
>>>           break;
>>> -    }
>>>       case RTE_POWER_MGMT_TYPE_PAUSE:
>>>           /* figure out various time-to-tsc conversions */
>>>           if (global_data.tsc_per_us == 0)
>>> @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned 
>>> int lcore_id, uint16_t port_id,
>>>           ret = -EINVAL;
>>>           goto end;
>>>       }
>>> +    /* add this queue to the list */
>>> +    ret = queue_list_add(lcore_cfg, &qdata);
>>> +    if (ret < 0) {
>>> +        RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
>>> +                strerror(-ret));
>>> +        goto end;
>>> +    }
>>> +    /* new queue is always added last */
>>> +    queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
>>
>>
>> Need to ensure that queue_cfg gets set here, otherwise we'll get a 
>> segfault below.
>>
> 
> Or, looking at this again, shouldn't "lcore_cfgs" be "lcore_cfg"?

Good catch, will fix!

> 
> 
>>
>>
>>>         /* initialize data before enabling the callback */
>>> -    queue_cfg->empty_poll_stats = 0;
>>> -    queue_cfg->cb_mode = mode;
>>> -    queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> -    queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> -            clb, NULL);
>>> +    if (lcore_cfg->n_queues == 1) {
>>> +        lcore_cfg->cb_mode = mode;
>>> +        lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
>>> +    }
>>> +    queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
>>> +            clb, queue_cfg);
>> --snip--
>>
diff mbox series

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 403c2b03a3..a96e12d155 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -912,6 +912,16 @@  Supports to get Rx/Tx packet burst mode information.
 * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``.
 * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``.
 
+.. _nic_features_get_monitor_addr:
+
+PMD power management using monitor addresses
+--------------------------------------------
+
+Supports getting a monitoring condition to use together with Ethernet PMD power
+management (see :doc:`../prog_guide/power_man` for more details).
+
+* **[implements] eth_dev_ops**: ``get_monitor_addr``
+
 .. _nic_features_other:
 
 Other dev ops not represented by a Feature
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index c70ae128ac..ec04a72108 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -198,34 +198,41 @@  Ethernet PMD Power Management API
 Abstract
 ~~~~~~~~
 
-Existing power management mechanisms require developers
-to change application design or change code to make use of it.
-The PMD power management API provides a convenient alternative
-by utilizing Ethernet PMD RX callbacks,
-and triggering power saving whenever empty poll count reaches a certain number.
-
-Monitor
-   This power saving scheme will put the CPU into optimized power state
-   and use the ``rte_power_monitor()`` function
-   to monitor the Ethernet PMD RX descriptor address,
-   and wake the CPU up whenever there's new traffic.
-
-Pause
-   This power saving scheme will avoid busy polling
-   by either entering power-optimized sleep state
-   with ``rte_power_pause()`` function,
-   or, if it's not available, use ``rte_pause()``.
-
-Frequency scaling
-   This power saving scheme will use ``librte_power`` library
-   functionality to scale the core frequency up/down
-   depending on traffic volume.
-
-.. note::
-
-   Currently, this power management API is limited to mandatory mapping
-   of 1 queue to 1 core (multiple queues are supported,
-   but they must be polled from different cores).
+Existing power management mechanisms require developers to change application
+design or change code to make use of it. The PMD power management API provides a
+convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering
+power saving whenever empty poll count reaches a certain number.
+
+* Monitor
+   This power saving scheme will put the CPU into optimized power state and
+   monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever
+   there's new traffic. Support for this scheme may not be available on all
+   platforms, and further limitations may apply (see below).
+
+* Pause
+   This power saving scheme will avoid busy polling by either entering
+   power-optimized sleep state with ``rte_power_pause()`` function, or, if it's
+   not supported by the underlying platform, use ``rte_pause()``.
+
+* Frequency scaling
+   This power saving scheme will use ``librte_power`` library functionality to
+   scale the core frequency up/down depending on traffic volume.
+
+The "monitor" mode is only supported in the following configurations and scenarios:
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that
+  ``rte_power_monitor()`` is supported by the platform, then monitoring will be
+  limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
+  monitored from a different lcore).
+
+* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
+  ``rte_power_monitor()`` function is not supported, then monitor mode will not
+  be supported.
+
+* Not all Ethernet drivers support monitoring, even if the underlying
+  platform may support the necessary CPU instructions. Please refer to
+  :doc:`../nics/overview` for more information.
+
 
 API Overview for Ethernet PMD Power Management
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -242,3 +249,5 @@  References
 
 *   The :doc:`../sample_app_ug/vm_power_management`
     chapter in the :doc:`../sample_app_ug/index` section.
+
+*   The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section
diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index f015c509fc..3926d45ef8 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -57,6 +57,9 @@  New Features
 
 * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
 
+* rte_power: The experimental PMD power management API now supports managing
+  multiple Ethernet Rx queues per lcore.
+
 
 Removed Items
 -------------
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 9b95cf1794..fccfd236c2 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -33,18 +33,96 @@  enum pmd_mgmt_state {
 	PMD_MGMT_ENABLED
 };
 
-struct pmd_queue_cfg {
+union queue {
+	uint32_t val;
+	struct {
+		uint16_t portid;
+		uint16_t qid;
+	};
+};
+
+struct queue_list_entry {
+	TAILQ_ENTRY(queue_list_entry) next;
+	union queue queue;
+	uint64_t n_empty_polls;
+	const struct rte_eth_rxtx_callback *cb;
+};
+
+struct pmd_core_cfg {
+	TAILQ_HEAD(queue_list_head, queue_list_entry) head;
+	/**< List of queues associated with this lcore */
+	size_t n_queues;
+	/**< How many queues are in the list? */
 	volatile enum pmd_mgmt_state pwr_mgmt_state;
 	/**< State of power management for this queue */
 	enum rte_power_pmd_mgmt_type cb_mode;
 	/**< Callback mode for this queue */
-	const struct rte_eth_rxtx_callback *cur_cb;
-	/**< Callback instance */
-	uint64_t empty_poll_stats;
-	/**< Number of empty polls */
+	uint64_t n_queues_ready_to_sleep;
+	/**< Number of queues ready to enter power optimized state */
 } __rte_cache_aligned;
+static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
 
-static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
+static inline bool
+queue_equal(const union queue *l, const union queue *r)
+{
+	return l->val == r->val;
+}
+
+static inline void
+queue_copy(union queue *dst, const union queue *src)
+{
+	dst->val = src->val;
+}
+
+static struct queue_list_entry *
+queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *cur;
+
+	TAILQ_FOREACH(cur, &cfg->head, next) {
+		if (queue_equal(&cur->queue, q))
+			return cur;
+	}
+	return NULL;
+}
+
+static int
+queue_list_add(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *qle;
+
+	/* is it already in the list? */
+	if (queue_list_find(cfg, q) != NULL)
+		return -EEXIST;
+
+	qle = malloc(sizeof(*qle));
+	if (qle == NULL)
+		return -ENOMEM;
+	memset(qle, 0, sizeof(*qle));
+
+	queue_copy(&qle->queue, q);
+	TAILQ_INSERT_TAIL(&cfg->head, qle, next);
+	cfg->n_queues++;
+	qle->n_empty_polls = 0;
+
+	return 0;
+}
+
+static struct queue_list_entry *
+queue_list_take(struct pmd_core_cfg *cfg, const union queue *q)
+{
+	struct queue_list_entry *found;
+
+	found = queue_list_find(cfg, q);
+	if (found == NULL)
+		return NULL;
+
+	TAILQ_REMOVE(&cfg->head, found, next);
+	cfg->n_queues--;
+
+	/* freeing is responsibility of the caller */
+	return found;
+}
 
 static void
 calc_tsc(void)
@@ -74,21 +152,56 @@  calc_tsc(void)
 	}
 }
 
+static inline void
+queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+	/* reset empty poll counter for this queue */
+	qcfg->n_empty_polls = 0;
+	/* reset the sleep counter too */
+	cfg->n_queues_ready_to_sleep = 0;
+}
+
+static inline bool
+queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg)
+{
+	/* this function is called - that means we have an empty poll */
+	qcfg->n_empty_polls++;
+
+	/* if we haven't reached threshold for empty polls, we can't sleep */
+	if (qcfg->n_empty_polls <= EMPTYPOLL_MAX)
+		return false;
+
+	/* we're ready to sleep */
+	cfg->n_queues_ready_to_sleep++;
+
+	return true;
+}
+
+static inline bool
+lcore_can_sleep(struct pmd_core_cfg *cfg)
+{
+	/* are all queues ready to sleep? */
+	if (cfg->n_queues_ready_to_sleep != cfg->n_queues)
+		return false;
+
+	/* we've reached an iteration where we can sleep, reset sleep counter */
+	cfg->n_queues_ready_to_sleep = 0;
+
+	return true;
+}
+
 static uint16_t
 clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
-		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
-		void *addr __rte_unused)
+		uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg)
 {
+	struct queue_list_entry *queue_conf = arg;
 
-	struct pmd_queue_cfg *q_conf;
-
-	q_conf = &port_cfg[port_id][qidx];
-
+	/* this callback can't do more than one queue, omit multiqueue logic */
 	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
+		queue_conf->n_empty_polls++;
+		if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) {
 			struct rte_power_monitor_cond pmc;
-			uint16_t ret;
+			int ret;
 
 			/* use monitoring condition to sleep */
 			ret = rte_eth_get_monitor_addr(port_id, qidx,
@@ -97,60 +210,77 @@  clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
 				rte_power_monitor(&pmc, UINT64_MAX);
 		}
 	} else
-		q_conf->empty_poll_stats = 0;
+		queue_conf->n_empty_polls = 0;
 
 	return nb_rx;
 }
 
 static uint16_t
-clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused,
-		uint16_t nb_rx, uint16_t max_pkts __rte_unused,
-		void *addr __rte_unused)
+clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
+		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
+		uint16_t max_pkts __rte_unused, void *arg)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	struct queue_list_entry *queue_conf = arg;
+	struct pmd_core_cfg *lcore_conf;
+	const bool empty = nb_rx == 0;
 
-	q_conf = &port_cfg[port_id][qidx];
+	lcore_conf = &lcore_cfgs[lcore];
 
-	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		/* sleep for 1 microsecond */
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
-			/* use tpause if we have it */
-			if (global_data.intrinsics_support.power_pause) {
-				const uint64_t cur = rte_rdtsc();
-				const uint64_t wait_tsc =
-						cur + global_data.tsc_per_us;
-				rte_power_pause(wait_tsc);
-			} else {
-				uint64_t i;
-				for (i = 0; i < global_data.pause_per_us; i++)
-					rte_pause();
-			}
+	if (likely(!empty))
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
+	else {
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_rx;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_rx;
+
+		/* sleep for 1 microsecond, use tpause if we have it */
+		if (global_data.intrinsics_support.power_pause) {
+			const uint64_t cur = rte_rdtsc();
+			const uint64_t wait_tsc =
+					cur + global_data.tsc_per_us;
+			rte_power_pause(wait_tsc);
+		} else {
+			uint64_t i;
+			for (i = 0; i < global_data.pause_per_us; i++)
+				rte_pause();
 		}
-	} else
-		q_conf->empty_poll_stats = 0;
+	}
 
 	return nb_rx;
 }
 
 static uint16_t
-clb_scale_freq(uint16_t port_id, uint16_t qidx,
+clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
-		uint16_t max_pkts __rte_unused, void *_  __rte_unused)
+		uint16_t max_pkts __rte_unused, void *arg)
 {
-	struct pmd_queue_cfg *q_conf;
+	const unsigned int lcore = rte_lcore_id();
+	const bool empty = nb_rx == 0;
+	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct queue_list_entry *queue_conf = arg;
 
-	q_conf = &port_cfg[port_id][qidx];
+	if (likely(!empty)) {
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
 
-	if (unlikely(nb_rx == 0)) {
-		q_conf->empty_poll_stats++;
-		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX))
-			/* scale down freq */
-			rte_power_freq_min(rte_lcore_id());
-	} else {
-		q_conf->empty_poll_stats = 0;
-		/* scale up freq */
+		/* scale up freq immediately */
 		rte_power_freq_max(rte_lcore_id());
+	} else {
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_rx;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_rx;
+
+		rte_power_freq_min(rte_lcore_id());
 	}
 
 	return nb_rx;
@@ -167,11 +297,80 @@  queue_stopped(const uint16_t port_id, const uint16_t queue_id)
 	return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED;
 }
 
+static int
+cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
+{
+	const struct queue_list_entry *entry;
+
+	TAILQ_FOREACH(entry, &queue_cfg->head, next) {
+		const union queue *q = &entry->queue;
+		int ret = queue_stopped(q->portid, q->qid);
+		if (ret != 1)
+			return ret;
+	}
+	return 1;
+}
+
+static int
+check_scale(unsigned int lcore)
+{
+	enum power_management_env env;
+
+	/* only PSTATE and ACPI modes are supported */
+	if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
+	    !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
+		return -ENOTSUP;
+	}
+	/* ensure we could initialize the power library */
+	if (rte_power_init(lcore))
+		return -EINVAL;
+
+	/* ensure we initialized the correct env */
+	env = rte_power_get_env();
+	if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) {
+		RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+static int
+check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	if (cfg->n_queues > 0) {
+		RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid,
+			&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
 	struct rte_eth_dev_info info;
 	rte_rx_callback_fn clb;
 	int ret;
@@ -202,9 +401,19 @@  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	queue_cfg = &port_cfg[port_id][queue_id];
+	lcore_cfg = &lcore_cfgs[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) {
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(lcore_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		ret = ret < 0 ? -EINVAL : -EBUSY;
+		goto end;
+	}
+
+	/* if callback was already enabled, check current callback type */
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+			lcore_cfg->cb_mode != mode) {
 		ret = -EINVAL;
 		goto end;
 	}
@@ -214,53 +423,20 @@  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 
 	switch (mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR:
-	{
-		struct rte_power_monitor_cond dummy;
-
-		/* check if rte_power_monitor is supported */
-		if (!global_data.intrinsics_support.power_monitor) {
-			RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_monitor(lcore_cfg, &qdata);
+		if (ret < 0)
 			goto end;
-		}
 
-		/* check if the device supports the necessary PMD API */
-		if (rte_eth_get_monitor_addr(port_id, queue_id,
-				&dummy) == -ENOTSUP) {
-			RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_umwait;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_SCALE:
-	{
-		enum power_management_env env;
-		/* only PSTATE and ACPI modes are supported */
-		if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) &&
-				!rte_power_check_env_supported(
-					PM_ENV_PSTATE_CPUFREQ)) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n");
-			ret = -ENOTSUP;
+		/* check if we can add a new queue */
+		ret = check_scale(lcore_id);
+		if (ret < 0)
 			goto end;
-		}
-		/* ensure we could initialize the power library */
-		if (rte_power_init(lcore_id)) {
-			ret = -EINVAL;
-			goto end;
-		}
-		/* ensure we initialized the correct env */
-		env = rte_power_get_env();
-		if (env != PM_ENV_ACPI_CPUFREQ &&
-				env != PM_ENV_PSTATE_CPUFREQ) {
-			RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n");
-			ret = -ENOTSUP;
-			goto end;
-		}
 		clb = clb_scale_freq;
 		break;
-	}
 	case RTE_POWER_MGMT_TYPE_PAUSE:
 		/* figure out various time-to-tsc conversions */
 		if (global_data.tsc_per_us == 0)
@@ -273,13 +449,23 @@  rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		ret = -EINVAL;
 		goto end;
 	}
+	/* add this queue to the list */
+	ret = queue_list_add(lcore_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
+	/* new queue is always added last */
+	queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head);
 
 	/* initialize data before enabling the callback */
-	queue_cfg->empty_poll_stats = 0;
-	queue_cfg->cb_mode = mode;
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
-	queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id,
-			clb, NULL);
+	if (lcore_cfg->n_queues == 1) {
+		lcore_cfg->cb_mode = mode;
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
+	queue_cfg->cb = rte_eth_add_rx_callback(port_id, queue_id,
+			clb, queue_cfg);
 
 	ret = 0;
 end:
@@ -290,7 +476,9 @@  int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id)
 {
-	struct pmd_queue_cfg *queue_cfg;
+	const union queue qdata = {.portid = port_id, .qid = queue_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
 	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
@@ -306,24 +494,40 @@  rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	queue_cfg = &port_cfg[port_id][queue_id];
+	lcore_cfg = &lcore_cfgs[lcore_id];
 
-	if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+	/* check if other queues are stopped as well */
+	ret = cfg_queues_stopped(lcore_cfg);
+	if (ret != 1) {
+		/* error means invalid queue, 0 means queue wasn't stopped */
+		return ret < 0 ? -EINVAL : -EBUSY;
+	}
+
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
 		return -EINVAL;
 
-	/* stop any callbacks from progressing */
-	queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	queue_cfg = queue_list_take(lcore_cfg, &qdata);
+	if (queue_cfg == NULL)
+		return -ENOENT;
 
-	switch (queue_cfg->cb_mode) {
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (lcore_cfg->n_queues == 0)
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	switch (lcore_cfg->cb_mode) {
 	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
 	case RTE_POWER_MGMT_TYPE_PAUSE:
-		rte_eth_remove_rx_callback(port_id, queue_id,
-				queue_cfg->cur_cb);
+		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
 		break;
 	case RTE_POWER_MGMT_TYPE_SCALE:
 		rte_power_freq_max(lcore_id);
-		rte_eth_remove_rx_callback(port_id, queue_id,
-				queue_cfg->cur_cb);
+		rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb);
 		rte_power_exit(lcore_id);
 		break;
 	}
@@ -332,7 +536,18 @@  rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	 * ports before calling any of these API's, so we can assume that the
 	 * callbacks can be freed. we're intentionally casting away const-ness.
 	 */
-	rte_free((void *)queue_cfg->cur_cb);
+	rte_free((void *)queue_cfg->cb);
+	free(queue_cfg);
 
 	return 0;
 }
+
+RTE_INIT(rte_power_ethdev_pmgmt_init) {
+	size_t i;
+
+	/* initialize all tailqs */
+	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
+		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
+		TAILQ_INIT(&cfg->head);
+	}
+}