[v7,1/4] eal: add lcore poll busyness telemetry

Message ID 20220914092929.1159773-2-kevin.laatz@intel.com (mailing list archive)
State Rejected, archived
Delegated to: David Marchand
Series: Add lcore poll busyness telemetry

Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Kevin Laatz Sept. 14, 2022, 9:29 a.m. UTC
  From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK APIs
will poll for work (packets, completions, eventdev events, etc.). Empty
polls can be counted as "idle", while non-empty polls can be counted as
busy. To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles we've spent processing empty
bursts. The more empty bursts we encounter, the fewer cycles we spend in
the "busy" state, and the lower the reported core poll busyness.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:

- All major driver APIs:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid a performance impact from having lcore telemetry support, a global
variable is exported by EAL, and the call to the timestamping function is
wrapped in a macro, so that whenever telemetry is disabled, it only costs one
additional branch and no function calls are performed. It is disabled at
compile time by default.
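
In simplified form, the wrapper introduced in rte_lcore.h (shown in full in
the diff below) behaves as follows:

/* Simplified from the rte_lcore.h hunk below: while the feature is
 * compiled in but disabled at runtime, the cost is one atomic read
 * plus one branch; with the meson option off, it compiles to nothing. */
#ifdef RTE_LCORE_POLL_BUSYNESS
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {			\
	if (rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled))	\
		__rte_lcore_poll_busyness_timestamp(nb_rx);		\
} while (0)
#else
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
#endif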

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.
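
For example, once DPDK is built with the enable_lcore_poll_busyness meson
option, the values can be read through the new /eal/lcore/poll_busyness
telemetry endpoint, or directly via the API; a minimal sketch of the latter:

#include <stdio.h>
#include <rte_lcore.h>

/* Minimal sketch: print the poll busyness of every enabled lcore. */
static void
print_poll_busyness(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH(lcore_id) {
		int busyness = rte_lcore_poll_busyness(lcore_id);

		if (busyness < 0)
			/* -1: inactive; -EINVAL/-ENOTSUP: error */
			printf("lcore %u: no data (%d)\n",
					lcore_id, busyness);
		else
			printf("lcore %u: %d%% busy\n",
					lcore_id, busyness);
	}
}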

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v7:
  * Rename funcs, vars, files to include "poll" where missing.

v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                            |   1 +
 config/rte_config.h                           |   1 +
 lib/bbdev/rte_bbdev.h                         |  17 +-
 lib/compressdev/rte_compressdev.c             |   2 +
 lib/cryptodev/rte_cryptodev.h                 |   2 +
 lib/distributor/rte_distributor.c             |  21 +-
 lib/distributor/rte_distributor_single.c      |  14 +-
 lib/dmadev/rte_dmadev.h                       |  15 +-
 .../common/eal_common_lcore_poll_telemetry.c  | 303 ++++++++++++++++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   1 +
 lib/eal/include/rte_lcore.h                   |  85 ++++-
 lib/eal/linux/eal.c                           |   1 +
 lib/eal/meson.build                           |   3 +
 lib/eal/version.map                           |   7 +
 lib/ethdev/rte_ethdev.h                       |   2 +
 lib/eventdev/rte_eventdev.h                   |  10 +-
 lib/rawdev/rte_rawdev.c                       |   6 +-
 lib/regexdev/rte_regexdev.h                   |   5 +-
 lib/ring/rte_ring_elem_pvt.h                  |   1 +
 meson_options.txt                             |   2 +
 21 files changed, 475 insertions(+), 25 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c
  

Comments

Stephen Hemminger Sept. 14, 2022, 2:30 p.m. UTC | #1
On Wed, 14 Sep 2022 10:29:26 +0100
Kevin Laatz <kevin.laatz@intel.com> wrote:

> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +

For APIs, always prefer to use fixed-size types.
Is there any reason the poll_busyness values could be negative?
If not, please use unsigned types.
  
Kevin Laatz Sept. 16, 2022, 12:35 p.m. UTC | #2
On 14/09/2022 15:30, Stephen Hemminger wrote:
> On Wed, 14 Sep 2022 10:29:26 +0100
> Kevin Laatz <kevin.laatz@intel.com> wrote:
>
>> +struct lcore_poll_telemetry {
>> +	int poll_busyness;
>> +	/**< Calculated poll busyness (gets set/returned by the API) */
>> +	int raw_poll_busyness;
>> +	/**< Calculated poll busyness times 100. */
>> +	uint64_t interval_ts;
>> +	/**< when previous telemetry interval started */
>> +	uint64_t empty_cycles;
>> +	/**< empty cycle count since last interval */
>> +	uint64_t last_poll_ts;
>> +	/**< last poll timestamp */
>> +	bool last_empty;
>> +	/**< if last poll was empty */
>> +	unsigned int contig_poll_cnt;
>> +	/**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +
> For APIs, always prefer to use fixed-size types.
> Is there any reason the poll_busyness values could be negative?
> If not, please use unsigned types.

We use -1 to indicate the core is "inactive" or a "non-polling" core.

These are cores that have either a) never called the timestamp macro, or 
b) haven't called the timestamp macro for some time and have therefore 
been marked as "inactive" until they next call the timestamp macro.
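
So callers should treat negative values as "no data" rather than as a
busyness reading, e.g. (handle_busyness_pct() is a placeholder):

	int busyness = rte_lcore_poll_busyness(lcore_id);

	if (busyness >= 0)
		handle_busyness_pct(busyness);	/* valid reading, 0..100 */
	/* else: inactive/non-polling core (or error), no reading to use */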
  
Konstantin Ananyev Sept. 19, 2022, 10:19 a.m. UTC | #3
Hi everyone,

> 
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.
> 
> The poll busyness is calculated by relying on the fact that most DPDK APIs
> will poll for work (packets, completions, eventdev events, etc.). Empty
> polls can be counted as "idle", while non-empty polls can be counted as
> busy. To measure lcore poll busyness, we simply call the telemetry
> timestamping function with the number of polls a particular code section
> has processed, and count the number of cycles we've spent processing empty
> bursts. The more empty bursts we encounter, the fewer cycles we spend in
> the "busy" state, and the lower the reported core poll busyness.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore poll busyness timestamping calls:
> 
> - All major driver APIs:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
> 
> To avoid a performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and the call to the timestamping function is
> wrapped in a macro, so that whenever telemetry is disabled, it only costs one
> additional branch and no function calls are performed. It is disabled at
> compile time by default.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.

As was already mentioned by other reviewers, it would be much better
to let the application itself decide when it is idle and when it is busy.
With the current approach, even for a constant-polling run-to-completion model there
are plenty of opportunities to get things wrong and provide misleading statistics.
My special concern - inserting it into the ring dequeue code.
Ring is used for various different things, not only to pass packets between threads (mempool, etc.).
Blindly assuming that an empty ring dequeue means idle cycles seems wrong to me.
Which makes me wonder: should we really hard-code these calls into DPDK core functions?
If you still want to introduce such stats, it might be better to implement them via a callback mechanism.
As I remember, nearly all our drivers (net, crypto, etc.) do support it.
That way our generic code will remain unaffected, plus the user will have the ability to enable/disable
it on a per-device basis.
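
E.g. for ethdev, the existing Rx callback mechanism would already be enough
to count empty vs non-empty polls per queue without touching the generic
code; a rough sketch (struct poll_stats is invented for illustration):

#include <rte_ethdev.h>

struct poll_stats {
	uint64_t empty_polls;
	uint64_t nonempty_polls;
};

static uint16_t
rx_poll_stats_cb(uint16_t port_id __rte_unused,
		uint16_t queue_id __rte_unused,
		struct rte_mbuf *pkts[] __rte_unused, uint16_t nb_pkts,
		uint16_t max_pkts __rte_unused, void *user_param)
{
	struct poll_stats *st = user_param;

	/* empty poll vs non-empty poll, counted per registered queue */
	if (nb_pkts == 0)
		st->empty_polls++;
	else
		st->nonempty_polls++;
	return nb_pkts;
}

/* at init time, for each device/queue the user wants monitored:
 * rte_eth_add_rx_callback(port_id, queue_id, rx_poll_stats_cb, &stats);
 */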
 
Konstantin

> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> v7:
>   * Rename funcs, vars, files to include "poll" where missing.
> 
> v5:
>   * Fix Windows build
>   * Make lcore_telemetry_free() an internal interface
>   * Minor cleanup
> 
> v4:
>   * Fix doc build
>   * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
>   * Make enable/disable read and write atomic
>   * Change rte_lcore_poll_busyness_enabled_set() param to bool
>   * Move mem alloc from enable/disable to init/cleanup
>   * Other minor fixes
> 
> v3:
>   * Fix missed renaming to poll busyness
>   * Fix clang compilation
>   * Fix arm compilation
> 
> v2:
>   * Use rte_get_tsc_hz() to adjust the telemetry period
>   * Rename to reflect polling busyness vs general busyness
>   * Fix segfault when calling telemetry timestamp from an unregistered
>     non-EAL thread.
>   * Minor cleanup
> ---
>  config/meson.build                            |   1 +
>  config/rte_config.h                           |   1 +
>  lib/bbdev/rte_bbdev.h                         |  17 +-
>  lib/compressdev/rte_compressdev.c             |   2 +
>  lib/cryptodev/rte_cryptodev.h                 |   2 +
>  lib/distributor/rte_distributor.c             |  21 +-
>  lib/distributor/rte_distributor_single.c      |  14 +-
>  lib/dmadev/rte_dmadev.h                       |  15 +-
>  .../common/eal_common_lcore_poll_telemetry.c  | 303 ++++++++++++++++++
>  lib/eal/common/meson.build                    |   1 +
>  lib/eal/freebsd/eal.c                         |   1 +
>  lib/eal/include/rte_lcore.h                   |  85 ++++-
>  lib/eal/linux/eal.c                           |   1 +
>  lib/eal/meson.build                           |   3 +
>  lib/eal/version.map                           |   7 +
>  lib/ethdev/rte_ethdev.h                       |   2 +
>  lib/eventdev/rte_eventdev.h                   |  10 +-
>  lib/rawdev/rte_rawdev.c                       |   6 +-
>  lib/regexdev/rte_regexdev.h                   |   5 +-
>  lib/ring/rte_ring_elem_pvt.h                  |   1 +
>  meson_options.txt                             |   2 +
>  21 files changed, 475 insertions(+), 25 deletions(-)
>  create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>  dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>  dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>  dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>  # values which have defaults which may be overridden
>  dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>  dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index ae56a86394..86ac3b8a6e 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>  #define RTE_BACKTRACE 1
>  #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
> 
>  /* bsd module defines */
>  #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6a98d3f11 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>  #include <stdbool.h>
> 
>  #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
> 
>  #include "rte_bbdev_op.h"
> 
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
> 
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..fabc495a8e 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  	nb_ops = (*dev->dequeue_burst)
>  			(dev->data->queue_pairs[qp_id], ops, nb_ops);
> 
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +
>  	return nb_ops;
>  }
> 
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..a5b1d7c594 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>  	}
>  #endif
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
>  	return nb_ops;
>  }
> 
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..428157ec64 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
> 
>  		while (rte_rdtsc() < t)
>  			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>  	}
> 
>  	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
> 
>  	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>  		if (return_count <= 1) {
> +			uint16_t cnt;
>  			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>  	}
> 
>  	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
> 
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>  		uint64_t t = rte_rdtsc() + 100;
> 
>  		while (rte_rdtsc() < t)
>  			rte_pause();
> 
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>  	}
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
>  	return count;
>  }
> 
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..4c916c0fd2 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>  	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>  	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>  			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) != 0) {
> +		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
> 
>  	/* Sync with distributor on GET_BUF flag. */
>  	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>  {
>  	struct rte_mbuf *ret;
>  	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>  		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>  	return ret;
>  }
> 
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..3e27e0fd2b 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>  #include <rte_bitops.h>
>  #include <rte_common.h>
>  #include <rte_compat.h>
> +#include <rte_lcore.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>  		  uint16_t *last_idx, bool *has_error)
>  {
>  	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>  	bool err;
> 
>  #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>  		has_error = &err;
> 
>  	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>  			 enum rte_dma_status_code *status)
>  {
>  	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
> 
>  #ifdef RTE_DMADEV_DEBUG
>  	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>  	if (last_idx == NULL)
>  		last_idx = &idx;
> 
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>  					last_idx, status);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> new file mode 100644
> index 0000000000..d97996e85f
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> @@ -0,0 +1,303 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_poll_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
> +	/* if more than 1000 busyness periods have passed, this core is considered inactive */
> +	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
> +	struct lcore_poll_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable)
> +{
> +	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
> +			(int)!enable, (int)enable);
> +
> +	/* Reset counters on successful disable */
> +	if (set && !enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * We don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. Therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. This allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_poll_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE) {
> +		RTE_LOG(DEBUG, EAL,
> +				"Lcore telemetry not supported on unregistered non-EAL thread %d",
> +				lcore_id);
> +		return;
> +	}
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(true);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(false);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +eal_lcore_poll_telemetry_free(void)
> +{
> +	if (telemetry_data != NULL) {
> +		free(telemetry_data);
> +		telemetry_data = NULL;
> +	}
> +}
> +
> +RTE_INIT(lcore_init_poll_telemetry)
> +{
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +
> +	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +void eal_lcore_poll_telemetry_free(void)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..e5741ce9f9 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>          'eal_common_hexdump.c',
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
> +        'eal_common_lcore_poll_telemetry.c',
>          'eal_common_lcore.c',
>          'eal_common_log.c',
>          'eal_common_mcfg.c',
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index 26fbc91b26..92c4af9c28 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -895,6 +895,7 @@ rte_eal_cleanup(void)
>  	rte_mp_channel_cleanup();
>  	rte_trace_save();
>  	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>  	/* after this point, any DPDK pointers will become dangling */
>  	rte_eal_memory_detach();
>  	rte_eal_alarm_cleanup();
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..2191c2473a 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -16,6 +16,7 @@
>  #include <rte_eal.h>
>  #include <rte_launch.h>
>  #include <rte_thread.h>
> +#include <rte_atomic.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -415,9 +416,91 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>  		const pthread_attr_t *attr,
>  		void *(*start_routine)(void *), void *arg);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - true if lcore telemetry is enabled
> + *   - false if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(bool enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore poll busyness timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +/** @internal free memory allocated for lcore telemetry */
> +void
> +eal_lcore_poll_telemetry_free(void);
> +
> +/**
> + * Call lcore poll busyness timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
> +	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
> +	if (enabled)								\
> +		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
> +} while (0)
> +#else
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> 
> -
>  #endif /* _RTE_LCORE_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 37d29643a5..5e81352a81 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
>  	rte_mp_channel_cleanup();
>  	rte_trace_save();
>  	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>  	/* after this point, any DPDK pointers will become dangling */
>  	rte_eal_memory_detach();
>  	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>  deps += ['kvargs']
>  if not is_windows
>      deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>  endif
>  if dpdk_conf.has('RTE_USE_LIBBSD')
>      ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..3275d1fac4 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>  	rte_thread_self;
>  	rte_thread_set_affinity_by_id;
>  	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_poll_busyness_timestamp;
> +	__rte_lcore_poll_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>  };
> 
>  INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..4c8113f31f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>  #endif
> 
>  	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
>  	return nb_rx;
>  }
> 
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a65b3c7c85 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>  			uint16_t nb_events, uint64_t timeout_ticks)
>  {
>  	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>  	void *port;
> 
>  	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>  	 * requests nb_events as const one
>  	 */
>  	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>  	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
> +	return nb_evts;
>  }
> 
>  #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..1cba53270a 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>  #include <rte_common.h>
>  #include <rte_malloc.h>
>  #include <rte_telemetry.h>
> +#include <rte_lcore.h>
> 
>  #include "rte_rawdev.h"
>  #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>  			   rte_rawdev_obj_t context)
>  {
>  	struct rte_rawdev *dev;
> +	int nb_ops;
> 
>  	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>  	dev = &rte_rawdevs[dev_id];
> 
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..8caaed502f 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  			   struct rte_regex_ops **ops, uint16_t nb_ops)
>  {
>  	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>  #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>  	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  		return -EINVAL;
>  	}
>  #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
> +	return deq_ops;
>  }
> 
>  #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..cf2370c238 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>  end:
>  	if (available != NULL)
>  		*available = entries - n;
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
>  	return n;
>  }
> 
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..9b20a36fdb 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>         'Install headers to build drivers.')
>  option('enable_kmods', type: 'boolean', value: false, description:
>         'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
> +       'enable collection of lcore poll busyness telemetry')
>  option('examples', type: 'string', value: '', description:
>         'Comma-separated list of examples to build by default')
>  option('flexran_sdk', type: 'string', value: '', description:
> --
> 2.31.1
  
Kevin Laatz Sept. 22, 2022, 5:14 p.m. UTC | #4
On 19/09/2022 11:19, Konstantin Ananyev wrote:
> Hi everyone,
>
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> Currently, there is no way to measure lcore poll busyness in a passive way,
>> without any modifications to the application. This patch adds a new EAL API
>> that will be able to passively track core polling busyness.
>>
>> The poll busyness is calculated by relying on the fact that most DPDK APIs
>> will poll for work (packets, completions, eventdev events, etc.). Empty
>> polls can be counted as "idle", while non-empty polls can be counted as
>> busy. To measure lcore poll busyness, we simply call the telemetry
>> timestamping function with the number of polls a particular code section
>> has processed, and count the number of cycles we've spent processing empty
>> bursts. The more empty bursts we encounter, the fewer cycles we spend in
>> the "busy" state, and the lower the reported core poll busyness.
>>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to the
>> lcore telemetry busyness timestamping function. The following parts of DPDK
>> are instrumented with lcore poll busyness timestamping calls:
>>
>> - All major driver APIs:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>>
>> To avoid a performance impact from having lcore telemetry support, a global
>> variable is exported by EAL, and the call to the timestamping function is
>> wrapped in a macro, so that whenever telemetry is disabled, it only costs one
>> additional branch and no function calls are performed. It is disabled at
>> compile time by default.
>>
>> This patch also adds a telemetry endpoint to report lcore poll busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry. A
>> documentation entry has been added to the howto guides to explain the usage
>> of the new telemetry endpoints and API.
> As was already mentioned by other reviewers, it would be much better
> to let the application itself decide when it is idle and when it is busy.
> With the current approach, even for a constant-polling run-to-completion model there
> are plenty of opportunities to get things wrong and provide misleading statistics.
> My special concern - inserting it into the ring dequeue code.
> Ring is used for various different things, not only to pass packets between threads (mempool, etc.).
> Blindly assuming that an empty ring dequeue means idle cycles seems wrong to me.
> Which makes me wonder: should we really hard-code these calls into DPDK core functions?
> If you still want to introduce such stats, it might be better to implement them via a callback mechanism.
> As I remember, nearly all our drivers (net, crypto, etc.) do support it.
> That way our generic code will remain unaffected, plus the user will have the ability to enable/disable
> it on a per-device basis.

Thanks for your feedback, Konstantin.

You are right in saying that this approach won't be 100% suitable for 
all use-cases, but should be suitable for the majority of applications. 
It's worth keeping in mind that this feature is compile-time disabled by 
default, so there is no impact to any application/user that does not 
wish to use this, for example applications where this type of busyness 
is not useful, or for applications that already use other mechanisms to 
report similar telemetry. However, the upside for applications that do 
wish to use this is that there are no code changes required (for the 
most part); the feature simply needs to be enabled at compile time via 
the meson option.

In scenarios where contextual awareness of the application is needed in 
order to report more accurate "busyness", the 
"RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark 
sections of code as "busy" or "idle". This way, the application can 
assume control of determining the poll busyness of its lcores while 
leveraging the telemetry hooks added in this patchset.
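
A minimal sketch of such manual marking (do_work() and its return value are
placeholders for application logic):

	for (;;) {
		unsigned int n = do_work();	/* units of work completed */

		/* n == 0 marks the elapsed cycles as idle,
		 * n > 0 marks them as busy */
		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
	}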

We did initially consider implementing this via callbacks; however, we 
found this approach to have two main drawbacks:
1. Application changes are required for all applications wanting to 
report this telemetry - rather than the majority getting it for free.
2. Ring does not have callback support, meaning pipelined applications 
could not report lcore poll busyness telemetry with this approach. 
Eventdev is another driver which would be completely missed with this 
approach.

BR,
Kevin
  
Konstantin Ananyev Sept. 26, 2022, 9:37 a.m. UTC | #5
Hi Kevin,

> >> Currently, there is no way to measure lcore poll busyness in a passive way,
> >> without any modifications to the application. This patch adds a new EAL API
> >> that will be able to passively track core polling busyness.
> >>
> >> The poll busyness is calculated by relying on the fact that most DPDK APIs
> >> will poll for work (packets, completions, eventdev events, etc.). Empty
> >> polls can be counted as "idle", while non-empty polls can be counted as
> >> busy. To measure lcore poll busyness, we simply call the telemetry
> >> timestamping function with the number of polls a particular code section
> >> has processed, and count the number of cycles we've spent processing empty
> >> bursts. The more empty bursts we encounter, the fewer cycles we spend in
> >> the "busy" state, and the lower the reported core poll busyness.
> >>
> >> In order for all of the above to work without modifications to the
> >> application, the library code needs to be instrumented with calls to the
> >> lcore telemetry busyness timestamping function. The following parts of DPDK
> >> are instrumented with lcore poll busyness timestamping calls:
> >>
> >> - All major driver APIs:
> >>    - ethdev
> >>    - cryptodev
> >>    - compressdev
> >>    - regexdev
> >>    - bbdev
> >>    - rawdev
> >>    - eventdev
> >>    - dmadev
> >> - Some additional libraries:
> >>    - ring
> >>    - distributor
> >>
> >> To avoid a performance impact from having lcore telemetry support, a global
> >> variable is exported by EAL, and the call to the timestamping function is
> >> wrapped in a macro, so that whenever telemetry is disabled, it only costs one
> >> additional branch and no function calls are performed. It is disabled at
> >> compile time by default.
> >>
> >> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> >> well as telemetry endpoints to enable/disable lcore telemetry. A
> >> documentation entry has been added to the howto guides to explain the usage
> >> of the new telemetry endpoints and API.
> > As was already mentioned by other reviewers, it would be much better
> > to let the application itself decide when it is idle and when it is busy.
> > With the current approach, even for a constant-polling run-to-completion model there
> > are plenty of opportunities to get things wrong and provide misleading statistics.
> > My special concern - inserting it into the ring dequeue code.
> > Ring is used for various different things, not only to pass packets between threads (mempool, etc.).
> > Blindly assuming that an empty ring dequeue means idle cycles seems wrong to me.
> > Which makes me wonder: should we really hard-code these calls into DPDK core functions?
> > If you still want to introduce such stats, it might be better to implement them via a callback mechanism.
> > As I remember, nearly all our drivers (net, crypto, etc.) do support it.
> > That way our generic code will remain unaffected, plus the user will have the ability to enable/disable
> > it on a per-device basis.
> 
> Thanks for your feedback, Konstantin.
> 
> You are right in saying that this approach won't be 100% suitable for
> all use-cases, but should be suitable for the majority of applications.

First of all - could you explain how you measured what the 'majority' of DPDK applications is?
And how did you conclude that it definitely works for all the apps in that 'majority'?
Second, what bothers me with that approach - I don't see a clear and deterministic way
for the user to understand whether these stats would work properly for his app or not
(except manually analyzing his app code).

> It's worth keeping in mind that this feature is compile-time disabled by
> default, so there is no impact to any application/user that does not
> wish to use this, for example applications where this type of busyness
> is not useful, or for applications that already use other mechanisms to
> report similar telemetry.

Not sure that adding a new compile-time option disabled by default is a good thing...
For me it would be much more preferable if we went through a more 'standard' way here:
a) define a clear API to enable/disable/collect/report such type of stats.
b) use some of our sample apps to demonstrate how to use it properly with user-specific code.
c) if needed, implement some 'silent' stats collection for a limited scope of apps via callbacks -
let's say for run-to-completion apps that use ether and crypto devs only.

> However, the upside for applications that do
> wish to use this is that there are no code changes required (for the
> most part); the feature simply needs to be enabled at compile time via
> the meson option.
> 
> In scenarios where contextual awareness of the application is needed in
> order to report more accurate "busyness", the
> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
> sections of code as "busy" or "idle". This way, the application can
> assume control of determining the poll busyness of its lcores while
> leveraging the telemetry hooks added in this patchset.
> 
> We did initially consider implementing this via callbacks; however, we
> found this approach to have two main drawbacks:
> 1. Application changes are required for all applications wanting to
> report this telemetry - rather than the majority getting it for free.

Didn't get it - why would the callbacks approach require user-app changes?
In other situations (rte_power callbacks, pdump, etc.) it works transparently to
user-level code.
Why can't it be done here in a similar way?

> 2. Ring does not have callback support, meaning pipelined applications
> could not report lcore poll busyness telemetry with this approach.

That's another big concern that I have:
Why do you consider that all rings will be used for pipelines between threads and should
always be accounted for by your stats?
They could be used for dozens of different purposes.
What if that ring is used for a mempool, and ring_dequeue() just means we try to allocate
an object from the pool? In such a case, why should failing to allocate an object mean
the start of a new 'idle cycle'?
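
To illustrate (a sketch, with mp standing for some existing mempool):

	void *obj;

	/* a miss in the per-lcore cache falls through to the backing ring
	 * dequeue, which this patch instruments - so with the patch
	 * applied, the failed allocation below gets recorded as the start
	 * of idle time */
	if (rte_mempool_get(mp, &obj) < 0) {
		/* allocation failure, yet it was just counted as "idle" */
	}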

> Eventdev is another driver which would be completely missed with this
> approach.

Ok, I see two ways here:
- implement CB support for eventdev.
- meanwhile, clearly document that these stats are not supported for eventdev scenarios (yet).

> 
>
  
Kevin Laatz Sept. 29, 2022, 12:41 p.m. UTC | #6
On 26/09/2022 10:37, Konstantin Ananyev wrote:
> Hi Kevin,
>
>>>> Currently, there is no way to measure lcore poll busyness in a passive way,
>>>> without any modifications to the application. This patch adds a new EAL API
>>>> that will be able to passively track core polling busyness.
>>>>
>>>> The poll busyness is calculated by relying on the fact that most DPDK APIs
>>>> will poll for work (packets, completions, eventdev events, etc.). Empty
>>>> polls can be counted as "idle", while non-empty polls can be counted as
>>>> busy. To measure lcore poll busyness, we simply call the telemetry
>>>> timestamping function with the number of polls a particular code section
>>>> has processed, and count the number of cycles we've spent processing empty
>>>> bursts. The more empty bursts we encounter, the fewer cycles we spend in
>>>> the "busy" state, and the lower the reported core poll busyness.
>>>>
>>>> In order for all of the above to work without modifications to the
>>>> application, the library code needs to be instrumented with calls to the
>>>> lcore telemetry busyness timestamping function. The following parts of DPDK
>>>> are instrumented with lcore poll busyness timestamping calls:
>>>>
>>>> - All major driver APIs:
>>>>     - ethdev
>>>>     - cryptodev
>>>>     - compressdev
>>>>     - regexdev
>>>>     - bbdev
>>>>     - rawdev
>>>>     - eventdev
>>>>     - dmadev
>>>> - Some additional libraries:
>>>>     - ring
>>>>     - distributor
>>>>
>>>> To avoid a performance impact from having lcore telemetry support, a global
>>>> variable is exported by EAL, and the call to the timestamping function is
>>>> wrapped in a macro, so that whenever telemetry is disabled, it only costs one
>>>> additional branch and no function calls are performed. It is disabled at
>>>> compile time by default.
>>>>
>>>> This patch also adds a telemetry endpoint to report lcore poll busyness, as
>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>> documentation entry has been added to the howto guides to explain the usage
>>>> of the new telemetry endpoints and API.
>>> As was already mentioned by other reviewers, it would be much better
>>> to let the application itself decide when it is idle and when it is busy.
>>> With the current approach, even for a constant-polling run-to-completion model there
>>> are plenty of opportunities to get things wrong and provide misleading statistics.
>>> My special concern - inserting it into the ring dequeue code.
>>> Ring is used for various different things, not only to pass packets between threads (mempool, etc.).
>>> Blindly assuming that an empty ring dequeue means idle cycles seems wrong to me.
>>> Which makes me wonder: should we really hard-code these calls into DPDK core functions?
>>> If you still want to introduce such stats, it might be better to implement them via a callback mechanism.
>>> As I remember, nearly all our drivers (net, crypto, etc.) do support it.
>>> That way our generic code will remain unaffected, plus the user will have the ability to enable/disable
>>> it on a per-device basis.
>> Thanks for your feedback, Konstantin.
>>
>> You are right in saying that this approach won't be 100% suitable for
>> all use-cases, but should be suitable for the majority of applications.
> First of all - could you explain how you measured what the 'majority' of DPDK applications is?
> And how did you conclude that it definitely works for all the apps in that 'majority'?
> Second, what bothers me with that approach - I don't see a clear and deterministic way
> for the user to understand whether these stats would work properly for his app or not
> (except manually analyzing his app code).

All of the DPDK example applications we've tested with (l2fwd, l3fwd +
friends, testpmd, distributor, dmafwd) report lcore poll busyness and
respond to changing traffic rates etc. We've also compared the reported
busyness to similar metrics reported by other projects such as VPP and
OvS, and found the reported busyness matches with a difference of +/- 1%.
In addition to the DPDK example applications, we have shared our plans
with end customers and they have confirmed that the design should work
with their applications.

>> It's worth keeping in mind that this feature is compile-time disabled by
>> default, so there is no impact to any application/user that does not
>> wish to use this, for example applications where this type of busyness
>> is not useful, or for applications that already use other mechanisms to
>> report similar telemetry.
> Not sure that adding a new compile-time option disabled by default is a good thing...
> For me it would be much more preferable if we went through a more 'standard' way here:
> a) define a clear API to enable/disable/collect/report such type of stats.
> b) use some of our sample apps to demonstrate how to use it properly with user-specific code.
> c) if needed, implement some 'silent' stats collection for a limited scope of apps via callbacks -
> let's say for run-to-completion apps that use ether and crypto devs only.

With the compile-time option, it's just one build flag for lots of applications to silently benefit from this.

>> However, the upside for applications that do
>> wish to use this is that there are no code changes required (for the
>> most part); the feature simply needs to be enabled at compile time via
>> the meson option.
>>
>> In scenarios where contextual awareness of the application is needed in
>> order to report more accurate "busyness", the
>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>> sections of code as "busy" or "idle". This way, the application can
>> assume control of determining the poll busyness of its lcores while
>> leveraging the telemetry hooks added in this patchset.
>>
>> We did initially consider implementing this via callbacks; however, we
>> found this approach to have two main drawbacks:
>> 1. Application changes are required for all applications wanting to
>> report this telemetry - rather than the majority getting it for free.
> Didn't get it - why would the callbacks approach require user-app changes?
> In other situations (rte_power callbacks, pdump, etc.) it works transparently to
> user-level code.
> Why can't it be done here in a similar way?

From my understanding, the callbacks would need to be registered by the application at the very least (and the callback would have to be registered per device/pmd/lib).

>
>> 2. Ring does not have callback support, meaning pipelined applications
>> could not report lcore poll busyness telemetry with this approach.
> That's another big concern that I have:
> Why do you consider that all rings will be used for pipelines between threads and should
> always be accounted for by your stats?
> They could be used for dozens of different purposes.
> What if that ring is used for a mempool, and ring_dequeue() just means we try to allocate
> an object from the pool? In such a case, why should failing to allocate an object mean
> the start of a new 'idle cycle'?

Another approach could be taken here if the mempool interactions are of concern.

From our understanding, mempool operations use the "_bulk" APIs, whereas polling operations use the "_burst" APIs. Would timestamping only the "_burst" APIs be better here? That way the mempool interactions won't be counted towards the busyness.
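
For illustration, a sketch of that split (the timestamp placement shown is the
proposal under discussion, not what the patch currently does):

	/* mempool-style exact-size dequeue: would no longer be timestamped */
	n = rte_ring_dequeue_bulk(r, objs, RTE_DIM(objs), NULL);

	/* polling-style up-to-size dequeue: the library would still invoke
	 * RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n) on this path */
	n = rte_ring_dequeue_burst(r, objs, RTE_DIM(objs), NULL);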

Including support for pipelined applications using rings is key for a number of use cases; this was highlighted as part of the customer feedback when we shared the design.

>
>> Eventdev is another driver which would be completely missed with this
>> approach.
> Ok, I see two ways here:
> - implement CB support for eventdev.
> - meanwhile, clearly document that these stats are not supported for eventdev scenarios (yet).
  
Jerin Jacob Sept. 30, 2022, 12:32 p.m. UTC | #7
On Thu, Sep 29, 2022 at 6:11 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
>
> >
> >> 2. Ring does not have callback support, meaning pipelined applications
> >> could not report lcore poll busyness telemetry with this approach.
> > That's another big concern that I have:
> > Why do you consider that all rings will be used for pipelines between threads and should
> > always be accounted for by your stats?
> > They could be used for dozens of different purposes.
> > What if that ring is used for a mempool, and ring_dequeue() just means we try to allocate
> > an object from the pool? In such a case, why should failing to allocate an object mean
> > the start of a new 'idle cycle'?
>
> Another approach could be taken here if the mempool interactions are of concern.


Another method to solve the problem would be to leverage an existing
trace framework and its existing fastpath tracepoints.
Lcore poll busyness could then be monitored by another
application by looking at the timestamps at which traces are emitted.
This also gives the flexibility to add customer- or application-specific
tracepoints as needed, and to control the enable/disable aspects of
tracepoints.
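
For example (an illustrative invocation, assuming a trace-enabled build;
the tracepoint name is the one registered for the ethdev Rx burst fastpath):

	dpdk-testpmd -l 0-3 -n 4 --trace=lib.ethdev.rx.burst
	# CTF trace output is written under ~/dpdk-traces/ by default
	# and can be inspected with babeltrace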

l2reflect addresses a similar problem (observing latency).

Use cases like the above (another application needs to observe and
analyse the code flow of a DPDK application) can be implemented
this way.

A similar suggestion was provided for l2reflect at
https://mails.dpdk.org/archives/dev/2022-September/250583.html

I would suggest taking this path to accommodate more use cases in future, like:
- finding CPU idle time
- latency from crypto/dmadev/eventdev enqueue to dequeue
- histogram of occupancy for different queues
etc.

This would translate to:
1) Adding an app/proc-info style app to pull the live trace from the primary process
2) Adding a plugin framework to operate on the live trace
3) Adding a plugin for this specific use case
4) If needed, communication from secondary to primary to take action
based on the live analysis -
like in this case, stopping the primary when latency exceeds a certain limit

On the plus side,
if we move all analysis and presentation to a new generic application,
your packet forwarding
logic can simply move to a new fwd_engine in testpmd (see
app/test-pmd/noisy_vnf.c as an example of a fwd engine).

Ideally "eal: add lcore poll busyness telemetry"[1] could converge to
this model.

[1]
https://patches.dpdk.org/project/dpdk/patch/20220914092929.1159773-2-kevin.laatz@intel.com/

>
> From our understanding, mempool operations use the "_bulk" APIs, whereas polling operations use the "_burst" APIs. Would timestamping only the "_burst" APIs be better here? That way the mempool interactions won't be counted towards the busyness.
>
> Including support for pipelined applications using rings is key for a number of use cases; this was highlighted as part of the customer feedback when we shared the design.
>
> >
> >> Eventdev is another driver which would be completely missed with this
> >> approach.
> > Ok, I see two ways here:
> > - implement CB support for eventdev.
> > - meanwhile, clearly document that these stats are not supported for eventdev scenarios (yet).
  
Mattias Rönnblom Sept. 30, 2022, 10:13 p.m. UTC | #8
On 2022-09-14 11:29, Kevin Laatz wrote:
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.

I think it's more fair to say it "/../ attempts to track /../".

> 
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for work (packets, completions, eventdev events, etc). Empty
> polls can be counted as "idle", while non-empty polls can be counted as
> busy.

I think it would be more clear if it said something like "After an empty
poll, the calling EAL thread is considered idle /../". It's not the poll
operation itself that we care about, but the resulting processing.

> To measure lcore poll busyness, we simply call the telemetry
> timestamping function with the number of polls a particular code section
> has processed, and count the number of cycles we've spent processing empty
> bursts. The more empty bursts we encounter, the less cycles we spend in
> "busy" state, and the less core poll busyness will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore poll busyness timestamping calls:
> 
> - All major driver API's:
>    - ethdev
>    - cryptodev
>    - compressdev
>    - regexdev
>    - bbdev
>    - rawdev
>    - eventdev
>    - dmadev
> - Some additional libraries:
>    - ring
>    - distributor

Shouldn't the timer library also be in the list? It's a source of work.

> 
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one
> additional branch and no function calls are performed. It is disabled at
> compile time by default.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> [version history and diffstat snipped]
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>   dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>   dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>   dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>   # values which have defaults which may be overridden
>   dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>   dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index ae56a86394..86ac3b8a6e 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
>   
>   /* bsd module defines */
>   #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6a98d3f11 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>   #include <stdbool.h>
>   
>   #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_bbdev_op.h"
>   
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..fabc495a8e 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   	nb_ops = (*dev->dequeue_burst)
>   			(dev->data->queue_pairs[qp_id], ops, nb_ops);
>   
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +
>   	return nb_ops;
>   }
>   
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..a5b1d7c594 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>   	}
>   #endif
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
>   	return nb_ops;
>   }
>   
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..428157ec64 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>   	}
>   
>   	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
>   
>   	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>   		if (return_count <= 1) {
> +			uint16_t cnt;
>   			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>   	}
>   
>   	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
>   
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>   		uint64_t t = rte_rdtsc() + 100;
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
>   
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>   	}
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
>   	return count;
>   }
>   
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..4c916c0fd2 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>   	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>   	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>   			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) != 0) {
> +		rte_pause();
> +		/* this was an empty poll */

The idle period started before the pause.
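
One possible reading of this - a sketch of the reordering, not the author's
code:

	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
			& RTE_DISTRIB_FLAGS_MASK) != 0) {
		/* record the empty poll before stalling, so the wait is
		 * attributed to the idle state that already began */
		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
		rte_pause();
	}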

> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>   
>   	/* Sync with distributor on GET_BUF flag. */
>   	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>   {
>   	struct rte_mbuf *ret;
>   	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>   		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>   	return ret;
>   }
>   
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..3e27e0fd2b 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>   #include <rte_bitops.h>
>   #include <rte_common.h>
>   #include <rte_compat.h>
> +#include <rte_lcore.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		  uint16_t *last_idx, bool *has_error)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   	bool err;
>   
>   #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		has_error = &err;
>   
>   	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   			 enum rte_dma_status_code *status)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   
>   #ifdef RTE_DMADEV_DEBUG
>   	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   	if (last_idx == NULL)
>   		last_idx = &idx;
>   
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>   					last_idx, status);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> new file mode 100644
> index 0000000000..d97996e85f
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> @@ -0,0 +1,303 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_poll_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
> +	/* if more than 1000 busyness periods have passed, this core is considered inactive */
> +	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
> +	struct lcore_poll_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable)
> +{
> +	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
> +			(int)!enable, (int)enable);

Use GCC C11 atomics?
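
E.g., a sketch with the flag held in a plain uint32_t (names and the memory
ordering chosen here are assumptions, not part of the patch):

	uint32_t expected = !enable;
	__atomic_compare_exchange_n(&__rte_lcore_poll_telemetry_enabled,
			&expected, (uint32_t)enable, false,
			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);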

> +
> +	/* Reset counters on successful disable */
> +	if (set && !enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * We don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. Therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. This allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_poll_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE) {
> +		RTE_LOG(DEBUG, EAL,
> +				"Lcore telemetry not supported on unregistered non-EAL thread %d",
> +				lcore_id);
> +		return;
> +	}
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;

You probably want to report the number of busy cycles as well, so the
user can do her own averaging, without being forced to resort to
sampling.
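
E.g., exposing the raw counters next to the derived percentage (a sketch;
busy_cycles/total_cycles would be hypothetical new fields, while
rte_tel_data_add_dict_u64() is existing telemetry API):

	rte_tel_data_add_dict_u64(d, "busy_cycles", tdata->busy_cycles);
	rte_tel_data_add_dict_u64(d, "total_cycles", tdata->total_cycles);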

> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(true);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(false);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +eal_lcore_poll_telemetry_free(void)
> +{
> +	if (telemetry_data != NULL) {
> +		free(telemetry_data);
> +		telemetry_data = NULL;
> +	}
> +}
> +
> +RTE_INIT(lcore_init_poll_telemetry)
> +{
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +
> +	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +void eal_lcore_poll_telemetry_free(void)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..e5741ce9f9 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>           'eal_common_hexdump.c',
>           'eal_common_interrupts.c',
>           'eal_common_launch.c',
> +        'eal_common_lcore_poll_telemetry.c',
>           'eal_common_lcore.c',
>           'eal_common_log.c',
>           'eal_common_mcfg.c',
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index 26fbc91b26..92c4af9c28 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -895,6 +895,7 @@ rte_eal_cleanup(void)
>   	rte_mp_channel_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	rte_eal_alarm_cleanup();
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..2191c2473a 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -16,6 +16,7 @@
>   #include <rte_eal.h>
>   #include <rte_launch.h>
>   #include <rte_thread.h>
> +#include <rte_atomic.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -415,9 +416,91 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>   		const pthread_attr_t *attr,
>   		void *(*start_routine)(void *), void *arg);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - true if lcore telemetry is enabled
> + *   - false if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(bool enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore poll busyness timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +/** @internal free memory allocated for lcore telemetry */
> +void
> +eal_lcore_poll_telemetry_free(void);
> +
> +/**
> + * Call lcore poll busyness timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
> +	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
> +	if (enabled)								\
> +		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
> +} while (0)
> +#else
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
> +#endif
> +
>   #ifdef __cplusplus
>   }
>   #endif
>   
> -
>   #endif /* _RTE_LCORE_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 37d29643a5..5e81352a81 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
>   	rte_mp_channel_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>   deps += ['kvargs']
>   if not is_windows
>       deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>   endif
>   if dpdk_conf.has('RTE_USE_LIBBSD')
>       ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..3275d1fac4 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>   	rte_thread_self;
>   	rte_thread_set_affinity_by_id;
>   	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_poll_busyness_timestamp;
> +	__rte_lcore_poll_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>   };
>   
>   INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..4c8113f31f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>   #endif
>   
>   	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
>   	return nb_rx;
>   }
>   
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a65b3c7c85 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   			uint16_t nb_events, uint64_t timeout_ticks)
>   {
>   	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>   	void *port;
>   
>   	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   	 * requests nb_events as const one
>   	 */
>   	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>   	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
> +	return nb_evts;
>   }
>   

For software event devices, like SW or DSW, the timestamp macro will be 
invoked at least twice per dequeue call. This is a slight slow-down. Not 
a huge issue.

However, if the event device uses more than one ring per eventdev 
port, which it may well do (e.g., to handle events of different priority, 
or events from different sources), the first ring poll may well be 
empty, and always so, thus causing a state transition from busy to idle, 
and back again, *for every dequeue call*. That would definitely show up 
as a noticeable performance degradation, given the rdtsc and the other 
instructions, including a mul, in that code path - taken twice.
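
A sketch of that pattern (hypothetical ring names, standing in for an event
device's internal per-port rings):

	/* priority ring, typically empty: the instrumented dequeue records
	 * an empty poll -> busy-to-idle transition */
	n = rte_ring_dequeue_burst(hi_prio, (void **)evs, max, NULL);
	if (n == 0)
		/* the second ring holds the work: non-empty poll ->
		 * idle-to-busy transition, on every dequeue call */
		n = rte_ring_dequeue_burst(lo_prio, (void **)evs, max, NULL);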

>   #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..1cba53270a 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>   #include <rte_common.h>
>   #include <rte_malloc.h>
>   #include <rte_telemetry.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_rawdev.h"
>   #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>   			   rte_rawdev_obj_t context)
>   {
>   	struct rte_rawdev *dev;
> +	int nb_ops;
>   
>   	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>   	dev = &rte_rawdevs[dev_id];
>   
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..8caaed502f 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   			   struct rte_regex_ops **ops, uint16_t nb_ops)
>   {
>   	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>   #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>   	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		return -EINVAL;
>   	}
>   #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
> +	return deq_ops;
>   }
>   
>   #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..cf2370c238 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>   end:
>   	if (available != NULL)
>   		*available = entries - n;
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
>   	return n;
>   }
>   
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..9b20a36fdb 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>          'Install headers to build drivers.')
>   option('enable_kmods', type: 'boolean', value: false, description:
>          'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
> +       'enable collection of lcore poll busyness telemetry')
>   option('examples', type: 'string', value: '', description:
>          'Comma-separated list of examples to build by default')
>   option('flexran_sdk', type: 'string', value: '', description:
  
Konstantin Ananyev Oct. 1, 2022, 2:17 p.m. UTC | #9
>> Hi Kevin,
>>
>>>>> [commit message snipped]
>>>> As was already mentioned by other reviewers, it would be much better
>>>> to let the application itself decide when it is idle and when it is busy.
>>>> With the current approach, even for the constant-polling run-to-completion
>>>> model there
>>>> are plenty of opportunities to get things wrong and provide
>>>> misleading statistics.
>>>> My special concern - inserting it into the ring dequeue code.
>>>> Ring is used for various different things, not only to pass packets
>>>> between threads (mempool, etc.).
>>>> Blindly assuming that a ring dequeue returning empty means idle cycles
>>>> seems wrong to me.
>>>> Which makes me wonder: should we really hard-code these calls into
>>>> DPDK core functions?
>>>> If you still would like to introduce such stats, it might be better to
>>>> implement them via a callback mechanism.
>>>> As I remember, nearly all our drivers (net, crypto, etc.) do support it.
>>>> That way our generic code will remain unaffected, plus the user will
>>>> have the ability to enable/disable
>>>> it on a per-device basis.
>>> Thanks for your feedback, Konstantin.
>>>
>>> You are right in saying that this approach won't be 100% suitable for
>>> all use-cases, but should be suitable for the majority of applications.
>> First of all - could you explain how you measured what the
>> 'majority' of DPDK applications is?
>> And how did you conclude that it definitely works for all the apps in
>> that 'majority'?
>> Second, what bothers me with that approach - I don't see a clear and
>> deterministic way
>> for the user to understand whether these stats would work properly for his app
>> or not
>> (except manually analyzing his app code).
> 
> All of the DPDK example applications we've tested with (l2fwd, l3fwd + 
> friends, testpmd, distributor, dmafwd) report lcore poll busyness and 
> respond to changing traffic rates etc. We've also compared the reported 
> busyness to similar metrics reported by other projects such as VPP and 
> OvS, and found the reported busyness matches with a difference of +/- 
> 1%. In addition to the DPDK example applications, we have shared our 
> plans with end customers and they have confirmed that the design should 
> work with their applications.

I am sure l3fwd and testpmd should be ok; I am talking about
something more complicated/unusual.
Below are a few examples off the top of my head where I think your approach
will generate invalid stats; feel free to correct me if I am wrong.

1) App doing some sort of bonding itself, i.e.:

struct rte_mbuf *pkts[N*2];
k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);

/*process all packets from both ports at once */
if (n + k != 0)
    process_pkts(pkts, n + k);

Now, as I understand, if n==0, then all cycles spent
in process_pkts() will be accounted as idle.

2) App doing something similar to what pdump library does
(creates a copy of a packet and sends it somewhere).

n = rte_eth_rx_burst(p0, q0, &pkt, 1);
if (n != 0) {
   dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
   if (dup_pkt != NULL)
      process_dup_pkt(dup_pkt);
   process_pkt(pkt);
}

That relates to the ring discussion below:
if there are no mbufs in dup_mp, then ring_dequeue() will fail
and process_pkt() will be accounted as idle.

3) App dequeues from a ring in a bit of an unusual way:

/* idle spin loop */
while ((n = rte_ring_count(ring)) == 0)
   rte_pause();

n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
if (n != 0)
   process_pkts(pkts, n);

Here, we can end up accounting cycles spent in
the idle spin loop as busy.


4) Any thread that generates TX traffic on its own
(something like testpmd tx_only fwd mode)

5) Any thread that depends on both dequeue and enqueue:

n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
...

/* loop till all packets are sent out successfully */
while(rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
    rte_pause();

Now, if n > 0, all cycles spent in enqueue() will be accounted
as 'busy', though from my perspective they probably should
be considered as 'idle'.


Also I expect some problems when packet processing is done inside
rx callbacks, but that probably can be easily fixed.


> 
>>> It's worth keeping in mind that this feature is compile-time disabled by
>>> default, so there is no impact to any application/user that does not
>>> wish to use this, for example applications where this type of busyness
>>> is not useful, or for applications that already use other mechanisms to
>>> report similar telemetry.
>> Not sure that adding a new compile-time option disabled by default is
>> a good thing...
>> For me it would be much more preferable if we go through a more
>> 'standard' way here:
>> a) define a clear API to enable/disable/collect/report such type of stats.
>> b) use some of our sample apps to demonstrate how to use it properly
>> with user-specific code.
>> c) if needed, implement some 'silent' stats collection for a limited
>> scope of apps via callbacks -
>> let's say for run-to-completion apps that use ether and crypto devs
>> only.
> 
> With the compile-time option, it's just one build flag for lots of 
> applications to silently benefit from this.

There could be a lot of useful and helpful stats
that users would like to collect (idle/busy, processing latency, etc.).
But, if for each such case we hard-code new stats collection
into our fast data-path code, then very soon it will become
completely bloated and unmaintainable.
I think we need some generic approach for such extra stats collection.
Callbacks could be one way; Jerin in another mail suggested using
existing trace-point hooks - it might be worth exploring that further.

> 
>>> However, the upside for applications that do
>>> wish to use this is that there are no code changes required (for the
>>> most part), the feature simply needs to be enabled at compile-time via
>>> the meson option.
>>>
>>> In scenarios where contextual awareness of the application is needed in
>>> order to report more accurate "busyness", the
>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>> sections of code as "busy" or "idle". This way, the application can
>>> assume control of determining the poll busyness of its lcores while
>>> leveraging the telemetry hooks adding in this patchset.
>>>
>>> We did initially consider implementing this via callbacks, however we
>>> found this approach to have 2 main drawbacks:
>>> 1. Application changes are required for all applications wanting to
>>> report this telemetry - rather than the majority getting it for free.
>> Didn't get it - why would the callbacks approach require user-app changes?
>> In other situations - rte_power callbacks, pdump, etc. - it works
>> transparently to
>> user-level code.
>> Why can't it be done here in a similar way?
> 
> From my understanding, the callbacks would need to be registered by 
> application at the very least (and the callback would have to be 
> registered per device/pmd/lib).

Callbacks can be registered by the library itself.
AFAIK, the latency-stats, power and pdump libraries all use a similar
approach: the user calls something like xxx_stats_enable() and then the library
can iterate over all available devices and set up the necessary callbacks.
Same for xxx_stats_disable().
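
A sketch of how that could look for ethdev here (the enable function and the
callback are hypothetical, while RTE_ETH_FOREACH_DEV(), rte_eth_dev_info_get()
and rte_eth_add_rx_callback() are existing API):

	static uint16_t
	busyness_rx_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
			struct rte_mbuf *pkts[] __rte_unused, uint16_t nb_pkts,
			uint16_t max_pkts __rte_unused, void *arg __rte_unused)
	{
		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_pkts);
		return nb_pkts;
	}

	void lcore_poll_busyness_stats_enable(void) /* hypothetical */
	{
		uint16_t port_id;

		RTE_ETH_FOREACH_DEV(port_id) {
			struct rte_eth_dev_info info;
			uint16_t q;

			if (rte_eth_dev_info_get(port_id, &info) != 0)
				continue;
			/* one callback per configured Rx queue */
			for (q = 0; q < info.nb_rx_queues; q++)
				rte_eth_add_rx_callback(port_id, q,
						busyness_rx_cb, NULL);
		}
	}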

>>
>>> 2. Ring does not have callback support, meaning pipelined applications
>>> could not report lcore poll busyness telemetry with this approach.
>> That's another big concern that I have:
>> Why do you consider that all rings will be used for pipelines between
>> threads and should
>> always be accounted for by your stats?
>> They could be used for dozens of different purposes.
>> What if that ring is used for a mempool, and ring_dequeue() just means
>> we try to allocate
>> an object from the pool? In such a case, why should failing to allocate an
>> object mean
>> the start of a new 'idle cycle'?
> 
> Another approach could be taken here if the mempool interactions are of 
> concern.
> 
> From our understanding, mempool operations use the "_bulk" APIs,
> whereas polling operations use the "_burst" APIs. Would timestamping
> only the "_burst" APIs be better here? That way the mempool
> interactions won't be counted towards the busyness.

Well, it would help to solve one particular case,
but in general I still think it is incomplete and error-prone.
What if a pipeline app uses ring_count()/ring_dequeue_bulk(),
or even the ZC ring API?
What if an app uses something different from rte_ring to pass
packets between threads/processes?
As I said before, without some clues from the app, it is probably
not possible to collect such stats in a proper way.


> Including support for pipelined applications using rings is key for a 
> number of use cases; this was highlighted as part of the customer 
> feedback when we shared the design.
> 
>>
>>> Eventdev is another driver which would be completely missed with this
>>> approach.
>> Ok, I see two ways here:
>> - implement CB support for eventdev.
>> - meanwhile, clearly document that these stats are not supported for
>> eventdev scenarios (yet).
  
Mattias Rönnblom Oct. 3, 2022, 8:02 p.m. UTC | #10
On 2022-10-01 16:17, Konstantin Ananyev wrote:
> 
>>> Hi Kevin,
>>>
>>>>>> [commit message snipped]
>>>>> As was already mentioned by other reviewers, it would be much better
>>>>> to let the application itself decide when it is idle and when it is busy.
>>>>> With the current approach, even for the constant-polling run-to-completion
>>>>> model there
>>>>> are plenty of opportunities to get things wrong and provide
>>>>> misleading statistics.
>>>>> My special concern - inserting it into the ring dequeue code.
>>>>> Ring is used for various different things, not only to pass packets
>>>>> between threads (mempool, etc.).
>>>>> Blindly assuming that a ring dequeue returning empty means idle cycles
>>>>> seems wrong to me.
>>>>> Which makes me wonder: should we really hard-code these calls into
>>>>> DPDK core functions?
>>>>> If you still would like to introduce such stats, it might be better to
>>>>> implement them via a callback mechanism.
>>>>> As I remember, nearly all our drivers (net, crypto, etc.) do support it.
>>>>> That way our generic code will remain unaffected, plus the user will
>>>>> have the ability to enable/disable
>>>>> it on a per-device basis.
>>>> Thanks for your feedback, Konstantin.
>>>>
>>>> You are right in saying that this approach won't be 100% suitable for
>>>> all use-cases, but should be suitable for the majority of applications.
>>> First of all - could you explain how you measured what the
>>> 'majority' of DPDK applications is?
>>> And how did you conclude that it definitely works for all the apps in
>>> that 'majority'?
>>> Second, what bothers me with that approach - I don't see a clear and
>>> deterministic way
>>> for the user to understand whether these stats would work properly for his app
>>> or not
>>> (except manually analyzing his app code).
>>
>> All of the DPDK example applications we've tested with (l2fwd, l3fwd + 
>> friends, testpmd, distributor, dmafwd) report lcore poll busyness and 
>> respond to changing traffic rates etc. We've also compared the 
>> reported busyness to similar metrics reported by other projects such 
>> as VPP and OvS, and found the reported busyness matches with a 
>> difference of +/- 1%. In addition to the DPDK example applications, 
>> we have shared our plans with end customers and they have confirmed 
>> that the design should work with their applications.
> 
> I am sure l3fwd and testpmd should be ok; I am talking about
> something more complicated/unusual.
> Below are a few examples off the top of my head where I think your approach
> will generate invalid stats; feel free to correct me if I am wrong.
> 
> 1) App doing some sort of bonding itself, i.e.:
> 
> struct rte_mbuf *pkts[N*2];
> k = rte_eth_rx_burst(p0, q0, pkts, N);
> n = rte_eth_rx_burst(p1, q1, pkts + k, N);
> 
> /*process all packets from both ports at once */
> if (n + k != 0)
>     process_pkts(pkts, n + k);
> 
> Now, as I understand, if n==0, then all cycles spent
> in process_pkts() will be accounted as idle.
> 
> 2) App doing something similar to what pdump library does
> (creates a copy of a packet and sends it somewhere).
> 
> n = rte_eth_rx_burst(p0, q0, &pkt, 1);
> if (n != 0) {
>    dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
>    if (dup_pkt != NULL)
>       process_dup_pkt(dup_pkt);
>    process_pkt(pkt);
> }
> 
> That relates to the ring discussion below:
> if there are no mbufs in dup_mp, then ring_dequeue() will fail
> and process_pkt() will be accounted as idle.
> 
> 3) App dequeues from a ring in a bit of an unusual way:
> 
> /* idle spin loop */
> while ((n = rte_ring_count(ring)) == 0)
>    rte_pause();
> 
> n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
> if (n != 0)
>    process_pkts(pkts, n);
> 
> Here, we can end up accounting cycles spent in
> the idle spin loop as busy.
> 
> 
> 4) Any thread that generates TX traffic on its own
> (something like testpmd tx_only fwd mode)
> 
> 5) Any thread that depends on both dequeue and enqueue:
> 
> n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
> ...
> 
> /* loop till all packets are sent out successfully */
> while(rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
>     rte_pause();
> 
> Now, if n > 0, all cycles spent in enqueue() will be accounted
> as 'busy', though from my perspective they probably should
> be considered as 'idle'.
> 
> 
> Also I expect some problems when packet processing is done inside
> rx callbacks, but that probably can be easily fixed.
> 
> 
>>
>>>> It's worth keeping in mind that this feature is compile-time 
>>>> disabled by
>>>> default, so there is no impact to any application/user that does not
>>>> wish to use this, for example applications where this type of busyness
>>>> is not useful, or for applications that already use other mechanisms to
>>>> report similar telemetry.
>>> Not sure that adding in new compile-time option disabled by default 
>>> is a good thing...
>>> For me it would be much more preferable if we'll go through a more 
>>> 'standard' way here:
>>> a) define clear API to enable/disable/collect/report such type of stats.
>>> b) use some of our sample apps to demonstrate how to use it properly 
>>> with user-specific code.
>>> c) if needed, implement some 'silent' stats collection for a limited 
>>> scope of apps via callbacks -
>>> let's say for run-to-completion apps that use ether and crypto devs 
>>> only.
>>
>> With the compile-time option, it's just one build flag for lots of 
>> applications to silently benefit from this.
> 
> There could be a lot of useful and helpful stats
> that a user would like to collect (idle/busy, processing latency, etc.).
> But, if for each such case we hard-code new stats collection
> into our fast data-path code, then very soon it will become
> completely bloated and unmaintainable.
> I think we need some generic approach for such extra stats collection.
> Callbacks could be one way; Jerin in another mail suggested using 
> existing trace-point hooks - it might be worth exploring that further.
> 
>>
>>>   However, the upside for applications that do
>>>> wish to use this is that there are no code changes required (for the
>>>> most part), the feature simply needs to be enabled at compile-time via
>>>> the meson option.
>>>>
>>>> In scenarios where contextual awareness of the application is needed in
>>>> order to report more accurate "busyness", the
>>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>>> sections of code as "busy" or "idle". This way, the application can
>>>> assume control of determining the poll busyness of its lcores while
>>>> leveraging the telemetry hooks added in this patchset.
>>>>
>>>> We did initially consider implementing this via callbacks, however we
>>>> found this approach to have 2 main drawbacks:
>>>> 1. Application changes are required for all applications wanting to
>>>> report this telemetry - rather than the majority getting it for free.
>>> Didn't get it - why would the callbacks approach require user-app changes?
>>> In other situations - rte_power callbacks, pdump, etc. - it works 
>>> transparently to user-level code.
>>> Why can't it be done here in a similar way?
>>
>>  From my understanding, the callbacks would need to be registered by 
>> the application at the very least (and the callback would have to be 
>> registered per device/pmd/lib).
> 
> Callbacks can be registered by the library itself.
> AFAIK, the latency-stats, power and pdump libraries all use a similar 
> approach: the user calls something like xxx_stats_enable() and then the 
> library can iterate over all available devices and set up the necessary 
> callbacks; same for xxx_stats_disable().
> 
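As a rough sketch of that pattern (poll_stats_enable() and the callback
are hypothetical, but rte_eth_add_rx_callback() is the existing ethdev
API; a real implementation would keep a handle per port/queue pair and
remove the callbacks again on disable):

#include <rte_ethdev.h>
#include <rte_lcore.h>

static const struct rte_eth_rxtx_callback *rx_cb[RTE_MAX_ETHPORTS];

static uint16_t
poll_stats_rx_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
        struct rte_mbuf **pkts __rte_unused, uint16_t nb_pkts,
        uint16_t max_pkts __rte_unused, void *arg __rte_unused)
{
    /* classify this poll as busy/idle based on what it returned */
    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_pkts);
    return nb_pkts;
}

/* library-level enable: no app changes beyond this one call;
 * for brevity only queue 0 of each port is hooked here */
void
poll_stats_enable(void)
{
    uint16_t port;

    RTE_ETH_FOREACH_DEV(port)
        rx_cb[port] = rte_eth_add_rx_callback(port, 0,
                poll_stats_rx_cb, NULL);
}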
>>>
>>>> 2. Ring does not have callback support, meaning pipelined applications
>>>> could not report lcore poll busyness telemetry with this approach.
>>> That's another big concern that I have:
>>> Why do you consider that all rings will be used for pipelines between 
>>> threads and should always be accounted by your stats?
>>> They could be used for dozens different purposes.
>>> What if that ring is used for a mempool, and ring_dequeue() just means 
>>> we try to allocate an object from the pool? In such a case, why should 
>>> failing to allocate an object mean the start of a new 'idle cycle'?
>>
>> Another approach could be taken here if the mempool interactions are 
>> of concern.
>>
>>  From our understanding, mempool operations use the "_bulk" APIs, 
>> whereas polling operations use the "_burst" APIs. Would timestamping 
>> only on the "_burst" APIs be better here? That way the mempool 
>> interactions won't be counted towards the busyness.
> 
> Well, it would help to solve one particular case,
> but in general I still think it is incomplete and error-prone.

I agree.

The functionality provided is very useful, and the implementation is 
clever in the way it doesn't require any application modifications. But, 
a clever, useful brittle hack is still a brittle hack.

What if there was instead a busyness module, where the application would 
explicitly report what it was up to. The new library would hook up to 
telemetry just like this patchset does, plus provide an explicit API to 
retrieve lcore thread load.
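
Sketched out, the interface of such a module might look something like
this (all names here are hypothetical; nothing like this exists in DPDK
today):

/* the application brackets each unit of work and reports whether the
 * cycles were spent usefully */
uint64_t rte_lcore_load_work_begin(void);
void rte_lcore_load_work_end(uint64_t begin_tsc, bool was_useful);

/* consumers (telemetry, power management, a load balancer) read back
 * a 0..100 load figure per lcore */
int rte_lcore_load_get(unsigned int lcore_id);

with an lcore poll loop using it roughly as:

uint64_t t = rte_lcore_load_work_begin();
uint16_t nb = rte_eth_rx_burst(port, queue, pkts, BURST);
if (nb != 0)
    process(pkts, nb);
rte_lcore_load_work_end(t, nb != 0);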

The service cores framework (fancy name for rte_service.c) could also 
call the lcore load tracking module, provided all services properly 
reported back on whether or not they were doing anything useful with the 
cycles they just spent.

The metrics of such a load tracking module could potentially be used by 
other modules in DPDK, or by the application. It could potentially be 
used for dynamic load balancing of service core services, or for power 
management (e.g, DVFS), or for a potential future deferred-work type 
mechanism more sophisticated than current rte_service, or some green 
threads/coroutines/fiber thingy. The DSW event device could also use it 
to replace its current internal load estimation scheme.

I may be repeating myself here, from past threads.

> What if a pipeline app uses ring_count()/ring_dequeue_bulk(),
> or even the ZC ring API?
> What if an app uses something different from rte_ring to pass
> packets between threads/processes?
> As I said before, without some clues from the app, it is probably
> not possible to collect such stats in a proper way.
> 
> 
>> Including support for pipelined applications using rings is key for a 
>> number of use cases; this was highlighted as part of the customer 
>> feedback when we shared the design.
>>
>>>
>>>> Eventdev is another driver which would be completely missed with this
>>>> approach.
>>> Ok, I see two ways here:
>>> - implement CB support for eventdev.
>>> - meanwhile, clearly document that these stats are not supported for 
>>> eventdev scenarios (yet).
>
  
Morten Brørup Oct. 4, 2022, 9:15 a.m. UTC | #11
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 3 October 2022 22.02

[...]

> The functionality provided is very useful, and the implementation is
> clever in the way it doesn't require any application modifications.
> But,
> a clever, useful brittle hack is still a brittle hack.
> 
> What if there was instead a busyness module, where the application
> would
> explicitly report what it was up to. The new library would hook up to
> telemetry just like this patchset does, plus provide an explicit API to
> retrieve lcore thread load.
> 
> The service cores framework (fancy name for rte_service.c) could also
> call the lcore load tracking module, provided all services properly
> reported back on whether or not they were doing anything useful with
> the
> cycles they just spent.
> 
> The metrics of such a load tracking module could potentially be used by
> other modules in DPDK, or by the application. It could potentially be
> used for dynamic load balancing of service core services, or for power
> management (e.g, DVFS), or for a potential future deferred-work type
> mechanism more sophisticated than current rte_service, or some green
> threads/coroutines/fiber thingy. The DSW event device could also use it
> to replace its current internal load estimation scheme.

[...]

I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.

This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.

In my opinion, it would be much better with a well-designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that are directly usable for decision making.)

There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).

So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)

Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).

I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(

-Morten
  
Bruce Richardson Oct. 4, 2022, 11:57 a.m. UTC | #12
On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Monday, 3 October 2022 22.02
> 
> [...]
> 
> > The functionality provided is very useful, and the implementation is
> > clever in the way it doesn't require any application modifications.
> > But,
> > a clever, useful brittle hack is still a brittle hack.
> > 

I think that may be a little harsh here. After all, this is a feature which
is build-time disabled and runtime disabled by default, so like many other
components it's designed for use when it makes sense to do so.

Furthermore, I'd just like to point out that the authors, when doing the
patches, have left in the hooks so that even apps for which the "for-free"
scheme doesn't work can still leverage the infrastructure to have the app
itself report the busy/free metrics.

> > What if there was instead a busyness module, where the application
> > would
> > explicitly report what it was up to. The new library would hook up to
> > telemetry just like this patchset does, plus provide an explicit API to
> > retrieve lcore thread load.
> > 
> > The service cores framework (fancy name for rte_service.c) could also
> > call the lcore load tracking module, provided all services properly
> > reported back on whether or not they were doing anything useful with
> > the
> > cycles they just spent.
> > 
> > The metrics of such a load tracking module could potentially be used by
> > other modules in DPDK, or by the application. It could potentially be
> > used for dynamic load balancing of service core services, or for power
> > management (e.g, DVFS), or for a potential future deferred-work type
> > mechanism more sophisticated than current rte_service, or some green
> > threads/coroutines/fiber thingy. The DSW event device could also use it
> > to replace its current internal load estimation scheme.
> 
> [...]
> 
> I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
> 
> This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.
> 
> In my opinion, it would be much better with a well-designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that are directly usable for decision making.)
> 
> There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).
> 
> So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)
> 
> Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).
> 
> I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(
> 

I'd actually like to question this last statement a little.

I think we in the DPDK community are very good at coming up with
theoretical examples where things don't work, but are they really cases
that occur commonly in the real world?

I accept, for example, that the "for free" approach would not be suitable
for something like VPP which does multiple polls to gather packets before
processing, but for some of the other cases I'd question their commonality.
For example, a number of objections have focused on the case where
allocation of buffers fails and so the busyness gets counted wrongly.  Are
there really (many) apps out there where running out of buffers is not a
much more serious problem than incorrectly reported busyness stats?

I'd also say that, in my experience, the non-open-source end-user apps tend
very much to use DPDK based on the style of operation given in our DPDK
examples, rather than trying out new or different ways of working. (Maybe
others have different experiences, though, and can comment). I also tend to
believe that open-source software using DPDK probably shows more variety in
how things are done, which is not representative of a lot of non-OSS users
of DPDK.

Regards,
/Bruce
  
Mattias Rönnblom Oct. 4, 2022, 2:26 p.m. UTC | #13
On 2022-10-04 13:57, Bruce Richardson wrote:
> On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>> Sent: Monday, 3 October 2022 22.02
>>
>> [...]
>>
>>> The functionality provided is very useful, and the implementation is
>>> clever in the way it doesn't require any application modifications.
>>> But,
>>> a clever, useful brittle hack is still a brittle hack.
>>>
> 
> I think that may be a little harsh here. After all, this is a feature which
> is build-time disabled and runtime disabled by default, so like many other
> components it's designed for use when it makes sense to do so.
> 

So you don't think it's a hack? The driver level and the level of basic 
data structures (e.g., the ring) are the appropriate levels to classify 
cycles into useful and not useful? And you don't think all the shaky 
assumptions make it brittle?

Runtime configurable or not doesn't make a difference in this regard, in 
my opinion. On the source code level, this code is there, and making it 
compile-time conditional just makes matters worse.

Had this feature been limited to a small library, it would have made a 
difference, but it's smeared across a wide range of APIs, and this list 
is not yet complete. Anything that can produce items of work needs to be 
adapted.

That said, it's not obvious how this should be done. The higher-layer 
constructs where this should really be done aren't there in DPDK, at 
least not yet.

Have you considered the option to instrument rte_pause()? It's the 
closest DPDK has to the (now largely extinct) idle loop in an OS kernel. 
It too would be a hack, but maybe a less intrusive one.
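
For the record, a sketch of what that could look like (a hypothetical
change to the x86 flavour of rte_pause(); other architectures have
their own implementations):

static inline void
rte_pause(void)
{
    /* a thread that pauses is, by definition, not doing anything
     * useful right now - count it as an empty poll */
    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
    _mm_pause();
}

Busy-wait loops that never call rte_pause() would of course still go
unnoticed, and something else would still have to mark the busy side.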

> Furthermore, I'd just like to point out that the authors, when doing the
> patches, have left in the hooks so that even apps, for which the "for-free"
> scheme doesn't work, can still leverage the infrastructure to have the app
> itself report the busy/free metrics.
> 

If this is done properly, in a way that the data can reasonably be 
trusted and it can be enabled at runtime without much of a performance 
implication, tracking lcore load could be much more useful than just 
best-effort telemetry.

Why is it so important not to require changes to the application? The 
changes are likely trivial, not unlike those I've submitted for the 
equivalent bookkeeping for DPDK services.

>>> What if there was instead a busyness module, where the application
>>> would
>>> explicitly report what it was up to. The new library would hook up to
>>> telemetry just like this patchset does, plus provide an explicit API to
>>> retrieve lcore thread load.
>>>
>>> The service cores framework (fancy name for rte_service.c) could also
>>> call the lcore load tracking module, provided all services properly
>>> reported back on whether or not they were doing anything useful with
>>> the
>>> cycles they just spent.
>>>
>>> The metrics of such a load tracking module could potentially be used by
>>> other modules in DPDK, or by the application. It could potentially be
>>> used for dynamic load balancing of service core services, or for power
>>> management (e.g, DVFS), or for a potential future deferred-work type
>>> mechanism more sophisticated than current rte_service, or some green
>>> threads/coroutines/fiber thingy. The DSW event device could also use it
>>> to replace its current internal load estimation scheme.
>>
>> [...]
>>
>> I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
>>
>> This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.
>>
>> In my opinion, it would be much better with a well-designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that are directly usable for decision making.)
>>
>> There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).
>>
>> So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)
>>
>> Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).
>>
>> I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(
>>
> 
> I'd actually like to question this last statement a little.
> 
> I think we in the DPDK community are very good at coming up with
> theoretical examples where things don't work, but are they really cases
> that occur commonly in the real-world?
> 
> I accept, for example, that the "for free" approach would not be suitable
> for something like VPP which does multiple polls to gather packets before
> processing, but for some of the other cases I'd question their commonality.
> For example, a number of objections have focused on the case where
> allocation of buffers fails and so the busyness gets counted wrongly.  Are
> there really (many) apps out there where running out of buffers is not a
> much more serious problem than incorrectly reported busyness stats?
> 

Many, if not all, non-trivial DPDK applications will poll multiple 
sources of work, some of which almost always will fail to produce any 
items. In such cases, they will transition between the busy and idle 
state, potentially several times, for every iteration of their lcore 
thread poll loop. That will cause a performance degradation if this 
feature is used, and there's nothing they can do to fix it from the 
application level, assuming they find this telemetry statistic useful 
and don't want it disabled. So, not "for free", although maybe you can 
still argue it's a bargain. :)
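
To make that concrete, consider a loop fragment like the one below (the
ports, rings, buffers and BURST are assumed to be set up elsewhere);
with the feature enabled, each zero-yield call runs the timestamp macro
internally:

uint16_t nb_rx, nb_ev;
unsigned int nb_msg;

for (;;) {
    nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST);
    nb_ev = rte_event_dequeue_burst(evdev, evport, evs, BURST, 0);
    nb_msg = rte_ring_dequeue_burst(msg_ring, msgs, BURST, NULL);

    /* dispatch whatever was received; any one of the three polls
     * coming back empty has already flipped the recorded state to
     * "idle", regardless of how much the other two produced */
}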

> I'd also say that, in my experience, the non-open-source end-user apps tend
> very much to use DPDK based on the style of operation given in our DPDK
> examples, rather than trying out new or different ways of working. (Maybe
> others have different experiences, though, and can comment). I also tend to
> believe that open-source software using DPDK probably shows more variety in
> how things are done, which is not representative of a lot of non-OSS users
> of DPDK.
> 
> Regards,
> /Bruce
  
Konstantin Ananyev Oct. 4, 2022, 11:30 p.m. UTC | #14
Hi Bruce,

> On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
> > > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > > Sent: Monday, 3 October 2022 22.02
> >
> > [...]
> >
> > > The functionality provided is very useful, and the implementation is
> > > clever in the way it doesn't require any application modifications.
> > > But,
> > > a clever, useful brittle hack is still a brittle hack.
> > >
> 
> I think that may be a little harsh here. After all, this is a feature which
> is build-time disabled and runtime disabled by default, so like many other
> components it's designed for use when it makes sense to do so.

Honestly, I don't understand why both you and Kevin think that conditional
compilation provides some sort of indulgence here...
Putting #ifdef around problematic code wouldn't make it any better.
In fact, I think it only makes things worse - adds more confusion,
makes it harder to follow the code, etc. 

> 
> Furthermore, I'd just like to point out that the authors, when doing the
> patches, have left in the hooks so that even apps, for which the "for-free"
> scheme doesn't work, can still leverage the infrastructure to have the app
> itself report the busy/free metrics.

Ok, then it is probably a good opportunity not to push for a problematic
solution, but to try to exploit these hook points instead.
Take one of the existing DPDK examples and add code to exercise these hook
points. That will also demonstrate to the user how to use these hooks
properly, and how difficult it would be to adopt such an approach.
 
> > > What if there was instead a busyness module, where the application
> > > would
> > > explicitly report what it was up to. The new library would hook up to
> > > telemetry just like this patchset does, plus provide an explicit API to
> > > retrieve lcore thread load.
> > >
> > > The service cores framework (fancy name for rte_service.c) could also
> > > call the lcore load tracking module, provided all services properly
> > > reported back on whether or not they were doing anything useful with
> > > the
> > > cycles they just spent.
> > >
> > > The metrics of such a load tracking module could potentially be used by
> > > other modules in DPDK, or by the application. It could potentially be
> > > used for dynamic load balancing of service core services, or for power
> > > management (e.g, DVFS), or for a potential future deferred-work type
> > > mechanism more sophisticated than current rte_service, or some green
> > > threads/coroutines/fiber thingy. The DSW event device could also use it
> > > to replace its current internal load estimation scheme.
> >
> > [...]
> >
> > I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
> >
> > This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't
> require any application modifications, and thus can be deployed with zero effort.
> >
> > In my opinion, it would be much better with a well-designed generic load tracking library, to be called from the application, so it gets
> correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for
> consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry.
> ("Actionable statistics": Statistics that is directly usable for decision making.)
> >
> > There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that
> meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into
> applications before it provides benefits (to the users of the DPDK applications that use the new library).
> >
> > So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a
> long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data
> with this patch. You might also label a generic library "application specific", because it requires that the application uses the library -
> however that is a common requirement of all DPDK libraries.)
> >
> > Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be
> added to OVS within a relatively short time frame (although not as quick as this patch).
> >
> > I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came
> as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions
> and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than
> expected. I'm sorry to say it, but I think that is the case for this patch. :-(
> >
> 
> I'd actually like to question this last statement a little.
> 
> I think we in the DPDK community are very good at coming up with
> theoretical examples where things don't work, but are they really cases
> that occur commonly in the real-world?
> 
> I accept, for example, that the "for free" approach would not be suitable
> for something like VPP which does multiple polls to gather packets before
> processing, but for some of the other cases I'd question their commonality.
> For example, a number of objections have focused on the case where
> allocation of buffers fails and so the busyness gets counted wrongly.  Are
> there really (many) apps out there where running out of buffers is not a
> much more serious problem than incorrectly reported busyness stats?

Obviously, an inability to dynamically allocate memory could flag a serious problem.
Though I don't see why it should be treated as an excuse to provide misleading statistics.
There are many real-world network appliances that are supposed
to keep working properly even under severe memory pressure.
As an example: suppose your app is doing some sort of TCP connection tracking.
So, for every new flow you need to allocate some socket-like structure.
Also suppose that for performance reasons you use a DPDK mempool to manage
these structures.
Now, there could be situations (a SYN flood attack) when you run out of your sockets.
In that situation it is probably ok to start dropping such packets,
but traffic belonging to already existing connections, plus non-TCP traffic,
is still expected to be handled properly.
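
To put the above in code (a sketch using the existing rte_mempool_get();
with the default ring-based mempool driver, the failed dequeue
underneath this call is exactly what the patch would record as the
start of an 'idle' period):

struct conn;

static struct conn *
conn_alloc(struct rte_mempool *sock_mp)
{
    void *obj;

    if (rte_mempool_get(sock_mp, &obj) != 0) {
        /* pool exhausted (e.g. SYN flood): drop the new flow
         * but keep forwarding existing traffic - normal
         * operation, not idleness */
        return NULL;
    }
    return obj;
}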

> 
> I'd also say that, in my experience, the non-open-source end-user apps tend
> very much to use DPDK based on the style of operation given in our DPDK
> examples, rather than trying out new or different ways of working. (Maybe
> others have different experiences, though, and can comment). I also tend to
> believe that open-source software using DPDK probably shows more variety in
> how things are done, which is not representative of a lot of non-OSS users
> of DPDK.
> 
> Regards,
> /Bruce
  

Patch

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@  endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index ae56a86394..86ac3b8a6e 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@ 
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6a98d3f11 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@  extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@  rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@  rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@  rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@  rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..fabc495a8e 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@  rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..a5b1d7c594 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@  rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..428157ec64 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@  rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@  rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..4c916c0fd2 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@  rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) != 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@  rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..3e27e0fd2b 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@ 
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
new file mode 100644
index 0000000000..d97996e85f
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
@@ -0,0 +1,303 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_poll_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_poll_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
+	/* if more than 1000 busyness periods have passed, this core is considered inactive */
+	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
+	struct lcore_poll_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable)
+{
+	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
+			(int)!enable, (int)enable);
+
+	/* Reset counters on successful disable */
+	if (set && !enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * We don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. Therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. This allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_poll_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE) {
+		RTE_LOG(DEBUG, EAL,
+				"Lcore telemetry not supported on unregistered non-EAL thread %d",
+				lcore_id);
+		return;
+	}
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(true);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(false);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+void
+eal_lcore_poll_telemetry_free(void)
+{
+	if (telemetry_data != NULL) {
+		free(telemetry_data);
+		telemetry_data = NULL;
+	}
+}
+
+RTE_INIT(lcore_init_poll_telemetry)
+{
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+
+	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
+{
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+void eal_lcore_poll_telemetry_free(void)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..e5741ce9f9 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@  sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_poll_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 26fbc91b26..92c4af9c28 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -895,6 +895,7 @@  rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_lcore_poll_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	rte_eal_alarm_cleanup();
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..2191c2473a 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -16,6 +16,7 @@ 
 #include <rte_eal.h>
 #include <rte_launch.h>
 #include <rte_thread.h>
+#include <rte_atomic.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -415,9 +416,91 @@  rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - true if lcore telemetry is enabled
+ *   - false if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(bool enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore poll busyness timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
+
+/** @internal free memory allocated for lcore telemetry */
+void
+eal_lcore_poll_telemetry_free(void);
+
+/**
+ * Call lcore poll busyness timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
+	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
+	if (enabled)								\
+		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
+} while (0)
+#else
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
 
-
 #endif /* _RTE_LCORE_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 37d29643a5..5e81352a81 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1364,6 +1364,7 @@  rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_lcore_poll_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@  subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..3275d1fac4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@  EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_poll_busyness_timestamp;
+	__rte_lcore_poll_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..4c8113f31f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@  rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a65b3c7c85 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..1cba53270a 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@ 
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@  rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..8caaed502f 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..cf2370c238 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@  __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..9b20a36fdb 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@  option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description: