From patchwork Mon Jun 28 15:54:08 2021
X-Patchwork-Submitter: Anatoly Burakov
X-Patchwork-Id: 94900
From: Anatoly Burakov
To: dev@dpdk.org, Timothy McDaniel, Beilei Xing, Jingjing Wu, Qiming Yang, Qi Zhang, Haiyue Wang, Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt@intel.com, ciara.loftus@intel.com
Date: Mon, 28 Jun 2021 15:54:08 +0000
Subject: [dpdk-dev] [PATCH v4 1/7] power_intrinsics: use callbacks for comparison

Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value in a specific way.

This commit replaces the comparison with a user callback mechanism, so
that any PMD (or other code) using `rte_power_monitor()` can define
their own comparison semantics and decision making on how to detect the
need to abort the entering of power optimized state.

Existing implementations are adjusted to follow the new semantics.

Suggested-by: Konstantin Ananyev
Signed-off-by: Anatoly Burakov
Acked-by: Konstantin Ananyev
---

Notes:
    v4:
    - Return error if callback is set to NULL
    - Replace raw number with a macro in monitor condition opaque data

    v2:
    - Use callback mechanism for more flexibility
    - Address feedback from Konstantin

 doc/guides/rel_notes/release_21_08.rst        |  1 +
 drivers/event/dlb2/dlb2.c                     | 17 ++++++++--
 drivers/net/i40e/i40e_rxtx.c                  | 20 +++++++----
 drivers/net/iavf/iavf_rxtx.c                  | 20 +++++++----
 drivers/net/ice/ice_rxtx.c                    | 20 +++++++----
 drivers/net/ixgbe/ixgbe_rxtx.c                | 20 +++++++----
 drivers/net/mlx5/mlx5_rx.c                    | 17 ++++++++--
 .../include/generic/rte_power_intrinsics.h    | 33 +++++++++++++++----
 lib/eal/x86/rte_power_intrinsics.c            | 17 +++++-----
 9 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index a6ecfdf3ce..c84ac280f5 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -84,6 +84,7 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
ABI Changes ----------- diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index eca183753f..252bbd8d5e 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num) } } +#define CLB_MASK_IDX 0 +#define CLB_VAL_IDX 1 +static int +dlb2_monitor_callback(const uint64_t val, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + /* abort if the value matches */ + return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0; +} + static inline int dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, struct dlb2_eventdev_port *ev_port, @@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, expected_value = 0; pmc.addr = monitor_addr; - pmc.val = expected_value; - pmc.mask = qe_mask.raw_qe[1]; + /* store expected value and comparison mask in opaque data */ + pmc.opaque[CLB_VAL_IDX] = expected_value; + pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1]; + /* set up callback */ + pmc.fn = dlb2_monitor_callback; pmc.size = sizeof(uint64_t); rte_power_monitor(&pmc, timeout + start_ticks); diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 6c58decece..081682f88b 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -81,6 +81,18 @@ #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK) +static int +i40e_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? 
-1 : 0; +} + int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.qword1.status_error_len; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. - */ - pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); - pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + /* comparison callback */ + pmc->fn = i40e_monitor_callback; /* registers are 64-bit */ pmc->size = sizeof(uint64_t); diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c index 0361af0d85..7ed196ec22 100644 --- a/drivers/net/iavf/iavf_rxtx.c +++ b/drivers/net/iavf/iavf_rxtx.c @@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type) rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1; } +static int +iavf_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.qword1.status_error_len; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. 
- */ - pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); - pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); + /* comparison callback */ + pmc->fn = iavf_monitor_callback; /* registers are 64-bit */ pmc->size = sizeof(uint64_t); diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index fc9bb5a3e7..d12437d19d 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +static int +ice_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.status_error0; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. 
- */ - pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); - pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + /* comparison callback */ + pmc->fn = ice_monitor_callback; /* register is 16-bit */ pmc->size = sizeof(uint16_t); diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index d69f36e977..c814a28cb4 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,18 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +static int +ixgbe_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.upper.status_error; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. 
- */ - pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); - pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + /* comparison callback */ + pmc->fn = ixgbe_monitor_callback; /* the registers are 32-bit */ pmc->size = sizeof(uint32_t); diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c index 777a1d6e45..17370b77dc 100644 --- a/drivers/net/mlx5/mlx5_rx.c +++ b/drivers/net/mlx5/mlx5_rx.c @@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id) return rx_queue_count(rxq); } +#define CLB_VAL_IDX 0 +#define CLB_MSK_IDX 1 +static int +mlx_monitor_callback(const uint64_t value, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + const uint64_t m = opaque[CLB_MSK_IDX]; + const uint64_t v = opaque[CLB_VAL_IDX]; + + return (value & m) == v ? -1 : 0; +} + int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { struct mlx5_rxq_data *rxq = rx_queue; @@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) return -rte_errno; } pmc->addr = &cqe->op_own; - pmc->val = !!idx; - pmc->mask = MLX5_CQE_OWNER_MASK; + pmc->opaque[CLB_VAL_IDX] = !!idx; + pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK; + pmc->fn = mlx_monitor_callback; pmc->size = sizeof(uint8_t); return 0; } diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h index dddca3d41c..c9aa52a86d 100644 --- a/lib/eal/include/generic/rte_power_intrinsics.h +++ b/lib/eal/include/generic/rte_power_intrinsics.h @@ -18,19 +18,38 @@ * which are architecture-dependent. */ +/** Size of the opaque data in monitor condition */ +#define RTE_POWER_MONITOR_OPAQUE_SZ 4 + +/** + * Callback definition for monitoring conditions. Callbacks with this signature + * will be used by `rte_power_monitor()` to check if the entering of power + * optimized state should be aborted. + * + * @param val + * The value read from memory. + * @param opaque + * Callback-specific data. 
+ * + * @return + * 0 if entering of power optimized state should proceed + * -1 if entering of power optimized state should be aborted + */ +typedef int (*rte_power_monitor_clb_t)(const uint64_t val, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]); struct rte_power_monitor_cond { volatile void *addr; /**< Address to monitor for changes */ - uint64_t val; /**< If the `mask` is non-zero, location pointed - * to by `addr` will be read and compared - * against this value. - */ - uint64_t mask; /**< 64-bit mask to extract value read from `addr` */ - uint8_t size; /**< Data size (in bytes) that will be used to compare - * expected value (`val`) with data read from the + uint8_t size; /**< Data size (in bytes) that will be read from the * monitored memory location (`addr`). Can be 1, 2, * 4, or 8. Supplying any other value will result in * an error. */ + rte_power_monitor_clb_t fn; /**< Callback to be used to check if + * entering power optimized state should + * be aborted. + */ + uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]; + /**< Callback-specific data */ }; /** diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c index 39ea9fdecd..66fea28897 100644 --- a/lib/eal/x86/rte_power_intrinsics.c +++ b/lib/eal/x86/rte_power_intrinsics.c @@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); const unsigned int lcore_id = rte_lcore_id(); struct power_wait_status *s; + uint64_t cur_value; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) @@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, if (__check_val_size(pmc->size) < 0) return -EINVAL; + if (pmc->fn == NULL) + return -EINVAL; + s = &wait_status[lcore_id]; /* update sleep address */ @@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, /* now that we've put this address into monitor, we can unlock */ 
rte_spinlock_unlock(&s->lock); - /* if we have a comparison mask, we might not need to sleep at all */ - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->size); - const uint64_t masked = cur_value & pmc->mask; + cur_value = __get_umwait_val(pmc->addr, pmc->size); - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - goto end; - } + /* check if callback indicates we should abort */ + if (pmc->fn(cur_value, pmc->opaque) != 0) + goto end; /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" From patchwork Mon Jun 28 15:54:09 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94901 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 0391CA0A0C; Mon, 28 Jun 2021 17:54:33 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 1460A41150; Mon, 28 Jun 2021 17:54:25 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id CC5F041145 for ; Mon, 28 Jun 2021 17:54:22 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975496" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975496" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296677" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:20 -0700 From: Anatoly Burakov To: dev@dpdk.org, Ciara Loftus , Qi Zhang Cc: david.hunt@intel.com, konstantin.ananyev@intel.com Date: Mon, 28 
Jun 2021 15:54:09 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v4 2/7] net/af_xdp: add power monitor support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Implement support for .get_monitor_addr in AF_XDP driver. Signed-off-by: Anatoly Burakov --- Notes: v2: - Rewrite using the callback mechanism drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c index eb5660a3dc..7830d0c23a 100644 --- a/drivers/net/af_xdp/rte_eth_af_xdp.c +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c @@ -37,6 +37,7 @@ #include #include #include +#include #include "compat.h" @@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev) return 0; } +#define CLB_VAL_IDX 0 +static int +eth_monitor_callback(const uint64_t value, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + const uint64_t v = opaque[CLB_VAL_IDX]; + const uint64_t m = (uint32_t)~0; + + /* if the value has changed, abort entering power optimized state */ + return (value & m) == v ? 
0 : -1; +} + +static int +eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct pkt_rx_queue *rxq = rx_queue; + unsigned int *prod = rxq->rx.producer; + const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */ + + /* watch for changes in producer ring */ + pmc->addr = (void*)prod; + + /* store current value */ + pmc->opaque[CLB_VAL_IDX] = cur_val; + pmc->fn = eth_monitor_callback; + + /* AF_XDP producer ring index is 32-bit */ + pmc->size = sizeof(uint32_t); + + return 0; +} + static int eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) { @@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = { .link_update = eth_link_update, .stats_get = eth_stats_get, .stats_reset = eth_stats_reset, + .get_monitor_addr = eth_get_monitor_addr }; /** parse busy_budget argument */ From patchwork Mon Jun 28 15:54:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94902 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 7FCBEA0A0C; Mon, 28 Jun 2021 17:54:41 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id A624541156; Mon, 28 Jun 2021 17:54:31 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id 790DC4068A for ; Mon, 28 Jun 2021 17:54:26 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975502" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975502" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:25 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296688" Received: from 
silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:22 -0700
From: Anatoly Burakov
To: dev@dpdk.org, Jerin Jacob, Ruifeng Wang, Jan Viktorin, David Christensen, Ray Kinsella, Neil Horman, Bruce Richardson, Konstantin Ananyev
Cc: david.hunt@intel.com, ciara.loftus@intel.com
Date: Mon, 28 Jun 2021 15:54:10 +0000
Message-Id: <0ddcda8c9a67ad214a8f98c851d10a279920f581.1624895595.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [PATCH v4 3/7] eal: add power monitor for multiple events

Use RTM and WAITPKG instructions to perform a wait-for-writes similar to
what UMWAIT does, but without the limitation of having to listen for
just one event. This works because the optimized power state used by the
TPAUSE instruction will cause a wake up on RTM transaction abort, so if
we add the addresses we're interested in to the read-set, any write to
those addresses will wake us up.
Signed-off-by: Konstantin Ananyev
Signed-off-by: Anatoly Burakov
---

Notes:
    v4:
    - Fixed bugs in accessing the monitor condition
    - Abort on any monitor condition not having a defined callback

    v2:
    - Adapt to callback mechanism

 doc/guides/rel_notes/release_21_08.rst        |  2 +
 lib/eal/arm/rte_power_intrinsics.c            | 11 +++
 lib/eal/include/generic/rte_cpuflags.h        |  2 +
 .../include/generic/rte_power_intrinsics.h    | 35 +++++++++
 lib/eal/ppc/rte_power_intrinsics.c            | 11 +++
 lib/eal/version.map                           |  3 +
 lib/eal/x86/rte_cpuflags.c                    |  2 +
 lib/eal/x86/rte_power_intrinsics.c            | 73 +++++++++++++++++++
 8 files changed, 139 insertions(+)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index c84ac280f5..9d1cfac395 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -55,6 +55,8 @@ New Features
    Also, make sure to start the actual text at the margin.
    =======================================================
 
+* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events.
+ Removed Items ------------- diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c index e83f04072a..78f55b7203 100644 --- a/lib/eal/arm/rte_power_intrinsics.c +++ b/lib/eal/arm/rte_power_intrinsics.c @@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return -ENOTSUP; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + RTE_SET_USED(pmc); + RTE_SET_USED(num); + RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; +} diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h index 28a5aecde8..d35551e931 100644 --- a/lib/eal/include/generic/rte_cpuflags.h +++ b/lib/eal/include/generic/rte_cpuflags.h @@ -24,6 +24,8 @@ struct rte_cpu_intrinsics { /**< indicates support for rte_power_monitor function */ uint32_t power_pause : 1; /**< indicates support for rte_power_pause function */ + uint32_t power_monitor_multi : 1; + /**< indicates support for rte_power_monitor_multi function */ }; /** diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h index c9aa52a86d..04e8c2ab37 100644 --- a/lib/eal/include/generic/rte_power_intrinsics.h +++ b/lib/eal/include/generic/rte_power_intrinsics.h @@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id); __rte_experimental int rte_power_pause(const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Monitor a set of addresses for changes. This will cause the CPU to enter an + * architecture-defined optimized power state until either one of the specified + * memory addresses is written to, a certain TSC timestamp is reached, or other + * reasons cause the CPU to wake up. + * + * Additionally, each monitoring condition carries a callback (`fn`) and opaque 
+ * data. The current value at each monitored address is read and passed to its + * callback, and if any callback returns non-zero, the entering of optimized + * power state is aborted. + * + * @warning It is the responsibility of the user to check if this function is + * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. + * Failing to do so may result in an illegal CPU instruction error. + * + * @param pmc + * An array of monitoring condition structures. + * @param num + * Length of the `pmc` array. + * @param tsc_timestamp + * Maximum TSC timestamp to wait for. Note that the wait behavior is + * architecture-dependent. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported + */ +__rte_experimental +int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp); + #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c index 7fc9586da7..f00b58ade5 100644 --- a/lib/eal/ppc/rte_power_intrinsics.c +++ b/lib/eal/ppc/rte_power_intrinsics.c @@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return -ENOTSUP; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + RTE_SET_USED(pmc); + RTE_SET_USED(num); + RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; +} diff --git a/lib/eal/version.map b/lib/eal/version.map index fe5c3dac98..4ccd5475d6 100644 --- a/lib/eal/version.map +++ b/lib/eal/version.map @@ -423,6 +423,9 @@ EXPERIMENTAL { rte_version_release; # WINDOWS_NO_EXPORT rte_version_suffix; # WINDOWS_NO_EXPORT rte_version_year; # WINDOWS_NO_EXPORT + + # added in 21.08 + rte_power_monitor_multi; # WINDOWS_NO_EXPORT }; INTERNAL { diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c index a96312ff7f..d339734a8c 100644 --- a/lib/eal/x86/rte_cpuflags.c +++ 
b/lib/eal/x86/rte_cpuflags.c @@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics) if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) { intrinsics->power_monitor = 1; intrinsics->power_pause = 1; + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM)) + intrinsics->power_monitor_multi = 1; } } diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c index 66fea28897..f749da9b85 100644 --- a/lib/eal/x86/rte_power_intrinsics.c +++ b/lib/eal/x86/rte_power_intrinsics.c @@ -4,6 +4,7 @@ #include #include +#include #include #include "rte_power_intrinsics.h" @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr) } static bool wait_supported; +static bool wait_multi_supported; static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) @@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; + if (i.power_monitor_multi) + wait_multi_supported = 1; } int @@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) * In this case, since we've already woken up, the "wakeup" was * unneeded, and since T1 is still waiting on T2 releasing the lock, the * wakeup address is still valid so it's perfectly safe to write it. + * + * For multi-monitor case, the act of locking will in itself trigger the + * wakeup, so no additional writes necessary. 
*/ rte_spinlock_lock(&s->lock); if (s->monitor_addr != NULL) @@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return 0; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s = &wait_status[lcore_id]; + uint32_t i, rc; + + /* check if supported */ + if (!wait_multi_supported) + return -ENOTSUP; + + if (pmc == NULL || num == 0) + return -EINVAL; + + /* we are already inside transaction region, return */ + if (rte_xtest() != 0) + return 0; + + /* start new transaction region */ + rc = rte_xbegin(); + + /* transaction abort, possible write to one of wait addresses */ + if (rc != RTE_XBEGIN_STARTED) + return 0; + + /* + * the mere act of reading the lock status here adds the lock to + * the read set. This means that when we trigger a wakeup from another + * thread, even if we don't have a defined wakeup address and thus don't + * actually cause any writes, the act of locking our lock will itself + * trigger the wakeup and abort the transaction. + */ + rte_spinlock_is_locked(&s->lock); + + /* + * add all addresses to wait on into transaction read-set and check if + * any of wakeup conditions are already met. 
+ */ + rc = 0; + for (i = 0; i < num; i++) { + const struct rte_power_monitor_cond *c = &pmc[i]; + + /* cannot be NULL */ + if (c->fn == NULL) { + rc = -EINVAL; + break; + } + + const uint64_t val = __get_umwait_val(c->addr, c->size); + + /* abort if callback indicates that we need to stop */ + if (c->fn(val, c->opaque) != 0) + break; + } + + /* none of the conditions were met, sleep until timeout */ + if (i == num) + rte_power_pause(tsc_timestamp); + + /* end transaction region */ + rte_xend(); + + return rc; +} From patchwork Mon Jun 28 15:54:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94903 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id B5D26A0A0C; Mon, 28 Jun 2021 17:54:48 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id CECBA41160; Mon, 28 Jun 2021 17:54:32 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id 94B2E4068A for ; Mon, 28 Jun 2021 17:54:28 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975509" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975509" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:28 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296693" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:26 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Mon, 28 Jun 2021 15:54:11 +0000 Message-Id: X-Mailer: git-send-email 
2.25.1
Subject: [dpdk-dev] [PATCH v4 4/7] power: remove thread safety from PMD power API's

Currently, we expect that only one callback can be active at any given
moment, for a particular queue configuration, which is relatively easy
to implement in a thread-safe way. However, we're about to add support
for multiple queues per lcore, which will greatly increase the
possibility of various race conditions.

We could have used something like an RCU for this use case, but absent
a pressing need for thread safety we'll go the easy way and just
mandate that the APIs are to be called when all affected ports are
stopped, and document this limitation. This greatly simplifies the
`rte_power_monitor`-related code.

Signed-off-by: Anatoly Burakov
---

Notes:
    v2:
    - Add check for stopped queue
    - Clarified doc message
    - Added release notes

 doc/guides/rel_notes/release_21_08.rst |   5 +
 lib/power/meson.build                  |   3 +
 lib/power/rte_power_pmd_mgmt.c         | 133 ++++++++++---------------
 lib/power/rte_power_pmd_mgmt.h         |   6 ++
 4 files changed, 67 insertions(+), 80 deletions(-)

diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst
index 9d1cfac395..f015c509fc 100644
--- a/doc/guides/rel_notes/release_21_08.rst
+++ b/doc/guides/rel_notes/release_21_08.rst
@@ -88,6 +88,11 @@ API Changes
 
 * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism.
 
+* rte_power: The experimental PMD power management API is no longer considered
+  to be thread safe; all Rx queues affected by the API will now need to be
+  stopped before making any changes to the power management scheme.
+ + ABI Changes ----------- diff --git a/lib/power/meson.build b/lib/power/meson.build index c1097d32f1..4f6a242364 100644 --- a/lib/power/meson.build +++ b/lib/power/meson.build @@ -21,4 +21,7 @@ headers = files( 'rte_power_pmd_mgmt.h', 'rte_power_guest_channel.h', ) +if cc.has_argument('-Wno-cast-qual') + cflags += '-Wno-cast-qual' +endif deps += ['timer', 'ethdev'] diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index db03cbf420..9b95cf1794 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -40,8 +40,6 @@ struct pmd_queue_cfg { /**< Callback mode for this queue */ const struct rte_eth_rxtx_callback *cur_cb; /**< Callback instance */ - volatile bool umwait_in_progress; - /**< are we currently sleeping? */ uint64_t empty_poll_stats; /**< Number of empty polls */ } __rte_cache_aligned; @@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, struct rte_power_monitor_cond pmc; uint16_t ret; - /* - * we might get a cancellation request while being - * inside the callback, in which case the wakeup - * wouldn't work because it would've arrived too early. - * - * to get around this, we notify the other thread that - * we're sleeping, so that it can spin until we're done. - * unsolicited wakeups are perfectly safe. 
- */ - q_conf->umwait_in_progress = true; - - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - /* check if we need to cancel sleep */ - if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { - /* use monitoring condition to sleep */ - ret = rte_eth_get_monitor_addr(port_id, qidx, - &pmc); - if (ret == 0) - rte_power_monitor(&pmc, UINT64_MAX); - } - q_conf->umwait_in_progress = false; - - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, UINT64_MAX); } } else q_conf->empty_poll_stats = 0; @@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx, return nb_rx; } +static int +queue_stopped(const uint16_t port_id, const uint16_t queue_id) +{ + struct rte_eth_rxq_info qinfo; + + if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0) + return -1; + + return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED; +} + int rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) { struct pmd_queue_cfg *queue_cfg; struct rte_eth_dev_info info; + rte_rx_callback_fn clb; int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); @@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, goto end; } + /* check if the queue is stopped */ + ret = queue_stopped(port_id, queue_id); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + ret = ret < 0 ? 
-EINVAL : -EBUSY; + goto end; + } + queue_cfg = &port_cfg[port_id][queue_id]; if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { @@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -ENOTSUP; goto end; } - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->umwait_in_progress = false; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* ensure we update our state before callback starts */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, - clb_umwait, NULL); + clb = clb_umwait; break; } case RTE_POWER_MGMT_TYPE_SCALE: @@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -ENOTSUP; goto end; } - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* this is not necessary here, but do it anyway */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, - queue_id, clb_scale_freq, NULL); + clb = clb_scale_freq; break; } case RTE_POWER_MGMT_TYPE_PAUSE: @@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, if (global_data.tsc_per_us == 0) calc_tsc(); - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* this is not necessary here, but do it anyway */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, - clb_pause, NULL); + clb = clb_pause; break; + default: + RTE_LOG(DEBUG, POWER, "Invalid power management type\n"); + ret = -EINVAL; + goto end; } + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + 
queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb, NULL); + ret = 0; end: return ret; @@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id) { struct pmd_queue_cfg *queue_cfg; + int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) return -EINVAL; + /* check if the queue is stopped */ + ret = queue_stopped(port_id, queue_id); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + return ret < 0 ? -EINVAL : -EBUSY; + } + /* no need to check queue id as wrong queue id would not be enabled */ queue_cfg = &port_cfg[port_id][queue_id]; @@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, /* stop any callbacks from progressing */ queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; - /* ensure we update our state before continuing */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - switch (queue_cfg->cb_mode) { - case RTE_POWER_MGMT_TYPE_MONITOR: - { - bool exit = false; - do { - /* - * we may request cancellation while the other thread - * has just entered the callback but hasn't started - * sleeping yet, so keep waking it up until we know it's - * done sleeping. - */ - if (queue_cfg->umwait_in_progress) - rte_power_monitor_wakeup(lcore_id); - else - exit = true; - } while (!exit); - } - /* fall-through */ + case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */ case RTE_POWER_MGMT_TYPE_PAUSE: rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cur_cb); @@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, break; } /* - * we don't free the RX callback here because it is unsafe to do so - * unless we know for a fact that all data plane threads have stopped. 
+ * the API doc mandates that the user stops all processing on affected + * ports before calling any of these API's, so we can assume that the + * callbacks can be freed. we're intentionally casting away const-ness. */ - queue_cfg->cur_cb = NULL; + rte_free((void *)queue_cfg->cur_cb); return 0; } diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h index 7a0ac24625..444e7b8a66 100644 --- a/lib/power/rte_power_pmd_mgmt.h +++ b/lib/power/rte_power_pmd_mgmt.h @@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type { * * @note This function is not thread-safe. * + * @warning This function must be called when all affected Ethernet queues are + * stopped and no Rx/Tx is in progress! + * * @param lcore_id * The lcore the Rx queue will be polled from. * @param port_id @@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, * * @note This function is not thread-safe. * + * @warning This function must be called when all affected Ethernet queues are + * stopped and no Rx/Tx is in progress! + * * @param lcore_id * The lcore the Rx queue is polled from. 
* @param port_id From patchwork Mon Jun 28 15:54:12 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94904 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 76685A0A0C; Mon, 28 Jun 2021 17:54:55 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 0CCAF41165; Mon, 28 Jun 2021 17:54:34 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id E6E5D41145 for ; Mon, 28 Jun 2021 17:54:30 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975517" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975517" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:30 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296697" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:28 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt , Ray Kinsella , Neil Horman Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Mon, 28 Jun 2021 15:54:12 +0000 Message-Id: <086144939a5a619f52afbf002d260412bcfad9cf.1624895595.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v4 5/7] power: support callbacks for multiple Rx queues X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently, there is a hard 
limitation on the PMD power management support: it can only handle a single queue per lcore. This is not ideal, as most DPDK use cases will poll multiple queues per core. The PMD power management mechanism relies on ethdev Rx callbacks, so such support is very difficult to implement: callbacks are effectively stateless and have no visibility into what the other ethdev devices are doing. This limits what we can do within the framework of Rx callbacks, but the basics of this implementation are as follows:
- Replace per-queue structures with per-lcore ones, so that any device polled from the same lcore can share data.
- Any queue that is going to be polled from a specific lcore has to be added to that lcore's list of queues to poll, so that the callback is aware of other queues being polled by the same lcore.
- Both the empty poll counter and the actual power saving mechanism are shared between all queues polled on a particular lcore, and power saving is only activated when a specially designated "power saving" queue is polled. To put it another way, we have no idea in what order the user will poll the queues, so we rely on them telling us that queue X is the last one in the polling loop, and any power management should happen there.
- A new API is added to mark a specific Rx queue as "power saving". Failing to call this API will result in no power management. However, with only one queue per core it is obvious which queue is the "power saving" one, so use cases that previously worked without this new API will continue to work without it.
- The limitation on UMWAIT-based polling is not removed, because UMWAIT is incapable of monitoring more than one address.
Also, while we're at it, update and improve the docs.
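The "power saving" queue rule described above can be modelled with a small standalone sketch. This is not the code from the patch: it has no DPDK dependency, the `model_*` names are hypothetical, and the `EMPTYPOLL_MAX` value is assumed purely for illustration. It only shows how a counter shared per lcore behaves when a single designated queue is allowed to trigger power management.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold of consecutive empty polls; the value is assumed for illustration. */
#define EMPTYPOLL_MAX 512

/* Hypothetical stand-in for a (port, queue) pair. */
struct model_queue {
	uint16_t portid;
	uint16_t qid;
};

/* Hypothetical, simplified stand-in for the per-lcore state. */
struct model_core_cfg {
	struct model_queue power_save_queue; /* designated queue, if any */
	bool power_save_queue_set;           /* was a queue designated? */
	size_t n_queues;                     /* queues polled by this lcore */
	uint64_t empty_poll_stats;           /* shared empty poll counter */
};

/* With a single queue the answer is implied; otherwise it must be set. */
static bool
model_is_power_save(const struct model_core_cfg *cfg, struct model_queue q)
{
	assert(cfg != NULL);
	if (cfg->n_queues == 1)
		return true;
	return cfg->power_save_queue_set &&
		cfg->power_save_queue.portid == q.portid &&
		cfg->power_save_queue.qid == q.qid;
}

/*
 * Model of one Rx callback invocation. Returns true when the power saving
 * action (pause, frequency scale-down, ...) would be taken for this poll.
 */
static bool
model_rx_callback(struct model_core_cfg *cfg, struct model_queue q,
		uint16_t nb_rx)
{
	/* traffic on any queue of this lcore resets the shared counter */
	if (nb_rx != 0) {
		cfg->empty_poll_stats = 0;
		return false;
	}
	/* empty polls on other queues are ignored entirely */
	if (!model_is_power_save(cfg, q))
		return false;

	cfg->empty_poll_stats++;
	return cfg->empty_poll_stats > EMPTYPOLL_MAX;
}

/*
 * Simulate an idle polling loop over two queues, polling q_first and then
 * q_last on every iteration. Returns how often power saving would fire.
 */
static unsigned int
model_idle_loop(struct model_core_cfg *cfg, struct model_queue q_first,
		struct model_queue q_last, unsigned int iterations)
{
	unsigned int fired = 0;
	unsigned int i;

	for (i = 0; i < iterations; i++) {
		fired += model_rx_callback(cfg, q_first, 0);
		fired += model_rx_callback(cfg, q_last, 0);
	}
	return fired;
}
```

In this model, two queues with no designated power save queue never trigger power saving, no matter how long the loop idles. Once the last-polled queue is designated, the action fires only after more than `EMPTYPOLL_MAX` consecutive empty iterations, and a non-empty poll on either queue resets the count.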
Signed-off-by: Anatoly Burakov --- Notes: v3: - Move the list of supported NICs to NIC feature table v2: - Use a TAILQ for queues instead of a static array - Address feedback from Konstantin - Add additional checks for stopped queues doc/guides/nics/features.rst | 10 + doc/guides/prog_guide/power_man.rst | 75 +++-- doc/guides/rel_notes/release_21_08.rst | 3 + lib/power/rte_power_pmd_mgmt.c | 381 ++++++++++++++++++++----- lib/power/rte_power_pmd_mgmt.h | 34 +++ lib/power/version.map | 3 + 6 files changed, 412 insertions(+), 94 deletions(-) diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst index 403c2b03a3..a96e12d155 100644 --- a/doc/guides/nics/features.rst +++ b/doc/guides/nics/features.rst @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information. * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``. * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``. +.. _nic_features_get_monitor_addr: + +PMD power management using monitor addresses +-------------------------------------------- + +Supports getting a monitoring condition to use together with Ethernet PMD power +management (see :doc:`../prog_guide/power_man` for more details). + +* **[implements] eth_dev_ops**: ``get_monitor_addr`` + .. _nic_features_other: Other dev ops not represented by a Feature diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index c70ae128ac..fac2c19516 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -198,34 +198,41 @@ Ethernet PMD Power Management API Abstract ~~~~~~~~ -Existing power management mechanisms require developers -to change application design or change code to make use of it. -The PMD power management API provides a convenient alternative -by utilizing Ethernet PMD RX callbacks, -and triggering power saving whenever empty poll count reaches a certain number. 
- -Monitor - This power saving scheme will put the CPU into optimized power state - and use the ``rte_power_monitor()`` function - to monitor the Ethernet PMD RX descriptor address, - and wake the CPU up whenever there's new traffic. - -Pause - This power saving scheme will avoid busy polling - by either entering power-optimized sleep state - with ``rte_power_pause()`` function, - or, if it's not available, use ``rte_pause()``. - -Frequency scaling - This power saving scheme will use ``librte_power`` library - functionality to scale the core frequency up/down - depending on traffic volume. - -.. note:: - - Currently, this power management API is limited to mandatory mapping - of 1 queue to 1 core (multiple queues are supported, - but they must be polled from different cores). +Existing power management mechanisms require developers to change application +design or change code to make use of it. The PMD power management API provides a +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering +power saving whenever empty poll count reaches a certain number. + +* Monitor + This power saving scheme will put the CPU into optimized power state and + monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever + there's new traffic. Support for this scheme may not be available on all + platforms, and further limitations may apply (see below). + +* Pause + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not supported by the underlying platform, use ``rte_pause()``. + +* Frequency scaling + This power saving scheme will use ``librte_power`` library functionality to + scale the core frequency up/down depending on traffic volume. 
+ +The "monitor" mode is only supported in the following configurations and scenarios: + +* If ``rte_cpu_get_intrinsics_support()`` function indicates that + ``rte_power_monitor()`` is supported by the platform, then monitoring will be + limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be + monitored from a different lcore). + +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the + ``rte_power_monitor()`` function is not supported, then monitor mode will not + be supported. + +* Not all Ethernet devices support monitoring, even if the underlying + platform may support the necessary CPU instructions. Please refer to + :doc:`../nics/overview` for more information. + API Overview for Ethernet PMD Power Management ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -234,6 +241,16 @@ API Overview for Ethernet PMD Power Management * **Queue Disable**: Disable power scheme for certain queue/port/core. +* **Set Power Save Queue**: In case of polling multiple queues from one lcore, + designate a specific queue to be the one that triggers power management routines. + +.. note:: + + When using PMD power management with multiple Ethernet Rx queues on one lcore, + it is required to designate one of the configured Rx queues as a "power save" + queue by calling the appropriate API. Failing to do so will result in no + power saving ever taking effect. + References ---------- @@ -242,3 +259,5 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst index f015c509fc..3926d45ef8 100644 --- a/doc/guides/rel_notes/release_21_08.rst +++ b/doc/guides/rel_notes/release_21_08.rst @@ -57,6 +57,9 @@ New Features * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events. 
+* rte_power: The experimental PMD power management API now supports managing + multiple Ethernet Rx queues per lcore. + Removed Items ------------- diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index 9b95cf1794..7762cd39b8 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -33,7 +33,28 @@ enum pmd_mgmt_state { PMD_MGMT_ENABLED }; -struct pmd_queue_cfg { +union queue { + uint32_t val; + struct { + uint16_t portid; + uint16_t qid; + }; +}; + +struct queue_list_entry { + TAILQ_ENTRY(queue_list_entry) next; + union queue queue; +}; + +struct pmd_core_cfg { + TAILQ_HEAD(queue_list_head, queue_list_entry) head; + /**< Which port-queue pairs are associated with this lcore? */ + union queue power_save_queue; + /**< When polling multiple queues, all but this one will be ignored */ + bool power_save_queue_set; + /**< When polling multiple queues, power save queue must be set */ + size_t n_queues; + /**< How many queues are in the list? 
*/ volatile enum pmd_mgmt_state pwr_mgmt_state; /**< State of power management for this queue */ enum rte_power_pmd_mgmt_type cb_mode; @@ -43,8 +64,96 @@ struct pmd_queue_cfg { uint64_t empty_poll_stats; /**< Number of empty polls */ } __rte_cache_aligned; +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE]; -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; +static inline bool +queue_equal(const union queue *l, const union queue *r) +{ + return l->val == r->val; +} + +static inline void +queue_copy(union queue *dst, const union queue *src) +{ + dst->val = src->val; +} + +static inline bool +queue_is_power_save(const struct pmd_core_cfg *cfg, const union queue *q) +{ + const union queue *pwrsave = &cfg->power_save_queue; + + /* if there's only single queue, no need to check anything */ + if (cfg->n_queues == 1) + return true; + return cfg->power_save_queue_set && queue_equal(q, pwrsave); +} + +static struct queue_list_entry * +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *cur; + + TAILQ_FOREACH(cur, &cfg->head, next) { + if (queue_equal(&cur->queue, q)) + return cur; + } + return NULL; +} + +static int +queue_set_power_save(struct pmd_core_cfg *cfg, const union queue *q) +{ + const struct queue_list_entry *found = queue_list_find(cfg, q); + if (found == NULL) + return -ENOENT; + queue_copy(&cfg->power_save_queue, q); + cfg->power_save_queue_set = true; + return 0; +} + +static int +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *qle; + + /* is it already in the list? 
*/ + if (queue_list_find(cfg, q) != NULL) + return -EEXIST; + + qle = malloc(sizeof(*qle)); + if (qle == NULL) + return -ENOMEM; + + queue_copy(&qle->queue, q); + TAILQ_INSERT_TAIL(&cfg->head, qle, next); + cfg->n_queues++; + + return 0; +} + +static int +queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *found; + + found = queue_list_find(cfg, q); + if (found == NULL) + return -ENOENT; + + TAILQ_REMOVE(&cfg->head, found, next); + cfg->n_queues--; + free(found); + + /* if this was a power save queue, unset it */ + if (cfg->power_save_queue_set && queue_is_power_save(cfg, q)) { + union queue *pwrsave = &cfg->power_save_queue; + cfg->power_save_queue_set = false; + pwrsave->val = 0; + } + + return 0; +} static void calc_tsc(void) @@ -79,10 +188,10 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *addr __rte_unused) { + const unsigned int lcore = rte_lcore_id(); + struct pmd_core_cfg *q_conf; - struct pmd_queue_cfg *q_conf; - - q_conf = &port_cfg[port_id][qidx]; + q_conf = &lcore_cfg[lcore]; if (unlikely(nb_rx == 0)) { q_conf->empty_poll_stats++; @@ -107,11 +216,26 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *addr __rte_unused) { - struct pmd_queue_cfg *q_conf; + const unsigned int lcore = rte_lcore_id(); + const union queue q = {.portid = port_id, .qid = qidx}; + const bool empty = nb_rx == 0; + struct pmd_core_cfg *q_conf; - q_conf = &port_cfg[port_id][qidx]; + q_conf = &lcore_cfg[lcore]; - if (unlikely(nb_rx == 0)) { + /* early exit */ + if (likely(!empty)) { + q_conf->empty_poll_stats = 0; + } else { + /* do we care about this particular queue? 
*/ + if (!queue_is_power_save(q_conf, &q)) + return nb_rx; + + /* + * we can increment unconditionally here because if there were + * non-empty polls in other queues assigned to this core, we + * dropped the counter to zero anyway. + */ q_conf->empty_poll_stats++; /* sleep for 1 microsecond */ if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { @@ -127,8 +251,7 @@ clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, rte_pause(); } } - } else - q_conf->empty_poll_stats = 0; + } return nb_rx; } @@ -138,19 +261,33 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *_ __rte_unused) { - struct pmd_queue_cfg *q_conf; + const unsigned int lcore = rte_lcore_id(); + const union queue q = {.portid = port_id, .qid = qidx}; + const bool empty = nb_rx == 0; + struct pmd_core_cfg *q_conf; - q_conf = &port_cfg[port_id][qidx]; + q_conf = &lcore_cfg[lcore]; - if (unlikely(nb_rx == 0)) { + /* early exit */ + if (likely(!empty)) { + q_conf->empty_poll_stats = 0; + + /* scale up freq immediately */ + rte_power_freq_max(rte_lcore_id()); + } else { + /* do we care about this particular queue? */ + if (!queue_is_power_save(q_conf, &q)) + return nb_rx; + + /* + * we can increment unconditionally here because if there were + * non-empty polls in other queues assigned to this core, we + * dropped the counter to zero anyway. 
+ */ q_conf->empty_poll_stats++; if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) /* scale down freq */ rte_power_freq_min(rte_lcore_id()); - } else { - q_conf->empty_poll_stats = 0; - /* scale up freq */ - rte_power_freq_max(rte_lcore_id()); } return nb_rx; @@ -167,11 +304,79 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id) return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED; } +static int +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg) +{ + const struct queue_list_entry *entry; + + TAILQ_FOREACH(entry, &queue_cfg->head, next) { + const union queue *q = &entry->queue; + int ret = queue_stopped(q->portid, q->qid); + if (ret != 1) + return ret; + } + return 1; +} + +static int +check_scale(unsigned int lcore) +{ + enum power_management_env env; + + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + return -ENOTSUP; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore)) + return -EINVAL; + + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + return -ENOTSUP; + } + + /* we're done */ + return 0; +} + +static int +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) +{ + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + return -ENOTSUP; + } + + if (cfg->n_queues > 0) { + RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n"); + return -ENOTSUP; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(qdata->portid, 
qdata->qid, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + return -ENOTSUP; + } + + /* we're done */ + return 0; +} + int rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) { - struct pmd_queue_cfg *queue_cfg; + const union queue qdata = {.portid = port_id, .qid = queue_id}; + struct pmd_core_cfg *queue_cfg; struct rte_eth_dev_info info; rte_rx_callback_fn clb; int ret; @@ -202,9 +407,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, goto end; } - queue_cfg = &port_cfg[port_id][queue_id]; + queue_cfg = &lcore_cfg[lcore_id]; - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + /* check if other queues are stopped as well */ + ret = cfg_queues_stopped(queue_cfg); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + ret = ret < 0 ? -EINVAL : -EBUSY; + goto end; + } + + /* if callback was already enabled, check current callback type */ + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED && + queue_cfg->cb_mode != mode) { ret = -EINVAL; goto end; } @@ -214,53 +429,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, switch (mode) { case RTE_POWER_MGMT_TYPE_MONITOR: - { - struct rte_power_monitor_cond dummy; - - /* check if rte_power_monitor is supported */ - if (!global_data.intrinsics_support.power_monitor) { - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); - ret = -ENOTSUP; + /* check if we can add a new queue */ + ret = check_monitor(queue_cfg, &qdata); + if (ret < 0) goto end; - } - /* check if the device supports the necessary PMD API */ - if (rte_eth_get_monitor_addr(port_id, queue_id, - &dummy) == -ENOTSUP) { - RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); - ret = -ENOTSUP; - goto end; - } clb = clb_umwait; break; - } case RTE_POWER_MGMT_TYPE_SCALE: - { - enum 
power_management_env env; - /* only PSTATE and ACPI modes are supported */ - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && - !rte_power_check_env_supported( - PM_ENV_PSTATE_CPUFREQ)) { - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); - ret = -ENOTSUP; + /* check if we can add a new queue */ + ret = check_scale(lcore_id); + if (ret < 0) goto end; - } - /* ensure we could initialize the power library */ - if (rte_power_init(lcore_id)) { - ret = -EINVAL; - goto end; - } - /* ensure we initialized the correct env */ - env = rte_power_get_env(); - if (env != PM_ENV_ACPI_CPUFREQ && - env != PM_ENV_PSTATE_CPUFREQ) { - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); - ret = -ENOTSUP; - goto end; - } clb = clb_scale_freq; break; - } case RTE_POWER_MGMT_TYPE_PAUSE: /* figure out various time-to-tsc conversions */ if (global_data.tsc_per_us == 0) @@ -273,11 +455,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -EINVAL; goto end; } + /* add this queue to the list */ + ret = queue_list_add(queue_cfg, &qdata); + if (ret < 0) { + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n", + strerror(-ret)); + goto end; + } /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + if (queue_cfg->n_queues == 1) { + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + } queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, NULL); @@ -290,7 +481,8 @@ int rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id) { - struct pmd_queue_cfg *queue_cfg; + const union queue qdata = {.portid = port_id, .qid = queue_id}; + struct pmd_core_cfg *queue_cfg; int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); @@ -306,13 +498,31 @@ 
rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, } /* no need to check queue id as wrong queue id would not be enabled */ - queue_cfg = &port_cfg[port_id][queue_id]; + queue_cfg = &lcore_cfg[lcore_id]; + + /* check if other queues are stopped as well */ + ret = cfg_queues_stopped(queue_cfg); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + return ret < 0 ? -EINVAL : -EBUSY; + } if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) return -EINVAL; - /* stop any callbacks from progressing */ - queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + /* + * There is no good/easy way to do this without race conditions, so we + * are just going to throw our hands in the air and hope that the user + * has read the documentation and has ensured that ports are stopped at + * the time we enter the API functions. + */ + ret = queue_list_remove(queue_cfg, &qdata); + if (ret < 0) + return -ret; + + /* if we've removed all queues from the lists, set state to disabled */ + if (queue_cfg->n_queues == 0) + queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; switch (queue_cfg->cb_mode) { case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */ @@ -336,3 +546,42 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, return 0; } + +int +rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id) +{ + const union queue qdata = {.portid = port_id, .qid = queue_id}; + struct pmd_core_cfg *queue_cfg; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + + if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) + return -EINVAL; + + /* no need to check queue id as wrong queue id would not be enabled */ + queue_cfg = &lcore_cfg[lcore_id]; + + if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + return -EINVAL; + + ret = queue_set_power_save(queue_cfg, &qdata); + if (ret < 0) { + RTE_LOG(DEBUG, POWER, "Failed to set power save queue: %s\n", + strerror(-ret)); + return -ret; + } + + 
return 0; +} + +RTE_INIT(rte_power_ethdev_pmgmt_init) { + size_t i; + + /* initialize all tailqs */ + for (i = 0; i < RTE_DIM(lcore_cfg); i++) { + struct pmd_core_cfg *cfg = &lcore_cfg[i]; + TAILQ_INIT(&cfg->head); + } +} diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h index 444e7b8a66..d6ef8f778a 100644 --- a/lib/power/rte_power_pmd_mgmt.h +++ b/lib/power/rte_power_pmd_mgmt.h @@ -90,6 +90,40 @@ int rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id); +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice. + * + * Set a specific Ethernet device Rx queue to be the "power save" queue for a + * particular lcore. When multiple queues are assigned to a single lcore using + * the `rte_power_ethdev_pmgmt_queue_enable` API, only one of them will trigger + * the power management. In a typical scenario, the last queue to be polled on + * a particular lcore should be designated as power save queue. + * + * @note This function is not thread-safe. + * + * @note When using multiple queues per lcore, calling this function is + * mandatory. If not called, no power management routines would be triggered + * when the traffic starts. + * + * @warning This function must be called when all affected Ethernet ports are + * stopped and no Rx/Tx is in progress! + * + * @param lcore_id + * The lcore the Rx queue is polled from. + * @param port_id + * The port identifier of the Ethernet device. + * @param queue_id + * The queue identifier of the Ethernet device. 
+ * @return + * 0 on success + * <0 on error + */ +__rte_experimental +int +rte_power_ethdev_pmgmt_queue_set_power_save(unsigned int lcore_id, + uint16_t port_id, uint16_t queue_id); + #ifdef __cplusplus } #endif diff --git a/lib/power/version.map b/lib/power/version.map index b004e3e4a9..105d1d94c2 100644 --- a/lib/power/version.map +++ b/lib/power/version.map @@ -38,4 +38,7 @@ EXPERIMENTAL { # added in 21.02 rte_power_ethdev_pmgmt_queue_disable; rte_power_ethdev_pmgmt_queue_enable; + + # added in 21.08 + rte_power_ethdev_pmgmt_queue_set_power_save; }; From patchwork Mon Jun 28 15:54:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94905 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 68E59A0A0C; Mon, 28 Jun 2021 17:55:04 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9B3F64116D; Mon, 28 Jun 2021 17:54:43 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id 6C9B94115E for ; Mon, 28 Jun 2021 17:54:32 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975518" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975518" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:31 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296703" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:30 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Mon, 28 Jun 2021 15:54:13 
+0000 Message-Id: <94d79ab2b78d18048d71cd2980a9667bc35afb07.1624895595.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v4 6/7] power: support monitoring multiple Rx queues X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Use the new multi-monitor intrinsic to allow monitoring multiple ethdev Rx queues while entering the energy efficient power state. The multi version will be used unconditionally if supported, and the UMWAIT one will only be used when multi-monitor is not supported by the hardware. Signed-off-by: Anatoly Burakov --- Notes: v4: - Fix possible out of bounds access - Added missing index increment doc/guides/prog_guide/power_man.rst | 9 ++-- lib/power/rte_power_pmd_mgmt.c | 84 ++++++++++++++++++++++++++++- 2 files changed, 88 insertions(+), 5 deletions(-) diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index fac2c19516..3245a5ebed 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number. The "monitor" mode is only supported in the following configurations and scenarios: * If ``rte_cpu_get_intrinsics_support()`` function indicates that + ``rte_power_monitor_multi()`` function is supported by the platform, then + monitoring multiple Ethernet Rx queues for traffic will be supported. + +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only ``rte_power_monitor()`` is supported by the platform, then monitoring will be limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be monitored from a different lcore). 
-* If ``rte_cpu_get_intrinsics_support()`` function indicates that the - ``rte_power_monitor()`` function is not supported, then monitor mode will not - be supported. +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the + two monitoring functions is supported, then monitor mode will not be supported. * Not all Ethernet devices support monitoring, even if the underlying platform may support the necessary CPU instructions. Please refer to diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index 7762cd39b8..97c9f1ea36 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -155,6 +155,32 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q) return 0; } +static inline int +get_monitor_addresses(struct pmd_core_cfg *cfg, + struct rte_power_monitor_cond *pmc, size_t len) +{ + const struct queue_list_entry *qle; + size_t i = 0; + int ret; + + TAILQ_FOREACH(qle, &cfg->head, next) { + const union queue *q = &qle->queue; + struct rte_power_monitor_cond *cur; + + /* attempted out of bounds access */ + if (i >= len) { + RTE_LOG(ERR, POWER, "Too many queues being monitored\n"); + return -1; + } + + cur = &pmc[i++]; + ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur); + if (ret < 0) + return ret; + } + return 0; +} + static void calc_tsc(void) { @@ -183,6 +209,48 @@ calc_tsc(void) } } +static uint16_t +clb_multiwait(uint16_t port_id, uint16_t qidx, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *addr __rte_unused) +{ + const unsigned int lcore = rte_lcore_id(); + const union queue q = {.portid = port_id, .qid = qidx}; + const bool empty = nb_rx == 0; + struct pmd_core_cfg *q_conf; + + q_conf = &lcore_cfg[lcore]; + + /* early exit */ + if (likely(!empty)) { + q_conf->empty_poll_stats = 0; + } else { + /* do we care about this particular queue? 
*/ + if (!queue_is_power_save(q_conf, &q)) + return nb_rx; + + /* + * we can increment unconditionally here because if there were + * non-empty polls in other queues assigned to this core, we + * dropped the counter to zero anyway. + */ + q_conf->empty_poll_stats++; + if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS]; + int ret; + + /* gather all monitoring conditions */ + ret = get_monitor_addresses(q_conf, pmc, RTE_DIM(pmc)); + + if (ret == 0) + rte_power_monitor_multi(pmc, + q_conf->n_queues, UINT64_MAX); + } + } + + return nb_rx; +} + static uint16_t clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, uint16_t max_pkts __rte_unused, @@ -348,14 +416,19 @@ static int check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) { struct rte_power_monitor_cond dummy; + bool multimonitor_supported; /* check if rte_power_monitor is supported */ if (!global_data.intrinsics_support.power_monitor) { RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); return -ENOTSUP; } + /* check if multi-monitor is supported */ + multimonitor_supported = + global_data.intrinsics_support.power_monitor_multi; - if (cfg->n_queues > 0) { + /* if we're adding a new queue, do we support multiple queues? */ + if (cfg->n_queues > 0 && !multimonitor_supported) { RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n"); return -ENOTSUP; } @@ -371,6 +444,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) return 0; } +static inline rte_rx_callback_fn +get_monitor_callback(void) +{ + return global_data.intrinsics_support.power_monitor_multi ? 
+ clb_multiwait : clb_umwait; +} + int rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) @@ -434,7 +514,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, if (ret < 0) goto end; - clb = clb_umwait; + clb = get_monitor_callback(); break; case RTE_POWER_MGMT_TYPE_SCALE: /* check if we can add a new queue */ From patchwork Mon Jun 28 15:54:14 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anatoly Burakov X-Patchwork-Id: 94906 Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id BDD0AA0A0C; Mon, 28 Jun 2021 17:55:10 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id BCCA741172; Mon, 28 Jun 2021 17:54:44 +0200 (CEST) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by mails.dpdk.org (Postfix) with ESMTP id C6B1141163 for ; Mon, 28 Jun 2021 17:54:33 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="204975524" X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="204975524" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 28 Jun 2021 08:54:33 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,306,1616482800"; d="scan'208";a="625296713" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga005.jf.intel.com with ESMTP; 28 Jun 2021 08:54:32 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Mon, 28 Jun 2021 15:54:14 +0000 Message-Id: <514d89a9555ff38ea953662d3c8eb262bce7560f.1624895595.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 
In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v4 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently, l3fwd-power enforces the limitation of having one queue per lcore. This is no longer necessary, so remove the limitation, and always mark the last queue in qconf as the power save queue. Signed-off-by: Anatoly Burakov --- examples/l3fwd-power/main.c | 39 +++++++++++++++++++++++-------------- 1 file changed, 24 insertions(+), 15 deletions(-) diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index f8dfed1634..3057c06936 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -2498,6 +2498,27 @@ mode_to_str(enum appmode mode) } } +static void +pmd_pmgmt_set_up(unsigned int lcore, uint16_t portid, uint16_t qid, bool last) +{ + int ret; + + ret = rte_power_ethdev_pmgmt_queue_enable(lcore, portid, + qid, pmgmt_type); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n", + ret, portid); + + if (!last) + return; + ret = rte_power_ethdev_pmgmt_queue_set_power_save(lcore, portid, qid); + if (ret < 0) + rte_exit(EXIT_FAILURE, + "rte_power_ethdev_pmgmt_queue_set_power_save: err=%d, port=%d\n", + ret, portid); +} + int main(int argc, char **argv) { @@ -2723,12 +2744,6 @@ main(int argc, char **argv) printf("\nInitializing rx queues on lcore %u ... 
", lcore_id ); fflush(stdout); - /* PMD power management mode can only do 1 queue per core */ - if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { - rte_exit(EXIT_FAILURE, - "In PMD power management mode, only one queue per lcore is allowed\n"); - } - /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf; @@ -2767,15 +2782,9 @@ main(int argc, char **argv) "Fail to add ptype cb\n"); } - if (app_mode == APP_MODE_PMD_MGMT) { - ret = rte_power_ethdev_pmgmt_queue_enable( - lcore_id, portid, queueid, - pmgmt_type); - if (ret < 0) - rte_exit(EXIT_FAILURE, - "rte_power_ethdev_pmgmt_queue_enable: err=%d, port=%d\n", - ret, portid); - } + if (app_mode == APP_MODE_PMD_MGMT) + pmd_pmgmt_set_up(lcore_id, portid, queueid, + queue == (qconf->n_rx_queue - 1)); } }
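
The queue-selection rule in the l3fwd-power change above (enable power management on every Rx queue of an lcore, then mark only the last polled queue as the power save queue) can be sketched in isolation. This is a minimal illustration, not the patch's code: stub_queue_enable(), stub_queue_set_power_save(), setup_lcore_queues() and the two counters are hypothetical stand-ins so the sketch builds without DPDK; a real application would call rte_power_ethdev_pmgmt_queue_enable() and rte_power_ethdev_pmgmt_queue_set_power_save() instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping: what the stubs below were asked to do. */
static unsigned int n_enabled;
static int power_save_qid = -1;

/* Hypothetical stand-in for rte_power_ethdev_pmgmt_queue_enable(). */
static int
stub_queue_enable(unsigned int lcore, uint16_t port, uint16_t qid)
{
	(void)lcore; (void)port; (void)qid;
	n_enabled++;
	return 0;
}

/* Hypothetical stand-in for rte_power_ethdev_pmgmt_queue_set_power_save(). */
static int
stub_queue_set_power_save(unsigned int lcore, uint16_t port, uint16_t qid)
{
	(void)lcore; (void)port;
	power_save_qid = qid;
	return 0;
}

/*
 * Enable power management on all Rx queues polled by one lcore and
 * designate the last one as the power save queue, mirroring the
 * "queue == (qconf->n_rx_queue - 1)" test in the patch.
 */
static int
setup_lcore_queues(unsigned int lcore, uint16_t port, uint16_t n_rx_queue)
{
	uint16_t q;

	for (q = 0; q < n_rx_queue; q++) {
		const bool last = (q == n_rx_queue - 1);

		if (stub_queue_enable(lcore, port, q) < 0)
			return -1;
		if (last && stub_queue_set_power_save(lcore, port, q) < 0)
			return -1;
	}
	return 0;
}
```

Calling setup_lcore_queues(1, 0, 3) in this sketch enables three queues and records queue 2 as the power save queue. Choosing the last queue follows the API comment earlier in the series, which suggests that the last queue polled on an lcore is typically the one to designate.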