From patchwork Tue Jun 29 15:48:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94982 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 64B66A0C3F; Tue, 29 Jun 2021 17:49:46 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 08B5E411D6; Tue, 29 Jun 2021 17:49:33 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id 1C21C411C0 for ; Tue, 29 Jun 2021 17:49:26 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304821" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304821" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:36 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213596" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:33 -0700 From: Anatoly Burakov To: dev@dpdk.org, Timothy McDaniel , Beilei Xing , Jingjing Wu , Qiming Yang , Qi Zhang , Haiyue Wang , Matan Azrad , Shahaf Shuler , Viacheslav Ovsiienko , Bruce Richardson , Konstantin Ananyev Cc: david.hunt@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:24 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 1/7] power_intrinsics: use callbacks for comparison X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Previously, the semantics of power monitor were such that we were checking current value against the expected value, and if they matched, then the sleep was aborted. This is somewhat inflexible, because it only allowed us to check for a specific value in a specific way. This commit replaces the comparison with a user callback mechanism, so that any PMD (or other code) using `rte_power_monitor()` can define their own comparison semantics and decision making on how to detect the need to abort the entering of power optimized state. Existing implementations are adjusted to follow the new semantics. 
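For reference, a minimal illustration of the new semantics (hypothetical code, not part of this patch; assumes <rte_power_intrinsics.h> is included): the caller stashes its comparison parameters in the opaque data of the monitor condition and supplies a callback that decides whether the sleep should be aborted.

static int
example_monitor_callback(const uint64_t val,
		const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
{
	const uint64_t mask = opaque[0];     /* mask stored when the condition was set up */
	const uint64_t expected = opaque[1]; /* expected value after masking */

	/* return -1 to abort entering the power optimized state, 0 to proceed */
	return (val & mask) == expected ? -1 : 0;
}

static void
example_setup_monitor(struct rte_power_monitor_cond *pmc,
		volatile void *status_word, uint64_t mask, uint64_t expected)
{
	pmc->addr = status_word;   /* hypothetical monitored location */
	pmc->size = sizeof(uint64_t);
	pmc->opaque[0] = mask;
	pmc->opaque[1] = expected;
	pmc->fn = example_monitor_callback;
}

The per-driver changes below follow this same pattern, each with its own comparison logic.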
Suggested-by: Konstantin Ananyev Signed-off-by: Anatoly Burakov Acked-by: Konstantin Ananyev --- Notes: v4: - Return error if callback is set to NULL - Replace raw number with a macro in monitor condition opaque data v2: - Use callback mechanism for more flexibility - Address feedback from Konstantin doc/guides/rel_notes/release_21_08.rst | 1 + drivers/event/dlb2/dlb2.c | 17 ++++++++-- drivers/net/i40e/i40e_rxtx.c | 20 +++++++---- drivers/net/iavf/iavf_rxtx.c | 20 +++++++---- drivers/net/ice/ice_rxtx.c | 20 +++++++---- drivers/net/ixgbe/ixgbe_rxtx.c | 20 +++++++---- drivers/net/mlx5/mlx5_rx.c | 17 ++++++++-- .../include/generic/rte_power_intrinsics.h | 33 +++++++++++++++---- lib/eal/x86/rte_power_intrinsics.c | 17 +++++----- 9 files changed, 121 insertions(+), 44 deletions(-) diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst index a6ecfdf3ce..c84ac280f5 100644 --- a/doc/guides/rel_notes/release_21_08.rst +++ b/doc/guides/rel_notes/release_21_08.rst @@ -84,6 +84,7 @@ API Changes Also, make sure to start the actual text at the margin. ======================================================= +* eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism. ABI Changes ----------- diff --git a/drivers/event/dlb2/dlb2.c b/drivers/event/dlb2/dlb2.c index eca183753f..252bbd8d5e 100644 --- a/drivers/event/dlb2/dlb2.c +++ b/drivers/event/dlb2/dlb2.c @@ -3154,6 +3154,16 @@ dlb2_port_credits_inc(struct dlb2_port *qm_port, int num) } } +#define CLB_MASK_IDX 0 +#define CLB_VAL_IDX 1 +static int +dlb2_monitor_callback(const uint64_t val, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + /* abort if the value matches */ + return (val & opaque[CLB_MASK_IDX]) == opaque[CLB_VAL_IDX] ? -1 : 0; +} + static inline int dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, struct dlb2_eventdev_port *ev_port, @@ -3194,8 +3204,11 @@ dlb2_dequeue_wait(struct dlb2_eventdev *dlb2, expected_value = 0; pmc.addr = monitor_addr; - pmc.val = expected_value; - pmc.mask = qe_mask.raw_qe[1]; + /* store expected value and comparison mask in opaque data */ + pmc.opaque[CLB_VAL_IDX] = expected_value; + pmc.opaque[CLB_MASK_IDX] = qe_mask.raw_qe[1]; + /* set up callback */ + pmc.fn = dlb2_monitor_callback; pmc.size = sizeof(uint64_t); rte_power_monitor(&pmc, timeout + start_ticks); diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c index 6c58decece..081682f88b 100644 --- a/drivers/net/i40e/i40e_rxtx.c +++ b/drivers/net/i40e/i40e_rxtx.c @@ -81,6 +81,18 @@ #define I40E_TX_OFFLOAD_SIMPLE_NOTSUP_MASK \ (PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_SIMPLE_SUP_MASK) +static int +i40e_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -93,12 +105,8 @@ i40e_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.qword1.status_error_len; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. 
- */ - pmc->val = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); - pmc->mask = rte_cpu_to_le_64(1 << I40E_RX_DESC_STATUS_DD_SHIFT); + /* comparison callback */ + pmc->fn = i40e_monitor_callback; /* registers are 64-bit */ pmc->size = sizeof(uint64_t); diff --git a/drivers/net/iavf/iavf_rxtx.c b/drivers/net/iavf/iavf_rxtx.c index 0361af0d85..7ed196ec22 100644 --- a/drivers/net/iavf/iavf_rxtx.c +++ b/drivers/net/iavf/iavf_rxtx.c @@ -57,6 +57,18 @@ iavf_proto_xtr_type_to_rxdid(uint8_t flex_type) rxdid_map[flex_type] : IAVF_RXDID_COMMS_OVS_1; } +static int +iavf_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -69,12 +81,8 @@ iavf_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.qword1.status_error_len; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. - */ - pmc->val = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); - pmc->mask = rte_cpu_to_le_64(1 << IAVF_RX_DESC_STATUS_DD_SHIFT); + /* comparison callback */ + pmc->fn = iavf_monitor_callback; /* registers are 64-bit */ pmc->size = sizeof(uint64_t); diff --git a/drivers/net/ice/ice_rxtx.c b/drivers/net/ice/ice_rxtx.c index fc9bb5a3e7..d12437d19d 100644 --- a/drivers/net/ice/ice_rxtx.c +++ b/drivers/net/ice/ice_rxtx.c @@ -27,6 +27,18 @@ uint64_t rte_net_ice_dynflag_proto_xtr_ipv6_flow_mask; uint64_t rte_net_ice_dynflag_proto_xtr_tcp_mask; uint64_t rte_net_ice_dynflag_proto_xtr_ip_offset_mask; +static int +ice_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? -1 : 0; +} + int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -39,12 +51,8 @@ ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.status_error0; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. - */ - pmc->val = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); - pmc->mask = rte_cpu_to_le_16(1 << ICE_RX_FLEX_DESC_STATUS0_DD_S); + /* comparison callback */ + pmc->fn = ice_monitor_callback; /* register is 16-bit */ pmc->size = sizeof(uint16_t); diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index d69f36e977..c814a28cb4 100644 --- a/drivers/net/ixgbe/ixgbe_rxtx.c +++ b/drivers/net/ixgbe/ixgbe_rxtx.c @@ -1369,6 +1369,18 @@ const uint32_t RTE_PTYPE_INNER_L3_IPV4_EXT | RTE_PTYPE_INNER_L4_UDP, }; +static int +ixgbe_monitor_callback(const uint64_t value, + const uint64_t arg[RTE_POWER_MONITOR_OPAQUE_SZ] __rte_unused) +{ + const uint64_t m = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + /* + * we expect the DD bit to be set to 1 if this descriptor was already + * written to. + */ + return (value & m) == m ? 
-1 : 0; +} + int ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { @@ -1381,12 +1393,8 @@ ixgbe_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) /* watch for changes in status bit */ pmc->addr = &rxdp->wb.upper.status_error; - /* - * we expect the DD bit to be set to 1 if this descriptor was already - * written to. - */ - pmc->val = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); - pmc->mask = rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD); + /* comparison callback */ + pmc->fn = ixgbe_monitor_callback; /* the registers are 32-bit */ pmc->size = sizeof(uint32_t); diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c index 777a1d6e45..17370b77dc 100644 --- a/drivers/net/mlx5/mlx5_rx.c +++ b/drivers/net/mlx5/mlx5_rx.c @@ -269,6 +269,18 @@ mlx5_rx_queue_count(struct rte_eth_dev *dev, uint16_t rx_queue_id) return rx_queue_count(rxq); } +#define CLB_VAL_IDX 0 +#define CLB_MSK_IDX 1 +static int +mlx_monitor_callback(const uint64_t value, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + const uint64_t m = opaque[CLB_MSK_IDX]; + const uint64_t v = opaque[CLB_VAL_IDX]; + + return (value & m) == v ? -1 : 0; +} + int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) { struct mlx5_rxq_data *rxq = rx_queue; @@ -282,8 +294,9 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) return -rte_errno; } pmc->addr = &cqe->op_own; - pmc->val = !!idx; - pmc->mask = MLX5_CQE_OWNER_MASK; + pmc->opaque[CLB_VAL_IDX] = !!idx; + pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK; + pmc->fn = mlx_monitor_callback; pmc->size = sizeof(uint8_t); return 0; } diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h index dddca3d41c..c9aa52a86d 100644 --- a/lib/eal/include/generic/rte_power_intrinsics.h +++ b/lib/eal/include/generic/rte_power_intrinsics.h @@ -18,19 +18,38 @@ * which are architecture-dependent. */ +/** Size of the opaque data in monitor condition */ +#define RTE_POWER_MONITOR_OPAQUE_SZ 4 + +/** + * Callback definition for monitoring conditions. Callbacks with this signature + * will be used by `rte_power_monitor()` to check if the entering of power + * optimized state should be aborted. + * + * @param val + * The value read from memory. + * @param opaque + * Callback-specific data. + * + * @return + * 0 if entering of power optimized state should proceed + * -1 if entering of power optimized state should be aborted + */ +typedef int (*rte_power_monitor_clb_t)(const uint64_t val, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]); struct rte_power_monitor_cond { volatile void *addr; /**< Address to monitor for changes */ - uint64_t val; /**< If the `mask` is non-zero, location pointed - * to by `addr` will be read and compared - * against this value. - */ - uint64_t mask; /**< 64-bit mask to extract value read from `addr` */ - uint8_t size; /**< Data size (in bytes) that will be used to compare - * expected value (`val`) with data read from the + uint8_t size; /**< Data size (in bytes) that will be read from the * monitored memory location (`addr`). Can be 1, 2, * 4, or 8. Supplying any other value will result in * an error. */ + rte_power_monitor_clb_t fn; /**< Callback to be used to check if + * entering power optimized state should + * be aborted. 
+ */ + uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]; + /**< Callback-specific data */ }; /** diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c index 39ea9fdecd..66fea28897 100644 --- a/lib/eal/x86/rte_power_intrinsics.c +++ b/lib/eal/x86/rte_power_intrinsics.c @@ -76,6 +76,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32); const unsigned int lcore_id = rte_lcore_id(); struct power_wait_status *s; + uint64_t cur_value; /* prevent user from running this instruction if it's not supported */ if (!wait_supported) @@ -91,6 +92,9 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, if (__check_val_size(pmc->size) < 0) return -EINVAL; + if (pmc->fn == NULL) + return -EINVAL; + s = &wait_status[lcore_id]; /* update sleep address */ @@ -110,16 +114,11 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc, /* now that we've put this address into monitor, we can unlock */ rte_spinlock_unlock(&s->lock); - /* if we have a comparison mask, we might not need to sleep at all */ - if (pmc->mask) { - const uint64_t cur_value = __get_umwait_val( - pmc->addr, pmc->size); - const uint64_t masked = cur_value & pmc->mask; + cur_value = __get_umwait_val(pmc->addr, pmc->size); - /* if the masked value is already matching, abort */ - if (masked == pmc->val) - goto end; - } + /* check if callback indicates we should abort */ + if (pmc->fn(cur_value, pmc->opaque) != 0) + goto end; /* execute UMWAIT */ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;" From patchwork Tue Jun 29 15:48:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94979 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 1B226A0C3F; Tue, 29 Jun 2021 17:49:27 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id D9024411A5; Tue, 29 Jun 2021 17:49:26 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id 83DAB40E01 for ; Tue, 29 Jun 2021 17:49:25 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304827" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304827" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:38 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213608" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:36 -0700 From: Anatoly Burakov To: dev@dpdk.org, Ciara Loftus , Qi Zhang Cc: david.hunt@intel.com, konstantin.ananyev@intel.com Date: Tue, 29 Jun 2021 15:48:25 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 2/7] net/af_xdp: add power monitor support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Implement support for .get_monitor_addr in AF_XDP driver. 
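For context, once a driver implements .get_monitor_addr, a polling thread (or the PMD power management library) can sleep on the queue roughly as follows; this is an illustrative sketch with error handling omitted, using the existing rte_eth_get_monitor_addr() and rte_power_monitor() APIs:

	struct rte_power_monitor_cond pmc;

	/* ask the driver for a monitor condition on this Rx queue */
	if (rte_eth_get_monitor_addr(port_id, queue_id, &pmc) == 0)
		/* sleep until the monitored address (here, the AF_XDP
		 * producer ring index) is written to, or until wakeup
		 */
		rte_power_monitor(&pmc, UINT64_MAX);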
Signed-off-by: Anatoly Burakov --- Notes: v2: - Rewrite using the callback mechanism drivers/net/af_xdp/rte_eth_af_xdp.c | 34 +++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c index eb5660a3dc..7830d0c23a 100644 --- a/drivers/net/af_xdp/rte_eth_af_xdp.c +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c @@ -37,6 +37,7 @@ #include #include #include +#include #include "compat.h" @@ -788,6 +789,38 @@ eth_dev_configure(struct rte_eth_dev *dev) return 0; } +#define CLB_VAL_IDX 0 +static int +eth_monitor_callback(const uint64_t value, + const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ]) +{ + const uint64_t v = opaque[CLB_VAL_IDX]; + const uint64_t m = (uint32_t)~0; + + /* if the value has changed, abort entering power optimized state */ + return (value & m) == v ? 0 : -1; +} + +static int +eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc) +{ + struct pkt_rx_queue *rxq = rx_queue; + unsigned int *prod = rxq->rx.producer; + const uint32_t cur_val = rxq->rx.cached_prod; /* use cached value */ + + /* watch for changes in producer ring */ + pmc->addr = (void*)prod; + + /* store current value */ + pmc->opaque[CLB_VAL_IDX] = cur_val; + pmc->fn = eth_monitor_callback; + + /* AF_XDP producer ring index is 32-bit */ + pmc->size = sizeof(uint32_t); + + return 0; +} + static int eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) { @@ -1448,6 +1481,7 @@ static const struct eth_dev_ops ops = { .link_update = eth_link_update, .stats_get = eth_stats_get, .stats_reset = eth_stats_reset, + .get_monitor_addr = eth_get_monitor_addr }; /** parse busy_budget argument */ From patchwork Tue Jun 29 15:48:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94981 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 76212A0C3F; Tue, 29 Jun 2021 17:49:39 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id DA4F6411CC; Tue, 29 Jun 2021 17:49:31 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id B91D74117E for ; Tue, 29 Jun 2021 17:49:26 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304839" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304839" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213623" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:38 -0700 From: Anatoly Burakov To: dev@dpdk.org, Jan Viktorin , Ruifeng Wang , Jerin Jacob , David Christensen , Ray Kinsella , Neil Horman , Bruce Richardson , Konstantin Ananyev Cc: david.hunt@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:26 +0000 Message-Id: <0ddcda8c9a67ad214a8f98c851d10a279920f581.1624981670.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 3/7] eal: add power monitor for multiple 
events X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Use RTM and WAITPKG instructions to perform a wait-for-writes similar to what UMWAIT does, but without the limitation of having to listen for just one event. This works because the optimized power state used by the TPAUSE instruction will cause a wake up on RTM transaction abort, so if we add the addresses we're interested in to the read-set, any write to those addresses will wake us up. Signed-off-by: Konstantin Ananyev Signed-off-by: Anatoly Burakov --- Notes: v4: - Fixed bugs in accessing the monitor condition - Abort on any monitor condition not having a defined callback v2: - Adapt to callback mechanism doc/guides/rel_notes/release_21_08.rst | 2 + lib/eal/arm/rte_power_intrinsics.c | 11 +++ lib/eal/include/generic/rte_cpuflags.h | 2 + .../include/generic/rte_power_intrinsics.h | 35 +++++++++ lib/eal/ppc/rte_power_intrinsics.c | 11 +++ lib/eal/version.map | 3 + lib/eal/x86/rte_cpuflags.c | 2 + lib/eal/x86/rte_power_intrinsics.c | 73 +++++++++++++++++++ 8 files changed, 139 insertions(+) diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst index c84ac280f5..9d1cfac395 100644 --- a/doc/guides/rel_notes/release_21_08.rst +++ b/doc/guides/rel_notes/release_21_08.rst @@ -55,6 +55,8 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* eal: added ``rte_power_monitor_multi`` to support waiting for multiple events. + Removed Items ------------- diff --git a/lib/eal/arm/rte_power_intrinsics.c b/lib/eal/arm/rte_power_intrinsics.c index e83f04072a..78f55b7203 100644 --- a/lib/eal/arm/rte_power_intrinsics.c +++ b/lib/eal/arm/rte_power_intrinsics.c @@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return -ENOTSUP; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + RTE_SET_USED(pmc); + RTE_SET_USED(num); + RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; +} diff --git a/lib/eal/include/generic/rte_cpuflags.h b/lib/eal/include/generic/rte_cpuflags.h index 28a5aecde8..d35551e931 100644 --- a/lib/eal/include/generic/rte_cpuflags.h +++ b/lib/eal/include/generic/rte_cpuflags.h @@ -24,6 +24,8 @@ struct rte_cpu_intrinsics { /**< indicates support for rte_power_monitor function */ uint32_t power_pause : 1; /**< indicates support for rte_power_pause function */ + uint32_t power_monitor_multi : 1; + /**< indicates support for rte_power_monitor_multi function */ }; /** diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h index c9aa52a86d..04e8c2ab37 100644 --- a/lib/eal/include/generic/rte_power_intrinsics.h +++ b/lib/eal/include/generic/rte_power_intrinsics.h @@ -128,4 +128,39 @@ int rte_power_monitor_wakeup(const unsigned int lcore_id); __rte_experimental int rte_power_pause(const uint64_t tsc_timestamp); +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Monitor a set of addresses for changes. This will cause the CPU to enter an + * architecture-defined optimized power state until either one of the specified + * memory addresses is written to, a certain TSC timestamp is reached, or other + * reasons cause the CPU to wake up. 
+ * + * Additionally, `expected` 64-bit values and 64-bit masks are provided. If + * mask is non-zero, the current value pointed to by the `p` pointer will be + * checked against the expected value, and if they do not match, the entering of + * optimized power state may be aborted. + * + * @warning It is responsibility of the user to check if this function is + * supported at runtime using `rte_cpu_get_intrinsics_support()` API call. + * Failing to do so may result in an illegal CPU instruction error. + * + * @param pmc + * An array of monitoring condition structures. + * @param num + * Length of the `pmc` array. + * @param tsc_timestamp + * Maximum TSC timestamp to wait for. Note that the wait behavior is + * architecture-dependent. + * + * @return + * 0 on success + * -EINVAL on invalid parameters + * -ENOTSUP if unsupported + */ +__rte_experimental +int rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp); + #endif /* _RTE_POWER_INTRINSIC_H_ */ diff --git a/lib/eal/ppc/rte_power_intrinsics.c b/lib/eal/ppc/rte_power_intrinsics.c index 7fc9586da7..f00b58ade5 100644 --- a/lib/eal/ppc/rte_power_intrinsics.c +++ b/lib/eal/ppc/rte_power_intrinsics.c @@ -38,3 +38,14 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return -ENOTSUP; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + RTE_SET_USED(pmc); + RTE_SET_USED(num); + RTE_SET_USED(tsc_timestamp); + + return -ENOTSUP; +} diff --git a/lib/eal/version.map b/lib/eal/version.map index fe5c3dac98..4ccd5475d6 100644 --- a/lib/eal/version.map +++ b/lib/eal/version.map @@ -423,6 +423,9 @@ EXPERIMENTAL { rte_version_release; # WINDOWS_NO_EXPORT rte_version_suffix; # WINDOWS_NO_EXPORT rte_version_year; # WINDOWS_NO_EXPORT + + # added in 21.08 + rte_power_monitor_multi; # WINDOWS_NO_EXPORT }; INTERNAL { diff --git a/lib/eal/x86/rte_cpuflags.c b/lib/eal/x86/rte_cpuflags.c index a96312ff7f..d339734a8c 100644 --- a/lib/eal/x86/rte_cpuflags.c +++ b/lib/eal/x86/rte_cpuflags.c @@ -189,5 +189,7 @@ rte_cpu_get_intrinsics_support(struct rte_cpu_intrinsics *intrinsics) if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) { intrinsics->power_monitor = 1; intrinsics->power_pause = 1; + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RTM)) + intrinsics->power_monitor_multi = 1; } } diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c index 66fea28897..f749da9b85 100644 --- a/lib/eal/x86/rte_power_intrinsics.c +++ b/lib/eal/x86/rte_power_intrinsics.c @@ -4,6 +4,7 @@ #include #include +#include #include #include "rte_power_intrinsics.h" @@ -28,6 +29,7 @@ __umwait_wakeup(volatile void *addr) } static bool wait_supported; +static bool wait_multi_supported; static inline uint64_t __get_umwait_val(const volatile void *p, const uint8_t sz) @@ -166,6 +168,8 @@ RTE_INIT(rte_power_intrinsics_init) { if (i.power_monitor && i.power_pause) wait_supported = 1; + if (i.power_monitor_multi) + wait_multi_supported = 1; } int @@ -204,6 +208,9 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) * In this case, since we've already woken up, the "wakeup" was * unneeded, and since T1 is still waiting on T2 releasing the lock, the * wakeup address is still valid so it's perfectly safe to write it. + * + * For multi-monitor case, the act of locking will in itself trigger the + * wakeup, so no additional writes necessary. 
*/ rte_spinlock_lock(&s->lock); if (s->monitor_addr != NULL) @@ -212,3 +219,69 @@ rte_power_monitor_wakeup(const unsigned int lcore_id) return 0; } + +int +rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[], + const uint32_t num, const uint64_t tsc_timestamp) +{ + const unsigned int lcore_id = rte_lcore_id(); + struct power_wait_status *s = &wait_status[lcore_id]; + uint32_t i, rc; + + /* check if supported */ + if (!wait_multi_supported) + return -ENOTSUP; + + if (pmc == NULL || num == 0) + return -EINVAL; + + /* we are already inside transaction region, return */ + if (rte_xtest() != 0) + return 0; + + /* start new transaction region */ + rc = rte_xbegin(); + + /* transaction abort, possible write to one of wait addresses */ + if (rc != RTE_XBEGIN_STARTED) + return 0; + + /* + * the mere act of reading the lock status here adds the lock to + * the read set. This means that when we trigger a wakeup from another + * thread, even if we don't have a defined wakeup address and thus don't + * actually cause any writes, the act of locking our lock will itself + * trigger the wakeup and abort the transaction. + */ + rte_spinlock_is_locked(&s->lock); + + /* + * add all addresses to wait on into transaction read-set and check if + * any of wakeup conditions are already met. + */ + rc = 0; + for (i = 0; i < num; i++) { + const struct rte_power_monitor_cond *c = &pmc[i]; + + /* cannot be NULL */ + if (c->fn == NULL) { + rc = -EINVAL; + break; + } + + const uint64_t val = __get_umwait_val(c->addr, c->size); + + /* abort if callback indicates that we need to stop */ + if (c->fn(val, c->opaque) != 0) + break; + } + + /* none of the conditions were met, sleep until timeout */ + if (i == num) + rte_power_pause(tsc_timestamp); + + /* end transaction region */ + rte_xend(); + + return rc; +} From patchwork Tue Jun 29 15:48:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94983 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 82BFFA0C3F; Tue, 29 Jun 2021 17:49:53 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 27927411DD; Tue, 29 Jun 2021 17:49:34 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id BBF32411C3 for ; Tue, 29 Jun 2021 17:49:27 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304842" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304842" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:42 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213629" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:41 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:27 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 4/7] power: remove thread safety from PMD power API's X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list 
List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently, we expect that only one callback can be active at any given moment, for a particular queue configuration, which is relatively easy to implement in a thread-safe way. However, we're about to add support for multiple queues per lcore, which will greatly increase the possibility of various race conditions. We could have used something like an RCU for this use case, but absent of a pressing need for thread safety we'll go the easy way and just mandate that the API's are to be called when all affected ports are stopped, and document this limitation. This greatly simplifies the `rte_power_monitor`-related code. Signed-off-by: Anatoly Burakov --- Notes: v2: - Add check for stopped queue - Clarified doc message - Added release notes doc/guides/rel_notes/release_21_08.rst | 5 + lib/power/meson.build | 3 + lib/power/rte_power_pmd_mgmt.c | 133 ++++++++++--------------- lib/power/rte_power_pmd_mgmt.h | 6 ++ 4 files changed, 67 insertions(+), 80 deletions(-) diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst index 9d1cfac395..f015c509fc 100644 --- a/doc/guides/rel_notes/release_21_08.rst +++ b/doc/guides/rel_notes/release_21_08.rst @@ -88,6 +88,11 @@ API Changes * eal: the ``rte_power_intrinsics`` API changed to use a callback mechanism. +* rte_power: The experimental PMD power management API is no longer considered + to be thread safe; all Rx queues affected by the API will now need to be + stopped before making any changes to the power management scheme. + + ABI Changes ----------- diff --git a/lib/power/meson.build b/lib/power/meson.build index c1097d32f1..4f6a242364 100644 --- a/lib/power/meson.build +++ b/lib/power/meson.build @@ -21,4 +21,7 @@ headers = files( 'rte_power_pmd_mgmt.h', 'rte_power_guest_channel.h', ) +if cc.has_argument('-Wno-cast-qual') + cflags += '-Wno-cast-qual' +endif deps += ['timer', 'ethdev'] diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index db03cbf420..9b95cf1794 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -40,8 +40,6 @@ struct pmd_queue_cfg { /**< Callback mode for this queue */ const struct rte_eth_rxtx_callback *cur_cb; /**< Callback instance */ - volatile bool umwait_in_progress; - /**< are we currently sleeping? */ uint64_t empty_poll_stats; /**< Number of empty polls */ } __rte_cache_aligned; @@ -92,30 +90,11 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, struct rte_power_monitor_cond pmc; uint16_t ret; - /* - * we might get a cancellation request while being - * inside the callback, in which case the wakeup - * wouldn't work because it would've arrived too early. - * - * to get around this, we notify the other thread that - * we're sleeping, so that it can spin until we're done. - * unsolicited wakeups are perfectly safe. 
- */ - q_conf->umwait_in_progress = true; - - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - /* check if we need to cancel sleep */ - if (q_conf->pwr_mgmt_state == PMD_MGMT_ENABLED) { - /* use monitoring condition to sleep */ - ret = rte_eth_get_monitor_addr(port_id, qidx, - &pmc); - if (ret == 0) - rte_power_monitor(&pmc, UINT64_MAX); - } - q_conf->umwait_in_progress = false; - - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); + /* use monitoring condition to sleep */ + ret = rte_eth_get_monitor_addr(port_id, qidx, + &pmc); + if (ret == 0) + rte_power_monitor(&pmc, UINT64_MAX); } } else q_conf->empty_poll_stats = 0; @@ -177,12 +156,24 @@ clb_scale_freq(uint16_t port_id, uint16_t qidx, return nb_rx; } +static int +queue_stopped(const uint16_t port_id, const uint16_t queue_id) +{ + struct rte_eth_rxq_info qinfo; + + if (rte_eth_rx_queue_info_get(port_id, queue_id, &qinfo) < 0) + return -1; + + return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED; +} + int rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) { struct pmd_queue_cfg *queue_cfg; struct rte_eth_dev_info info; + rte_rx_callback_fn clb; int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); @@ -203,6 +194,14 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, goto end; } + /* check if the queue is stopped */ + ret = queue_stopped(port_id, queue_id); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + ret = ret < 0 ? -EINVAL : -EBUSY; + goto end; + } + queue_cfg = &port_cfg[port_id][queue_id]; if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { @@ -232,17 +231,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -ENOTSUP; goto end; } - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->umwait_in_progress = false; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* ensure we update our state before callback starts */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, - clb_umwait, NULL); + clb = clb_umwait; break; } case RTE_POWER_MGMT_TYPE_SCALE: @@ -269,16 +258,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -ENOTSUP; goto end; } - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* this is not necessary here, but do it anyway */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, - queue_id, clb_scale_freq, NULL); + clb = clb_scale_freq; break; } case RTE_POWER_MGMT_TYPE_PAUSE: @@ -286,18 +266,21 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, if (global_data.tsc_per_us == 0) calc_tsc(); - /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - - /* this is not necessary here, but do it anyway */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, - clb_pause, NULL); + clb = clb_pause; break; + default: + RTE_LOG(DEBUG, POWER, "Invalid power management type\n"); + ret = -EINVAL; + goto end; } + + /* initialize data before enabling the callback */ + queue_cfg->empty_poll_stats = 0; + queue_cfg->cb_mode = mode; + 
queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, + clb, NULL); + ret = 0; end: return ret; @@ -308,12 +291,20 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id) { struct pmd_queue_cfg *queue_cfg; + int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); if (lcore_id >= RTE_MAX_LCORE || queue_id >= RTE_MAX_QUEUES_PER_PORT) return -EINVAL; + /* check if the queue is stopped */ + ret = queue_stopped(port_id, queue_id); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + return ret < 0 ? -EINVAL : -EBUSY; + } + /* no need to check queue id as wrong queue id would not be enabled */ queue_cfg = &port_cfg[port_id][queue_id]; @@ -323,27 +314,8 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, /* stop any callbacks from progressing */ queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; - /* ensure we update our state before continuing */ - rte_atomic_thread_fence(__ATOMIC_SEQ_CST); - switch (queue_cfg->cb_mode) { - case RTE_POWER_MGMT_TYPE_MONITOR: - { - bool exit = false; - do { - /* - * we may request cancellation while the other thread - * has just entered the callback but hasn't started - * sleeping yet, so keep waking it up until we know it's - * done sleeping. - */ - if (queue_cfg->umwait_in_progress) - rte_power_monitor_wakeup(lcore_id); - else - exit = true; - } while (!exit); - } - /* fall-through */ + case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */ case RTE_POWER_MGMT_TYPE_PAUSE: rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cur_cb); @@ -356,10 +328,11 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, break; } /* - * we don't free the RX callback here because it is unsafe to do so - * unless we know for a fact that all data plane threads have stopped. + * the API doc mandates that the user stops all processing on affected + * ports before calling any of these API's, so we can assume that the + * callbacks can be freed. we're intentionally casting away const-ness. */ - queue_cfg->cur_cb = NULL; + rte_free((void *)queue_cfg->cur_cb); return 0; } diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h index 7a0ac24625..444e7b8a66 100644 --- a/lib/power/rte_power_pmd_mgmt.h +++ b/lib/power/rte_power_pmd_mgmt.h @@ -43,6 +43,9 @@ enum rte_power_pmd_mgmt_type { * * @note This function is not thread-safe. * + * @warning This function must be called when all affected Ethernet queues are + * stopped and no Rx/Tx is in progress! + * * @param lcore_id * The lcore the Rx queue will be polled from. * @param port_id @@ -69,6 +72,9 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, * * @note This function is not thread-safe. * + * @warning This function must be called when all affected Ethernet queues are + * stopped and no Rx/Tx is in progress! + * * @param lcore_id * The lcore the Rx queue is polled from. 
* @param port_id From patchwork Tue Jun 29 15:48:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94985 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id DC221A0C3F; Tue, 29 Jun 2021 17:50:07 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 65B5B411EB; Tue, 29 Jun 2021 17:49:36 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id EA2C3411C4 for ; Tue, 29 Jun 2021 17:49:27 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304850" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304850" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:45 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213643" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:43 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:28 +0000 Message-Id: <8f5d030a77aa2f0e95e9680cb911b4e8db30c879.1624981670.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 5/7] power: support callbacks for multiple Rx queues X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently, there is a hard limitation on the PMD power management support that only allows it to support a single queue per lcore. This is not ideal as most DPDK use cases will poll multiple queues per core. The PMD power management mechanism relies on ethdev Rx callbacks, so it is very difficult to implement such support because callbacks are effectively stateless and have no visibility into what the other ethdev devices are doing. This places limitations on what we can do within the framework of Rx callbacks, but the basics of this implementation are as follows: - Replace per-queue structures with per-lcore ones, so that any device polled from the same lcore can share data - Any queue that is going to be polled from a specific lcore has to be added to the list of queues to poll, so that the callback is aware of other queues being polled by the same lcore - Both the empty poll counter and the actual power saving mechanism is shared between all queues polled on a particular lcore, and is only activated when all queues in the list were polled and were determined to have no traffic. - The limitation on UMWAIT-based polling is not removed because UMWAIT is incapable of monitoring more than one address. Also, while we're at it, update and improve the docs. 
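To make the shared sleep decision easier to follow, here is a condensed sketch of the logic described above (names mirror the patch, but this is illustrative rather than the exact library code, which splits the checks across queue_reset(), queue_can_sleep() and lcore_can_sleep()):

static bool
lcore_may_sleep(struct pmd_core_cfg *lcore_conf,
		struct queue_list_entry *queue_conf, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic seen: reset this queue and the per-lcore counter */
		queue_conf->n_empty_polls = 0;
		lcore_conf->n_queues_ready_to_sleep = 0;
		return false;
	}
	/* empty poll: this queue is not yet past the threshold, no sleep */
	if (++queue_conf->n_empty_polls <= EMPTYPOLL_MAX)
		return false;
	/* queue is ready; sleep only once all queues on this lcore are ready */
	if (++lcore_conf->n_queues_ready_to_sleep != lcore_conf->n_queues)
		return false;
	/* we will sleep this iteration, so reset the per-lcore counter */
	lcore_conf->n_queues_ready_to_sleep = 0;
	return true;
}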
Signed-off-by: Anatoly Burakov --- Notes: v5: - Remove the "power save queue" API and replace it with mechanism suggested by Konstantin v3: - Move the list of supported NICs to NIC feature table v2: - Use a TAILQ for queues instead of a static array - Address feedback from Konstantin - Add additional checks for stopped queues doc/guides/nics/features.rst | 10 + doc/guides/prog_guide/power_man.rst | 65 ++-- doc/guides/rel_notes/release_21_08.rst | 3 + lib/power/rte_power_pmd_mgmt.c | 431 ++++++++++++++++++------- 4 files changed, 373 insertions(+), 136 deletions(-) diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst index 403c2b03a3..a96e12d155 100644 --- a/doc/guides/nics/features.rst +++ b/doc/guides/nics/features.rst @@ -912,6 +912,16 @@ Supports to get Rx/Tx packet burst mode information. * **[implements] eth_dev_ops**: ``rx_burst_mode_get``, ``tx_burst_mode_get``. * **[related] API**: ``rte_eth_rx_burst_mode_get()``, ``rte_eth_tx_burst_mode_get()``. +.. _nic_features_get_monitor_addr: + +PMD power management using monitor addresses +-------------------------------------------- + +Supports getting a monitoring condition to use together with Ethernet PMD power +management (see :doc:`../prog_guide/power_man` for more details). + +* **[implements] eth_dev_ops**: ``get_monitor_addr`` + .. _nic_features_other: Other dev ops not represented by a Feature diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index c70ae128ac..ec04a72108 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -198,34 +198,41 @@ Ethernet PMD Power Management API Abstract ~~~~~~~~ -Existing power management mechanisms require developers -to change application design or change code to make use of it. -The PMD power management API provides a convenient alternative -by utilizing Ethernet PMD RX callbacks, -and triggering power saving whenever empty poll count reaches a certain number. - -Monitor - This power saving scheme will put the CPU into optimized power state - and use the ``rte_power_monitor()`` function - to monitor the Ethernet PMD RX descriptor address, - and wake the CPU up whenever there's new traffic. - -Pause - This power saving scheme will avoid busy polling - by either entering power-optimized sleep state - with ``rte_power_pause()`` function, - or, if it's not available, use ``rte_pause()``. - -Frequency scaling - This power saving scheme will use ``librte_power`` library - functionality to scale the core frequency up/down - depending on traffic volume. - -.. note:: - - Currently, this power management API is limited to mandatory mapping - of 1 queue to 1 core (multiple queues are supported, - but they must be polled from different cores). +Existing power management mechanisms require developers to change application +design or change code to make use of it. The PMD power management API provides a +convenient alternative by utilizing Ethernet PMD RX callbacks, and triggering +power saving whenever empty poll count reaches a certain number. + +* Monitor + This power saving scheme will put the CPU into optimized power state and + monitor the Ethernet PMD RX descriptor address, waking the CPU up whenever + there's new traffic. Support for this scheme may not be available on all + platforms, and further limitations may apply (see below). 
+ +* Pause + This power saving scheme will avoid busy polling by either entering + power-optimized sleep state with ``rte_power_pause()`` function, or, if it's + not supported by the underlying platform, use ``rte_pause()``. + +* Frequency scaling + This power saving scheme will use ``librte_power`` library functionality to + scale the core frequency up/down depending on traffic volume. + +The "monitor" mode is only supported in the following configurations and scenarios: + +* If ``rte_cpu_get_intrinsics_support()`` function indicates that + ``rte_power_monitor()`` is supported by the platform, then monitoring will be + limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be + monitored from a different lcore). + +* If ``rte_cpu_get_intrinsics_support()`` function indicates that the + ``rte_power_monitor()`` function is not supported, then monitor mode will not + be supported. + +* Not all Ethernet drivers support monitoring, even if the underlying + platform may support the necessary CPU instructions. Please refer to + :doc:`../nics/overview` for more information. + API Overview for Ethernet PMD Power Management ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -242,3 +249,5 @@ References * The :doc:`../sample_app_ug/vm_power_management` chapter in the :doc:`../sample_app_ug/index` section. + +* The :doc:`../nics/overview` chapter in the :doc:`../nics/index` section diff --git a/doc/guides/rel_notes/release_21_08.rst b/doc/guides/rel_notes/release_21_08.rst index f015c509fc..3926d45ef8 100644 --- a/doc/guides/rel_notes/release_21_08.rst +++ b/doc/guides/rel_notes/release_21_08.rst @@ -57,6 +57,9 @@ New Features * eal: added ``rte_power_monitor_multi`` to support waiting for multiple events. +* rte_power: The experimental PMD power management API now supports managing + multiple Ethernet Rx queues per lcore. + Removed Items ------------- diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index 9b95cf1794..fccfd236c2 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -33,18 +33,96 @@ enum pmd_mgmt_state { PMD_MGMT_ENABLED }; -struct pmd_queue_cfg { +union queue { + uint32_t val; + struct { + uint16_t portid; + uint16_t qid; + }; +}; + +struct queue_list_entry { + TAILQ_ENTRY(queue_list_entry) next; + union queue queue; + uint64_t n_empty_polls; + const struct rte_eth_rxtx_callback *cb; +}; + +struct pmd_core_cfg { + TAILQ_HEAD(queue_list_head, queue_list_entry) head; + /**< List of queues associated with this lcore */ + size_t n_queues; + /**< How many queues are in the list? 
*/ volatile enum pmd_mgmt_state pwr_mgmt_state; /**< State of power management for this queue */ enum rte_power_pmd_mgmt_type cb_mode; /**< Callback mode for this queue */ - const struct rte_eth_rxtx_callback *cur_cb; - /**< Callback instance */ - uint64_t empty_poll_stats; - /**< Number of empty polls */ + uint64_t n_queues_ready_to_sleep; + /**< Number of queues ready to enter power optimized state */ } __rte_cache_aligned; +static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE]; -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT]; +static inline bool +queue_equal(const union queue *l, const union queue *r) +{ + return l->val == r->val; +} + +static inline void +queue_copy(union queue *dst, const union queue *src) +{ + dst->val = src->val; +} + +static struct queue_list_entry * +queue_list_find(const struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *cur; + + TAILQ_FOREACH(cur, &cfg->head, next) { + if (queue_equal(&cur->queue, q)) + return cur; + } + return NULL; +} + +static int +queue_list_add(struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *qle; + + /* is it already in the list? */ + if (queue_list_find(cfg, q) != NULL) + return -EEXIST; + + qle = malloc(sizeof(*qle)); + if (qle == NULL) + return -ENOMEM; + memset(qle, 0, sizeof(*qle)); + + queue_copy(&qle->queue, q); + TAILQ_INSERT_TAIL(&cfg->head, qle, next); + cfg->n_queues++; + qle->n_empty_polls = 0; + + return 0; +} + +static struct queue_list_entry * +queue_list_take(struct pmd_core_cfg *cfg, const union queue *q) +{ + struct queue_list_entry *found; + + found = queue_list_find(cfg, q); + if (found == NULL) + return NULL; + + TAILQ_REMOVE(&cfg->head, found, next); + cfg->n_queues--; + + /* freeing is responsibility of the caller */ + return found; +} static void calc_tsc(void) @@ -74,21 +152,56 @@ calc_tsc(void) } } +static inline void +queue_reset(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg) +{ + /* reset empty poll counter for this queue */ + qcfg->n_empty_polls = 0; + /* reset the sleep counter too */ + cfg->n_queues_ready_to_sleep = 0; +} + +static inline bool +queue_can_sleep(struct pmd_core_cfg *cfg, struct queue_list_entry *qcfg) +{ + /* this function is called - that means we have an empty poll */ + qcfg->n_empty_polls++; + + /* if we haven't reached threshold for empty polls, we can't sleep */ + if (qcfg->n_empty_polls <= EMPTYPOLL_MAX) + return false; + + /* we're ready to sleep */ + cfg->n_queues_ready_to_sleep++; + + return true; +} + +static inline bool +lcore_can_sleep(struct pmd_core_cfg *cfg) +{ + /* are all queues ready to sleep? 
*/ + if (cfg->n_queues_ready_to_sleep != cfg->n_queues) + return false; + + /* we've reached an iteration where we can sleep, reset sleep counter */ + cfg->n_queues_ready_to_sleep = 0; + + return true; +} + static uint16_t clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, - uint16_t nb_rx, uint16_t max_pkts __rte_unused, - void *addr __rte_unused) + uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg) { + struct queue_list_entry *queue_conf = arg; - struct pmd_queue_cfg *q_conf; - - q_conf = &port_cfg[port_id][qidx]; - + /* this callback can't do more than one queue, omit multiqueue logic */ if (unlikely(nb_rx == 0)) { - q_conf->empty_poll_stats++; - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { + queue_conf->n_empty_polls++; + if (unlikely(queue_conf->n_empty_polls > EMPTYPOLL_MAX)) { struct rte_power_monitor_cond pmc; - uint16_t ret; + int ret; /* use monitoring condition to sleep */ ret = rte_eth_get_monitor_addr(port_id, qidx, @@ -97,60 +210,77 @@ clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, rte_power_monitor(&pmc, UINT64_MAX); } } else - q_conf->empty_poll_stats = 0; + queue_conf->n_empty_polls = 0; return nb_rx; } static uint16_t -clb_pause(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, - uint16_t nb_rx, uint16_t max_pkts __rte_unused, - void *addr __rte_unused) +clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *arg) { - struct pmd_queue_cfg *q_conf; + const unsigned int lcore = rte_lcore_id(); + struct queue_list_entry *queue_conf = arg; + struct pmd_core_cfg *lcore_conf; + const bool empty = nb_rx == 0; - q_conf = &port_cfg[port_id][qidx]; + lcore_conf = &lcore_cfgs[lcore]; - if (unlikely(nb_rx == 0)) { - q_conf->empty_poll_stats++; - /* sleep for 1 microsecond */ - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) { - /* use tpause if we have it */ - if (global_data.intrinsics_support.power_pause) { - const uint64_t cur = rte_rdtsc(); - const uint64_t wait_tsc = - cur + global_data.tsc_per_us; - rte_power_pause(wait_tsc); - } else { - uint64_t i; - for (i = 0; i < global_data.pause_per_us; i++) - rte_pause(); - } + if (likely(!empty)) + /* early exit */ + queue_reset(lcore_conf, queue_conf); + else { + /* can this queue sleep? */ + if (!queue_can_sleep(lcore_conf, queue_conf)) + return nb_rx; + + /* can this lcore sleep? 
*/ + if (!lcore_can_sleep(lcore_conf)) + return nb_rx; + + /* sleep for 1 microsecond, use tpause if we have it */ + if (global_data.intrinsics_support.power_pause) { + const uint64_t cur = rte_rdtsc(); + const uint64_t wait_tsc = + cur + global_data.tsc_per_us; + rte_power_pause(wait_tsc); + } else { + uint64_t i; + for (i = 0; i < global_data.pause_per_us; i++) + rte_pause(); } - } else - q_conf->empty_poll_stats = 0; + } return nb_rx; } static uint16_t -clb_scale_freq(uint16_t port_id, uint16_t qidx, +clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, - uint16_t max_pkts __rte_unused, void *_ __rte_unused) + uint16_t max_pkts __rte_unused, void *arg) { - struct pmd_queue_cfg *q_conf; + const unsigned int lcore = rte_lcore_id(); + const bool empty = nb_rx == 0; + struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore]; + struct queue_list_entry *queue_conf = arg; - q_conf = &port_cfg[port_id][qidx]; + if (likely(!empty)) { + /* early exit */ + queue_reset(lcore_conf, queue_conf); - if (unlikely(nb_rx == 0)) { - q_conf->empty_poll_stats++; - if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) - /* scale down freq */ - rte_power_freq_min(rte_lcore_id()); - } else { - q_conf->empty_poll_stats = 0; - /* scale up freq */ + /* scale up freq immediately */ rte_power_freq_max(rte_lcore_id()); + } else { + /* can this queue sleep? */ + if (!queue_can_sleep(lcore_conf, queue_conf)) + return nb_rx; + + /* can this lcore sleep? */ + if (!lcore_can_sleep(lcore_conf)) + return nb_rx; + + rte_power_freq_min(rte_lcore_id()); } return nb_rx; @@ -167,11 +297,80 @@ queue_stopped(const uint16_t port_id, const uint16_t queue_id) return qinfo.queue_state == RTE_ETH_QUEUE_STATE_STOPPED; } +static int +cfg_queues_stopped(struct pmd_core_cfg *queue_cfg) +{ + const struct queue_list_entry *entry; + + TAILQ_FOREACH(entry, &queue_cfg->head, next) { + const union queue *q = &entry->queue; + int ret = queue_stopped(q->portid, q->qid); + if (ret != 1) + return ret; + } + return 1; +} + +static int +check_scale(unsigned int lcore) +{ + enum power_management_env env; + + /* only PSTATE and ACPI modes are supported */ + if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && + !rte_power_check_env_supported(PM_ENV_PSTATE_CPUFREQ)) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); + return -ENOTSUP; + } + /* ensure we could initialize the power library */ + if (rte_power_init(lcore)) + return -EINVAL; + + /* ensure we initialized the correct env */ + env = rte_power_get_env(); + if (env != PM_ENV_ACPI_CPUFREQ && env != PM_ENV_PSTATE_CPUFREQ) { + RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); + return -ENOTSUP; + } + + /* we're done */ + return 0; +} + +static int +check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) +{ + struct rte_power_monitor_cond dummy; + + /* check if rte_power_monitor is supported */ + if (!global_data.intrinsics_support.power_monitor) { + RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); + return -ENOTSUP; + } + + if (cfg->n_queues > 0) { + RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n"); + return -ENOTSUP; + } + + /* check if the device supports the necessary PMD API */ + if (rte_eth_get_monitor_addr(qdata->portid, qdata->qid, + &dummy) == -ENOTSUP) { + RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); + return -ENOTSUP; + } + + /* we're done */ + return 0; +} + int 
rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) { - struct pmd_queue_cfg *queue_cfg; + const union queue qdata = {.portid = port_id, .qid = queue_id}; + struct pmd_core_cfg *lcore_cfg; + struct queue_list_entry *queue_cfg; struct rte_eth_dev_info info; rte_rx_callback_fn clb; int ret; @@ -202,9 +401,19 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, goto end; } - queue_cfg = &port_cfg[port_id][queue_id]; + lcore_cfg = &lcore_cfgs[lcore_id]; - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED) { + /* check if other queues are stopped as well */ + ret = cfg_queues_stopped(lcore_cfg); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + ret = ret < 0 ? -EINVAL : -EBUSY; + goto end; + } + + /* if callback was already enabled, check current callback type */ + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED && + lcore_cfg->cb_mode != mode) { ret = -EINVAL; goto end; } @@ -214,53 +423,20 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, switch (mode) { case RTE_POWER_MGMT_TYPE_MONITOR: - { - struct rte_power_monitor_cond dummy; - - /* check if rte_power_monitor is supported */ - if (!global_data.intrinsics_support.power_monitor) { - RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); - ret = -ENOTSUP; + /* check if we can add a new queue */ + ret = check_monitor(lcore_cfg, &qdata); + if (ret < 0) goto end; - } - /* check if the device supports the necessary PMD API */ - if (rte_eth_get_monitor_addr(port_id, queue_id, - &dummy) == -ENOTSUP) { - RTE_LOG(DEBUG, POWER, "The device does not support rte_eth_get_monitor_addr\n"); - ret = -ENOTSUP; - goto end; - } clb = clb_umwait; break; - } case RTE_POWER_MGMT_TYPE_SCALE: - { - enum power_management_env env; - /* only PSTATE and ACPI modes are supported */ - if (!rte_power_check_env_supported(PM_ENV_ACPI_CPUFREQ) && - !rte_power_check_env_supported( - PM_ENV_PSTATE_CPUFREQ)) { - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes are supported\n"); - ret = -ENOTSUP; + /* check if we can add a new queue */ + ret = check_scale(lcore_id); + if (ret < 0) goto end; - } - /* ensure we could initialize the power library */ - if (rte_power_init(lcore_id)) { - ret = -EINVAL; - goto end; - } - /* ensure we initialized the correct env */ - env = rte_power_get_env(); - if (env != PM_ENV_ACPI_CPUFREQ && - env != PM_ENV_PSTATE_CPUFREQ) { - RTE_LOG(DEBUG, POWER, "Neither ACPI nor PSTATE modes were initialized\n"); - ret = -ENOTSUP; - goto end; - } clb = clb_scale_freq; break; - } case RTE_POWER_MGMT_TYPE_PAUSE: /* figure out various time-to-tsc conversions */ if (global_data.tsc_per_us == 0) @@ -273,13 +449,23 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, ret = -EINVAL; goto end; } + /* add this queue to the list */ + ret = queue_list_add(lcore_cfg, &qdata); + if (ret < 0) { + RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n", + strerror(-ret)); + goto end; + } + /* new queue is always added last */ + queue_cfg = TAILQ_LAST(&lcore_cfgs->head, queue_list_head); /* initialize data before enabling the callback */ - queue_cfg->empty_poll_stats = 0; - queue_cfg->cb_mode = mode; - queue_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; - queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, - clb, NULL); + if (lcore_cfg->n_queues == 1) { + lcore_cfg->cb_mode = mode; + lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED; + } + queue_cfg->cb = 
rte_eth_add_rx_callback(port_id, queue_id, + clb, queue_cfg); ret = 0; end: @@ -290,7 +476,9 @@ int rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id) { - struct pmd_queue_cfg *queue_cfg; + const union queue qdata = {.portid = port_id, .qid = queue_id}; + struct pmd_core_cfg *lcore_cfg; + struct queue_list_entry *queue_cfg; int ret; RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); @@ -306,24 +494,40 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, } /* no need to check queue id as wrong queue id would not be enabled */ - queue_cfg = &port_cfg[port_id][queue_id]; + lcore_cfg = &lcore_cfgs[lcore_id]; - if (queue_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) + /* check if other queues are stopped as well */ + ret = cfg_queues_stopped(lcore_cfg); + if (ret != 1) { + /* error means invalid queue, 0 means queue wasn't stopped */ + return ret < 0 ? -EINVAL : -EBUSY; + } + + if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED) return -EINVAL; - /* stop any callbacks from progressing */ - queue_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + /* + * There is no good/easy way to do this without race conditions, so we + * are just going to throw our hands in the air and hope that the user + * has read the documentation and has ensured that ports are stopped at + * the time we enter the API functions. + */ + queue_cfg = queue_list_take(lcore_cfg, &qdata); + if (queue_cfg == NULL) + return -ENOENT; - switch (queue_cfg->cb_mode) { + /* if we've removed all queues from the lists, set state to disabled */ + if (lcore_cfg->n_queues == 0) + lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED; + + switch (lcore_cfg->cb_mode) { case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */ case RTE_POWER_MGMT_TYPE_PAUSE: - rte_eth_remove_rx_callback(port_id, queue_id, - queue_cfg->cur_cb); + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb); break; case RTE_POWER_MGMT_TYPE_SCALE: rte_power_freq_max(lcore_id); - rte_eth_remove_rx_callback(port_id, queue_id, - queue_cfg->cur_cb); + rte_eth_remove_rx_callback(port_id, queue_id, queue_cfg->cb); rte_power_exit(lcore_id); break; } @@ -332,7 +536,18 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id, * ports before calling any of these API's, so we can assume that the * callbacks can be freed. we're intentionally casting away const-ness. 
*/ - rte_free((void *)queue_cfg->cur_cb); + rte_free((void *)queue_cfg->cb); + free(queue_cfg); return 0; } + +RTE_INIT(rte_power_ethdev_pmgmt_init) { + size_t i; + + /* initialize all tailqs */ + for (i = 0; i < RTE_DIM(lcore_cfgs); i++) { + struct pmd_core_cfg *cfg = &lcore_cfgs[i]; + TAILQ_INIT(&cfg->head); + } +} From patchwork Tue Jun 29 15:48:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94984 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id CCF59A0C3F; Tue, 29 Jun 2021 17:50:00 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 2EBBA411E3; Tue, 29 Jun 2021 17:49:35 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id 1C5E3411C0 for ; Tue, 29 Jun 2021 17:49:27 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304859" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304859" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:46 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213665" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:45 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:29 +0000 Message-Id: <351ec2cd10ee91e9497330447783ed8d26789aad.1624981670.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 6/7] power: support monitoring multiple Rx queues X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Use the new multi-monitor intrinsic to allow monitoring multiple ethdev Rx queues while entering the energy efficient power state. The multi version will be used unconditionally if supported, and the UMWAIT one will only be used when multi-monitor is not supported by the hardware. Signed-off-by: Anatoly Burakov --- Notes: v4: - Fix possible out of bounds access - Added missing index increment doc/guides/prog_guide/power_man.rst | 9 ++-- lib/power/rte_power_pmd_mgmt.c | 81 ++++++++++++++++++++++++++++- 2 files changed, 85 insertions(+), 5 deletions(-) diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst index ec04a72108..94353ca012 100644 --- a/doc/guides/prog_guide/power_man.rst +++ b/doc/guides/prog_guide/power_man.rst @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number. The "monitor" mode is only supported in the following configurations and scenarios: * If ``rte_cpu_get_intrinsics_support()`` function indicates that + ``rte_power_monitor_multi()`` function is supported by the platform, then + monitoring multiple Ethernet Rx queues for traffic will be supported. 
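The selection policy stated in the commit message above (use the multi-queue monitor unconditionally when the CPU provides it, otherwise fall back to single-queue UMWAIT) can also be probed from application code through the same ``rte_cpu_get_intrinsics_support()`` query this documentation refers to. A minimal sketch follows; the header name and struct fields are taken from the EAL CPU-flags API and are assumptions, as they are not shown in this patch:

#include <stdio.h>
#include <rte_cpuflags.h>

/* Sketch only: report which flavour of the power-monitor intrinsics the
 * platform offers, mirroring the choice the PMD power management library
 * makes when picking its Rx callback (clb_multiwait vs. clb_umwait).
 */
static void
report_monitor_support(void)
{
	struct rte_cpu_intrinsics intr;

	rte_cpu_get_intrinsics_support(&intr);

	if (intr.power_monitor_multi)
		printf("multi-queue monitoring is supported\n");
	else if (intr.power_monitor)
		printf("monitoring is limited to one Rx queue per lcore\n");
	else
		printf("\"monitor\" mode is not supported on this CPU\n");
}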
+ +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only ``rte_power_monitor()`` is supported by the platform, then monitoring will be limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be monitored from a different lcore). -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the - ``rte_power_monitor()`` function is not supported, then monitor mode will not - be supported. +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the + two monitoring functions are supported, then monitor mode will not be supported. * Not all Ethernet drivers support monitoring, even if the underlying platform may support the necessary CPU instructions. Please refer to diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c index fccfd236c2..2056996b9c 100644 --- a/lib/power/rte_power_pmd_mgmt.c +++ b/lib/power/rte_power_pmd_mgmt.c @@ -124,6 +124,32 @@ queue_list_take(struct pmd_core_cfg *cfg, const union queue *q) return found; } +static inline int +get_monitor_addresses(struct pmd_core_cfg *cfg, + struct rte_power_monitor_cond *pmc, size_t len) +{ + const struct queue_list_entry *qle; + size_t i = 0; + int ret; + + TAILQ_FOREACH(qle, &cfg->head, next) { + const union queue *q = &qle->queue; + struct rte_power_monitor_cond *cur; + + /* attempted out of bounds access */ + if (i >= len) { + RTE_LOG(ERR, POWER, "Too many queues being monitored\n"); + return -1; + } + + cur = &pmc[i++]; + ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur); + if (ret < 0) + return ret; + } + return 0; +} + static void calc_tsc(void) { @@ -190,6 +216,45 @@ lcore_can_sleep(struct pmd_core_cfg *cfg) return true; } +static uint16_t +clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused, + struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, + uint16_t max_pkts __rte_unused, void *arg) +{ + const unsigned int lcore = rte_lcore_id(); + struct queue_list_entry *queue_conf = arg; + struct pmd_core_cfg *lcore_conf; + const bool empty = nb_rx == 0; + + lcore_conf = &lcore_cfgs[lcore]; + + /* early exit */ + if (likely(!empty)) + /* early exit */ + queue_reset(lcore_conf, queue_conf); + else { + struct rte_power_monitor_cond pmc[RTE_MAX_ETHPORTS]; + int ret; + + /* can this queue sleep? */ + if (!queue_can_sleep(lcore_conf, queue_conf)) + return nb_rx; + + /* can this lcore sleep? */ + if (!lcore_can_sleep(lcore_conf)) + return nb_rx; + + /* gather all monitoring conditions */ + ret = get_monitor_addresses(lcore_conf, pmc, RTE_DIM(pmc)); + if (ret < 0) + return nb_rx; + + rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX); + } + + return nb_rx; +} + static uint16_t clb_umwait(uint16_t port_id, uint16_t qidx, struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx, uint16_t max_pkts __rte_unused, void *arg) @@ -341,14 +406,19 @@ static int check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) { struct rte_power_monitor_cond dummy; + bool multimonitor_supported; /* check if rte_power_monitor is supported */ if (!global_data.intrinsics_support.power_monitor) { RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n"); return -ENOTSUP; } + /* check if multi-monitor is supported */ + multimonitor_supported = + global_data.intrinsics_support.power_monitor_multi; - if (cfg->n_queues > 0) { + /* if we're adding a new queue, do we support multiple queues? 
*/ + if (cfg->n_queues > 0 && !multimonitor_supported) { RTE_LOG(DEBUG, POWER, "Monitoring multiple queues is not supported\n"); return -ENOTSUP; } @@ -364,6 +434,13 @@ check_monitor(struct pmd_core_cfg *cfg, const union queue *qdata) return 0; } +static inline rte_rx_callback_fn +get_monitor_callback(void) +{ + return global_data.intrinsics_support.power_monitor_multi ? + clb_multiwait : clb_umwait; +} + int rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, uint16_t queue_id, enum rte_power_pmd_mgmt_type mode) @@ -428,7 +505,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id, if (ret < 0) goto end; - clb = clb_umwait; + clb = get_monitor_callback(); break; case RTE_POWER_MGMT_TYPE_SCALE: /* check if we can add a new queue */ From patchwork Tue Jun 29 15:48:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 94986 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 04995A0C3F; Tue, 29 Jun 2021 17:50:15 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9647F411F3; Tue, 29 Jun 2021 17:49:37 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by mails.dpdk.org (Postfix) with ESMTP id C0916411C0 for ; Tue, 29 Jun 2021 17:49:28 +0200 (CEST) X-IronPort-AV: E=McAfee;i="6200,9189,10030"; a="269304867" X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="269304867" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Jun 2021 08:48:48 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.83,309,1616482800"; d="scan'208";a="408213671" Received: from silpixa00399498.ir.intel.com (HELO silpixa00399498.ger.corp.intel.com) ([10.237.223.53]) by orsmga006.jf.intel.com with ESMTP; 29 Jun 2021 08:48:47 -0700 From: Anatoly Burakov To: dev@dpdk.org, David Hunt Cc: konstantin.ananyev@intel.com, ciara.loftus@intel.com Date: Tue, 29 Jun 2021 15:48:30 +0000 Message-Id: <64f9ccc2dbe97be95ea4ccfeae87a0f8fef041b8.1624981670.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Subject: [dpdk-dev] [PATCH v5 7/7] l3fwd-power: support multiqueue in PMD pmgmt modes X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently, l3fwd-power enforces the limitation of having one queue per lcore. This is no longer necessary, so remove the limitation. Signed-off-by: Anatoly Burakov --- examples/l3fwd-power/main.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c index f8dfed1634..52f56dc405 100644 --- a/examples/l3fwd-power/main.c +++ b/examples/l3fwd-power/main.c @@ -2723,12 +2723,6 @@ main(int argc, char **argv) printf("\nInitializing rx queues on lcore %u ... 
", lcore_id ); fflush(stdout); - /* PMD power management mode can only do 1 queue per core */ - if (app_mode == APP_MODE_PMD_MGMT && qconf->n_rx_queue > 1) { - rte_exit(EXIT_FAILURE, - "In PMD power management mode, only one queue per lcore is allowed\n"); - } - /* init RX queues */ for(queue = 0; queue < qconf->n_rx_queue; ++queue) { struct rte_eth_rxconf rxq_conf;