From patchwork Fri Apr 21 19:16:42 2023
X-Patchwork-Submitter: Wathsala Wathawana Vithanage
X-Patchwork-Id: 126416
X-Patchwork-Delegate: thomas@monjalon.net
From: Wathsala Vithanage
To: honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru,
 feifei.wang2@arm.com
Cc: dev@dpdk.org, nd@arm.com, Wathsala Vithanage
Subject: [RFC] ring: improve ring performance with C11 atomics
Date: Fri, 21 Apr 2023 19:16:42 +0000
Message-Id: <20230421191642.217011-1-wathsala.vithanage@arm.com>
X-Mailer: git-send-email 2.25.1
List-Id: DPDK patches and discussions

The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED. To calculate the
addresses of the dequeued elements, __rte_ring_dequeue_elems uses the
old_head value returned by the __atomic_compare_exchange_n intrinsic in
__rte_ring_move_cons_head. This creates an address dependency between the
two operations, so __rte_ring_dequeue_elems cannot be reordered before
__rte_ring_move_cons_head. For the same reason, __rte_ring_enqueue_elems
cannot be reordered before __rte_ring_move_prod_head.
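
For reference, the sketch below is a minimal, self-contained illustration of
the dependency chain described above, written with C11 <stdatomic.h>. It is
not the DPDK implementation; the names demo_ring and demo_dequeue, the fixed
slot array, and the simplified head/tail layout are all hypothetical. It only
marks where the relaxed tail load, the head CAS, and the old_head-derived
element accesses sit relative to each other.

#include <stdint.h>
#include <stdatomic.h>

struct demo_ring {
	_Atomic uint32_t cons_head;
	_Atomic uint32_t prod_tail;
	uint32_t mask;			/* ring size - 1, size is a power of two */
	uint32_t slots[1024];
};

static uint32_t
demo_dequeue(struct demo_ring *r, uint32_t *obj, uint32_t num)
{
	uint32_t old_head, new_head, entries, n;

	do {
		old_head = atomic_load_explicit(&r->cons_head,
						memory_order_relaxed);
		/* Ensure the head is read before the tail. */
		atomic_thread_fence(memory_order_acquire);
		/* The tail load the patch relaxes from acquire. */
		entries = atomic_load_explicit(&r->prod_tail,
					       memory_order_relaxed) - old_head;
		n = (num <= entries) ? num : entries;
		if (n == 0)
			return 0;
		new_head = old_head + n;
	} while (!atomic_compare_exchange_strong_explicit(&r->cons_head,
			&old_head, new_head,
			memory_order_relaxed, memory_order_relaxed));

	/*
	 * The slot addresses are computed from old_head, the value the CAS
	 * loop above settled on; this address dependency is what the commit
	 * message relies on to keep these loads after the head move.
	 */
	for (uint32_t i = 0; i < n; i++)
		obj[i] = r->slots[(old_head + i) & r->mask];

	return n;
}
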
Performance on Arm N1
Gain relative to generic implementation

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 8 (Arm N1)                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 812125  |
|                     | Gain: -15%           | Gain: 6%             |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 32 (Arm N1)                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 830935  |
|                     | Gain: -21%           | Gain: 2%             |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 8                           |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995  | Total count: 182791  |
|                     | Gain: 0.2%           | Gain: 0.6%           |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 32                          |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536  | Total count: 163190  |
|                     | Gain: -3.5%          | Gain: -2.6%          |
+-------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage
Reviewed-by: Honnappa Nagarahalli
Reviewed-by: Feifei Wang
Acked-by: Morten Brørup
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans
 Wang Sheng-Hui
 Wangyu (Eric)
 Waterman Cao
+Wathsala Vithanage
 Weichun Chen
 Wei Dai
 Weifeng Li
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent ring data copy phase from sinking
+	 * below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
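
As a companion to the comment the diff adds in __rte_ring_update_tail, here
is a stand-alone C11 sketch of why the tail must be published with a release
store: it keeps the preceding ring-slot copy from sinking below the point
where the peer thread can observe the new tail. This is not the DPDK code;
demo_update_tail is a hypothetical name and the plain spin loop stands in
for rte_wait_until_equal_32().

#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical stand-in for __rte_ring_update_tail(). */
static void
demo_update_tail(_Atomic uint32_t *tail, uint32_t old_val, uint32_t new_val,
		int single)
{
	/*
	 * Multi-producer/consumer case: wait until threads that moved the
	 * head earlier have published their own tail updates.
	 */
	if (!single)
		while (atomic_load_explicit(tail, memory_order_relaxed)
				!= old_val)
			;

	/*
	 * Release store: all slot reads/writes issued before this point are
	 * ordered before the tail update, so a peer that observes new_val
	 * also observes the completed slot accesses.
	 */
	atomic_store_explicit(tail, new_val, memory_order_release);
}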