From patchwork Fri Apr 21 19:16:42 2023
X-Patchwork-Submitter: Wathsala Wathawana Vithanage
X-Patchwork-Id: 126416
X-Patchwork-Delegate: thomas@monjalon.net
From: Wathsala Vithanage
To: honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru,
 feifei.wang2@arm.com
Cc: dev@dpdk.org, nd@arm.com, Wathsala Vithanage
Subject: [RFC] ring: improve ring performance with C11 atomics
Date: Fri, 21 Apr 2023 19:16:42 +0000
Message-Id: <20230421191642.217011-1-wathsala.vithanage@arm.com>
X-Mailer: git-send-email 2.25.1
List-Id: DPDK patches and discussions

The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED. To calculate the
addresses of the dequeued elements, __rte_ring_dequeue_elems uses the
old_head value returned by the __atomic_compare_exchange_n intrinsic in
__rte_ring_move_cons_head. This creates an address dependency between the
two operations, so __rte_ring_dequeue_elems cannot be reordered before
__rte_ring_move_cons_head. For the same reason, __rte_ring_enqueue_elems
cannot be reordered before __rte_ring_move_prod_head.
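
For reference, the sketch below is a minimal, self-contained illustration of
the dependency chain described above, written with C11 <stdatomic.h>. It is
not the DPDK implementation; the names demo_ring and demo_dequeue, the fixed
slot array, and the simplified head/tail layout are all hypothetical. It only
marks where the relaxed tail load, the head CAS, and the old_head-derived
element accesses sit relative to each other.

#include <stdint.h>
#include <stdatomic.h>

struct demo_ring {
	_Atomic uint32_t cons_head;
	_Atomic uint32_t prod_tail;
	uint32_t mask;			/* ring size - 1, size is a power of two */
	uint32_t slots[1024];
};

static uint32_t
demo_dequeue(struct demo_ring *r, uint32_t *obj, uint32_t num)
{
	uint32_t old_head, new_head, entries, n;

	do {
		old_head = atomic_load_explicit(&r->cons_head,
						memory_order_relaxed);
		/* Ensure the head is read before the tail. */
		atomic_thread_fence(memory_order_acquire);
		/* The tail load the patch relaxes from acquire. */
		entries = atomic_load_explicit(&r->prod_tail,
					       memory_order_relaxed) - old_head;
		n = (num <= entries) ? num : entries;
		if (n == 0)
			return 0;
		new_head = old_head + n;
	} while (!atomic_compare_exchange_strong_explicit(&r->cons_head,
			&old_head, new_head,
			memory_order_relaxed, memory_order_relaxed));

	/*
	 * The slot addresses are computed from old_head, the value the CAS
	 * loop above settled on; this address dependency is what the commit
	 * message relies on to keep these loads after the head move.
	 */
	for (uint32_t i = 0; i < n; i++)
		obj[i] = r->slots[(old_head + i) & r->mask];

	return n;
}
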
Performance on Arm N1
Gain relative to generic implementation

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 8 (Arm N1)                  |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 766730 | Total count: 651686  | Total count: 812125  |
|                     | Gain: -15%           | Gain: 6%             |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 32 (Arm N1)                 |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 816745 | Total count: 646385  | Total count: 830935  |
|                     | Gain: -21%           | Gain: 2%             |
+-------------------------------------------------------------------+

Performance on x86-64 Cascade Lake
Gain relative to generic implementation

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 8                           |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 181640 | Total count: 181995  | Total count: 182791  |
|                     | Gain: 0.2%           | Gain: 0.6%           |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|         Bulk enq/dequeue count on size 32                          |
+-------------------------------------------------------------------+
| Generic             | C11 atomics          | C11 atomics improved |
+-------------------------------------------------------------------+
| Total count: 167495 | Total count: 161536  | Total count: 163190  |
|                     | Gain: -3.5%          | Gain: -2.6%          |
+-------------------------------------------------------------------+

Signed-off-by: Wathsala Vithanage
Reviewed-by: Honnappa Nagarahalli
Reviewed-by: Feifei Wang
Acked-by: Morten Brørup
---
 .mailmap                    |  1 +
 lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4018f0fc47..367115d134 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1430,6 +1430,7 @@ Walter Heymans
 Wang Sheng-Hui
 Wangyu (Eric)
 Waterman Cao
+Wathsala Vithanage
 Weichun Chen
 Wei Dai
 Weifeng Li
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index f895950df4..1895f2bb0e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 	if (!single)
 		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
 
+	/*
+	 * Updating of ht->tail cannot happen before elements are added to or
+	 * removed from the ring, as it could result in data races between
+	 * producer and consumer threads. Therefore ht->tail should be updated
+	 * with release semantics to prevent ring data copy phase from sinking
+	 * below it.
+	 */
 	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
 }
 
@@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		cons_tail = __atomic_load_n(&r->cons.tail,
-					__ATOMIC_ACQUIRE);
+					__ATOMIC_RELAXED);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
-		 */
 		prod_tail = __atomic_load_n(&r->prod.tail,
-					__ATOMIC_ACQUIRE);
-
+					__ATOMIC_RELAXED);
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
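
As a companion to the comment the diff adds in __rte_ring_update_tail, here
is a stand-alone C11 sketch of why the tail must be published with a release
store: it keeps the preceding ring-slot copy from sinking below the point
where the peer thread can observe the new tail. This is not the DPDK code;
demo_update_tail is a hypothetical name and the plain spin loop stands in
for rte_wait_until_equal_32().

#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical stand-in for __rte_ring_update_tail(). */
static void
demo_update_tail(_Atomic uint32_t *tail, uint32_t old_val, uint32_t new_val,
		int single)
{
	/*
	 * Multi-producer/consumer case: wait until threads that moved the
	 * head earlier have published their own tail updates.
	 */
	if (!single)
		while (atomic_load_explicit(tail, memory_order_relaxed)
				!= old_val)
			;

	/*
	 * Release store: all slot reads/writes issued before this point are
	 * ordered before the tail update, so a peer that observes new_val
	 * also observes the completed slot accesses.
	 */
	atomic_store_explicit(tail, new_val, memory_order_release);
}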