From patchwork Mon Feb 24 11:35:09 2020
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: olivier.matz@6wind.com, Konstantin Ananyev
Date: Mon, 24 Feb 2020 11:35:09 +0000
Message-Id: <20200224113515.1744-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring

Upfront note - this RFC is not a complete patch. It introduces an ABI
breakage, doesn't update the ring_elem code properly, etc. I plan to deal
with all these things in later versions. Right now I am seeking initial
feedback on the proposed ideas. I would also ask people to repeat the
performance tests (see below) on their platforms to confirm the impact.

More and more customers use (or try to use) DPDK-based apps within
overcommitted systems (multiple active threads over the same physical
cores): VM and container deployments, etc. One quite common problem they
hit is Lock-Holder-Preemption (LHP) with rte_ring. LHP is a common problem
for spin-based sync primitives (spin-locks, etc.) on overcommitted systems.
The situation gets much worse when some sort of fair-locking technique is
used (ticket-lock, etc.), as then not only the lock-owner's but also the
lock-waiters' scheduling order matters a lot. This is a well-known problem
for kernels running within VMs:

http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf

The problem with rte_ring is that while head acquisition is sort of unfair
locking, waiting on the tail is very similar to a ticket-lock schema - the
tail has to be updated in a particular order. That makes the current
rte_ring implementation perform really poorly in some overcommitted
scenarios. While it is probably not possible to completely resolve this
problem in userspace only (without some kernel communication/intervention),
removing fairness in the tail update can mitigate it significantly.

So this RFC proposes two new optional ring synchronization modes:

1) Head/Tail Sync (HTS) mode
In that mode an enqueue/dequeue operation is fully serialized: only one
thread at a time is allowed to perform a given op.
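The actual implementation lives in rte_ring_hts_generic.h (patch 5 of this
series). Below is only a rough sketch of the idea in C11 atomics, with
names of my own invention rather than the patch code: keep head and tail
in one 64-bit word, so that a thread claims a whole op with a single CAS
and no new op can start while another is in flight.

#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical HTS head/tail word (illustrative, not the patch code).
 * Keeping both positions in one 64-bit value lets a thread claim an
 * entire enqueue/dequeue with a single CAS.
 */
union hts_pos {
        uint64_t raw;
        struct {
                uint32_t tail;  /* ops completed up to this position */
                uint32_t head;  /* ops claimed up to this position */
        } pos;
};

/*
 * Claim n slots for an enqueue/dequeue. A new op may start only when
 * no other op is in flight (head == tail), which is what serializes
 * the ring: at most one thread at a time owns the claimed region.
 * Free-/used-entry checks are omitted for brevity.
 */
static inline void
hts_move_head(_Atomic uint64_t *ht, uint32_t n, uint32_t *old_head)
{
        union hts_pos op, np;

        op.raw = atomic_load_explicit(ht, memory_order_acquire);
        do {
                /* wait until the previous op has fully completed */
                while (op.pos.head != op.pos.tail)
                        op.raw = atomic_load_explicit(ht,
                                        memory_order_acquire);

                np.pos.tail = op.pos.tail;
                np.pos.head = op.pos.head + n;
        } while (!atomic_compare_exchange_strong_explicit(ht,
                        &op.raw, np.raw,
                        memory_order_acquire, memory_order_acquire));

        *old_head = op.pos.head;
}

/*
 * Finish phase: publish the new tail after the objects were copied.
 * Between hts_move_head() and this call the thread exclusively owns
 * slots [old_head, old_head + n) - nobody else can even move head -
 * which is also what makes a split start/finish ("peek") API safe.
 * Storing the whole word is fine: head cannot have moved meanwhile.
 */
static inline void
hts_update_tail(_Atomic uint64_t *ht, uint32_t old_head, uint32_t n)
{
        union hts_pos np;

        np.pos.head = old_head + n;
        np.pos.tail = old_head + n;
        atomic_store_explicit(ht, np.raw, memory_order_release);
}

The cost is that producers (or consumers) are serialized; the benefit on
overcommitted systems is that the waiters are unordered, so there is no
ticket-lock-style dependency on the waiters' scheduling order.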
As another enhancement, that mode provides the ability to split an
enqueue/dequeue operation into two phases:
  - enqueue/dequeue start
  - enqueue/dequeue finish
That allows the user to inspect objects in the ring without removing them
from it (aka MT-safe peek).

2) Relaxed Tail Sync (RTS) mode
The main difference from the original MP/MC algorithm is that the tail
value is increased not by every thread that finishes an enqueue/dequeue,
but only by the last one. That allows threads to avoid spinning on the
ring tail value, leaving the actual tail value change to the last thread
in the update queue.
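Again, the real code is in rte_ring_rts_generic.h; the sketch below only
illustrates the tail-update idea, under the assumption (mine, not stated
above) that head and tail each carry a {position, op-counter} pair packed
into a 64-bit word, with the head side bumping its counter once per
started op.

#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical RTS position/counter pair (illustrative names): "pos"
 * is the ring position, "cnt" counts started (head) or finished
 * (tail) operations.
 */
union rts_poscnt {
        uint64_t raw;
        struct {
                uint32_t cnt;   /* number of started/finished ops */
                uint32_t pos;   /* ring position */
        } val;
};

/*
 * Finish phase of an RTS enqueue/dequeue. Every finishing thread
 * bumps tail.cnt, but tail.pos moves forward only when tail.cnt
 * catches up with head.cnt - i.e. only the last op still in flight
 * advances the tail, and nobody spins waiting for threads that
 * claimed earlier slots.
 */
static inline void
rts_update_tail(_Atomic uint64_t *tail, _Atomic uint64_t *head)
{
        union rts_poscnt h, ot, nt;

        ot.raw = atomic_load_explicit(tail, memory_order_acquire);

        do {
                h.raw = atomic_load_explicit(head, memory_order_relaxed);

                nt.raw = ot.raw;
                /* we are the last in-flight op: drag tail up to head */
                if (++nt.val.cnt == h.val.cnt)
                        nt.val.pos = h.val.pos;

        } while (!atomic_compare_exchange_strong_explicit(tail,
                        &ot.raw, nt.raw,
                        memory_order_release, memory_order_acquire));
}

A consequence of relaxing the tail update is that the head can run ahead
of a stale tail while ops are still in flight, so an implementation also
has to bound the maximum head/tail distance.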
Test results on IA (see below) show significant improvements in average
enqueue/dequeue op times on overcommitted systems. For 'classic' DPDK
deployments (one thread per core) the original MP/MC algorithm still shows
the best numbers, though for 64-bit targets the RTS numbers are not that
far away. The numbers were produced by ring_stress_*autotest (the first
patch in this series).

X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

                                                      MP/MC       HTS       RTS
1thread@1core(--lcores=6-7)                            8.00      8.15      8.99
2thread@2core(--lcores=6-8)                           19.14     19.61     20.35
4thread@4core(--lcores=6-10)                          29.43     29.79     31.82
8thread@8core(--lcores=6-14)                         110.59    192.81    119.50
16thread@16core(--lcores=6-22)                       461.03    813.12    495.59
32thread@32core(--lcores='6-22,55-70')               982.90   1972.38   1160.51

2thread@1core(--lcores='6,(10-11)@7')              20140.50     23.58     25.14
4thread@2core(--lcores='6,(10-11)@7,(20-21)@8')   153680.60     76.88     80.05
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')   280314.32    294.72    318.79
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')  643176.59   1144.02   1175.14
32thread@2core(--lcores='6,(10-25)@7,(30-45)@8') 4264238.80   4627.48   4892.68

8thread@2core(--lcores='6,(10-17)@(7,8)')         321085.98    298.59    307.47
16thread@4core(--lcores='6,(20-35)@(7-10)')      1900705.61    575.35    678.29
32thread@4core(--lcores='6,(20-51)@(7-10)')      5510445.85   2164.36   2714.12

i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj

                                                      MP/MC       HTS       RTS
1thread@1core(--lcores=6-7)                            7.85     12.13     11.31
2thread@2core(--lcores=6-8)                           17.89     24.52     21.86
8thread@8core(--lcores=6-14)                          32.58    354.20     54.58
32thread@32core(--lcores='6-22,55-70')               813.77   6072.41   2169.91

2thread@1core(--lcores='6,(10-11)@7')              16095.00     36.06     34.74
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')  1140354.54    346.61    361.57
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8') 1920417.86   1314.90   1416.65

8thread@2core(--lcores='6,(10-17)@(7,8)')         594358.61    332.70    357.74
32thread@4core(--lcores='6,(20-51)@(7-10)')      5319896.86   2836.44   3028.87

Konstantin Ananyev (6):
  test/ring: add contention stress test
  ring: rework ring layout to allow new sync schemes
  ring: introduce RTS ring mode
  test/ring: add contention stress test for RTS ring
  ring: introduce HTS ring mode
  test/ring: add contention stress test for HTS ring

 app/test/Makefile                      |    3 +
 app/test/meson.build                   |    3 +
 app/test/test_pdump.c                  |    6 +-
 app/test/test_ring_hts_stress.c        |   28 ++
 app/test/test_ring_rts_stress.c        |   28 ++
 app/test/test_ring_stress.c            |   27 ++
 app/test/test_ring_stress.h            |  477 +++++++++++++++++++
 lib/librte_pdump/rte_pdump.c           |    2 +-
 lib/librte_port/rte_port_ring.c        |   12 +-
 lib/librte_ring/Makefile               |    4 +-
 lib/librte_ring/meson.build            |    4 +-
 lib/librte_ring/rte_ring.c             |   84 +++-
 lib/librte_ring/rte_ring.h             |  619 +++++++++++++++++++++++--
 lib/librte_ring/rte_ring_elem.h        |    8 +-
 lib/librte_ring/rte_ring_hts_generic.h |  228 +++++++++
 lib/librte_ring/rte_ring_rts_generic.h |  240 ++++++++++
 16 files changed, 1721 insertions(+), 52 deletions(-)
 create mode 100644 app/test/test_ring_hts_stress.c
 create mode 100644 app/test/test_ring_rts_stress.c
 create mode 100644 app/test/test_ring_stress.c
 create mode 100644 app/test/test_ring_stress.h
 create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
 create mode 100644 lib/librte_ring/rte_ring_rts_generic.h