[RFC,v2,0/2] Add high-performance timer facility
Message ID | 20230315170342.214127-1-mattias.ronnblom@ericsson.com (mailing list archive) |
---|---|
Message
Mattias Rönnblom
March 15, 2023, 5:03 p.m. UTC
This patchset is an attempt to introduce a high-performance, highly scalable timer facility into DPDK.

More specifically, the goals for the htimer library are:

* Efficient handling of anything from a handful up to hundreds of thousands of concurrent timers.
* Make adding and canceling timers low-overhead, constant-time operations.
* Provide a service functionally equivalent to that of <rte_timer.h>. API/ABI backward compatibility is secondary.

In the author's opinion, there are two main shortcomings with the current DPDK timer library (i.e., rte_timer.[ch]).

One is the synchronization overhead, where heavy-weight full-barrier type synchronization is used. rte_timer.c uses per-EAL/lcore skip lists, but any thread may add or cancel (or otherwise access) timers managed by another lcore (and thus residing in that lcore's timer skip list).

The other is an algorithmic shortcoming, with rte_timer.c's reliance on a skip list, which is less efficient than certain alternatives.

This patchset implements a hierarchical timer wheel (HTW, in rte_htw.c), as per the Varghese and Lauck paper "Hashed and Hierarchical Timing Wheels: Data Structures for the Efficient Implementation of a Timer Facility". An HTW is a data structure purpose-built for this task, and is used by many operating system kernel timer facilities.

To further improve the solution described by Varghese and Lauck, a bitset is placed in front of each of the timer wheels in the HTW, reducing the overhead of rte_htimer_mgr_manage() (i.e., progressing time and expiry processing).

Cycle-efficient scanning and manipulation of these bitsets are crucial for the HTW's performance.

The htimer module keeps a per-lcore (or per-registered EAL thread) HTW instance, much like rte_timer.c keeps a per-lcore skip list.

To avoid expensive synchronization overhead for thread-local timer management, the HTWs are accessed only from the "owning" thread. Any interaction any other thread has with a particular lcore's timer wheel goes over a set of DPDK rings.
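To make the bitset-in-front-of-the-wheel idea concrete, here is a minimal, self-contained sketch of a single-level wheel fronted by one 64-bit occupancy word. It is not the rte_htw.c implementation: the toy_* names, the 64-slot size and the single level are assumptions made purely for illustration, and __builtin_ctzll() is a GCC/Clang builtin rather than anything DPDK-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 64u /* hypothetical size: one 64-bit word fronts the wheel */

struct toy_timer {
	struct toy_timer *next;
	uint64_t expiry; /* absolute expiry time, in ticks */
};

struct toy_wheel {
	uint64_t now;      /* last tick processed */
	uint64_t occupied; /* bit i is set iff slots[i] is non-empty */
	struct toy_timer *slots[TOY_SLOTS];
};

static void
toy_wheel_init(struct toy_wheel *w, uint64_t now)
{
	memset(w, 0, sizeof(*w));
	w->now = now;
}

/* Constant time: index the slot, push onto its list, set one bit. */
static void
toy_wheel_add(struct toy_wheel *w, struct toy_timer *t, uint64_t expiry)
{
	unsigned int slot = (unsigned int)(expiry % TOY_SLOTS);

	/* A single-level wheel only covers the next TOY_SLOTS ticks. */
	assert(expiry > w->now && expiry - w->now <= TOY_SLOTS);

	t->expiry = expiry;
	t->next = w->slots[slot];
	w->slots[slot] = t;
	w->occupied |= UINT64_C(1) << slot;
}

/*
 * Advance time to new_now. Expired timers are chained through 'next',
 * starting at *expired, ordered by expiry tick. Returns their count.
 */
static unsigned int
toy_wheel_advance(struct toy_wheel *w, uint64_t new_now,
		  struct toy_timer **expired)
{
	uint64_t span = new_now - w->now;
	unsigned int start = (unsigned int)((w->now + 1) % TOY_SLOTS);
	/* Rotate right so that bit k corresponds to tick now + 1 + k. */
	uint64_t due = (w->occupied >> start) |
		(w->occupied << ((TOY_SLOTS - start) % TOY_SLOTS));
	struct toy_timer **tail = expired;
	unsigned int count = 0;

	assert(new_now >= w->now);

	/* Ignore slots belonging to ticks beyond new_now. */
	if (span < TOY_SLOTS)
		due &= (UINT64_C(1) << span) - 1;

	/* Visit only occupied slots; empty ones are skipped for free. */
	while (due != 0) {
		unsigned int k = (unsigned int)__builtin_ctzll(due);
		unsigned int slot = (start + k) % TOY_SLOTS;

		due &= due - 1; /* clear the lowest set bit */

		*tail = w->slots[slot];
		while (*tail != NULL) {
			tail = &(*tail)->next;
			count++;
		}
		w->slots[slot] = NULL;
		w->occupied &= ~(UINT64_C(1) << slot);
	}
	*tail = NULL;
	w->now = new_now;

	return count;
}
```

In this sketch, adding a timer touches one slot and one bit. Advancing time rotates the occupancy word so that bit 0 corresponds to the next tick, masks off the ticks not yet reached, and then visits only the set bits, so the cost follows the number of occupied slots rather than the number of ticks elapsed. A hierarchical wheel would add coarser levels and cascade timers downwards, which is left out here.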
A side-effect of the ring-based design is that all operations working toward a "remote" HTW must be asynchronous.

The <rte_htimer.h> API is available only to EAL threads and registered non-EAL threads.

The htimer API allows the application to supply the current time, useful in case it already has retrieved this for other purposes, saving the cost of a rdtsc instruction (or its equivalent).

Relative htimers do not retrieve a new time, but reuse the current time (as known via, and at the time of, the latest manage call), again to shave off some cycles of overhead.

A semantic improvement compared to the <rte_timer.h> API is that the htimer library can give a definite answer to the question of whether the timer expiry callback was called, after a timer has been canceled.

The patchset includes a performance test case 'timer_htimer_htw_perf_autotest', which compares rte_timer, rte_htimer and rte_htw timers in the same scenario.

'timer_htimer_htw_perf_autotest' suggests that rte_htimer is ~3-5x faster than rte_timer for timer/timeout-heavy applications, in a scenario where the timer always fires. For a scenario with a mix of canceled and expired timers, the performance difference is greater.

In scenarios with few timeouts, rte_timer has lower overhead than htimer, but both variants consume very little CPU time.

In certain scenarios, rte_timer does not suffer from non-constant-time add and cancel operations. One such case is when the timer added is always last in the list, where htimer is only ~2-3x faster.

The bitset implementation which the HTW implementation depends upon seemed generic enough, and potentially useful outside the world of HTWs, to justify being located in the EAL.

This patchset is very much an RFC, and the author is yet to form an opinion on many important issues.

* If deemed a suitable replacement, should the htimer replace the current DPDK timer library in some particular (ABI-breaking) release, or should it live side-by-side with the then-legacy <rte_timer.h> API?
  A lot of things in and outside DPDK depend on <rte_timer.h>, so coexistence may be required to facilitate a smooth transition.
* Should the htimer and htw-related files be colocated with rte_timer.c in the timer library?
* Would it be useful for applications using asynchronous cancel to have the option of having the timer callback run not only in case of timer expiration, but also cancellation (on the target lcore)? The timer cb signature would need to include an additional parameter in that case.
* Should the rte_htimer be a nested struct, so that the htw parts are separated from the htimer parts?
* <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it be so?
* The rte_htimer struct is only supposed to be used by the application to give an indication of how much memory it needs to allocate, and its members are not supposed to be directly accessed (w/ the possible exception of the owner_lcore_id field). Should there be a dummy struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize() function instead, serving the same purpose? Better encapsulation, but more inconvenient for applications. Run-time dynamic sizing would force application-level dynamic allocations.
* Asynchronous cancellation is a little tricky to use for the application (primarily due to timer memory reclamation/race issues). Should this functionality be removed?
* Should rte_htimer_mgr_init() also retrieve the current time? If so, there should be a variant which allows the user to specify the time (to match rte_htimer_mgr_manage_time()). One pitfall with the current proposed API is an application calling rte_htimer_mgr_init() and then immediately adding a timer with a relative timeout, in which case the current absolute time used is 0, which might be a surprise.
* Would the event timer adapter be best off using <rte_htw.h> directly, or <rte_htimer.h>?
  In the latter case, there needs to be a way to instantiate more HTWs (similar to the "alt" functions of <rte_timer.h>).
* Should the PERIODICAL flag (and the complexity it brings) be removed, leaving the application with only single-shot timers and the option to re-add them in the timer callback?
* Should the async result codes and the sync cancel error codes be merged into one set of result codes?
* Should rte_htimer_mgr_async_add() have a flag which allows buffering add request messages until rte_htimer_mgr_process() (or any manage function) is called? This would reduce ring signaling overhead (i.e., burst enqueue operations instead of single-element enqueue). There could also be a rte_htimer_mgr_async_add_burst() function, solving the same "problem" a different way. (The signature of such a function would not be pretty.)
* Does the functionality provided by the rte_htimer_mgr_process() function match its use cases? Should there be a clearer separation between expiry processing and asynchronous operation processing?
* Should the patchset be split into more commits? If so, how?

Thanks to Erik Carrillo for his assistance.
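The relative-timeout pitfall raised in the rte_htimer_mgr_init() question can be illustrated with a toy model. This is a sketch under stated assumptions, not the proposed API: toy_mgr and all of its functions are hypothetical, and the only thing they model is a manager which caches the time supplied by the latest manage call and uses that cached value to turn relative timeouts into absolute ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical manager state: only the cached time matters here. */
struct toy_mgr {
	uint64_t cached_now; /* time as of the latest manage call */
};

/* Models an init function which does not read the clock. */
static void
toy_mgr_init(struct toy_mgr *m)
{
	m->cached_now = 0;
}

/* Models an init variant where the caller supplies the time. */
static void
toy_mgr_init_time(struct toy_mgr *m, uint64_t now)
{
	m->cached_now = now;
}

/* Models a manage call where the caller supplies the time. */
static void
toy_mgr_manage_time(struct toy_mgr *m, uint64_t now)
{
	assert(now >= m->cached_now);
	m->cached_now = now;
}

/* A relative timeout is resolved against the cached time. */
static uint64_t
toy_mgr_rel_to_abs(const struct toy_mgr *m, uint64_t rel_timeout)
{
	return m->cached_now + rel_timeout;
}

/* True if a timer with this absolute expiry would fire now. */
static bool
toy_mgr_is_due(const struct toy_mgr *m, uint64_t abs_expiry)
{
	return abs_expiry <= m->cached_now;
}
```

With toy_mgr_init(), a relative timeout of 1000 ticks added before the first manage call resolves to absolute time 1000. If the first toy_mgr_manage_time() call then supplies a large counter value (TSC-like clocks rarely start near zero), the timer is already due and fires at once. Seeding the cached time at init, as toy_mgr_init_time() does, is one way to avoid the surprise.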
Mattias Rönnblom (2):
  eal: add bitset type
  eal: add high-performance timer facility

 app/test/meson.build                  |  12 +-
 app/test/test_bitset.c                | 645 +++++++++++++++++++
 app/test/test_htimer_mgr.c            | 674 ++++++++++++++++++++
 app/test/test_htimer_mgr_perf.c       | 322 ++++++++++
 app/test/test_htw.c                   | 478 ++++++++++++++
 app/test/test_htw_perf.c              | 181 ++++++
 app/test/test_timer_htimer_htw_perf.c | 693 ++++++++++++++++++++
 doc/api/doxy-api-index.md             |   5 +-
 doc/api/doxy-api.conf.in              |   1 +
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_bitset.c           |  29 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_bitset.h          | 879 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   3 +
 lib/htimer/meson.build                |   7 +
 lib/htimer/rte_htimer.h               |  68 ++
 lib/htimer/rte_htimer_mgr.c           | 547 ++++++++++++++++
 lib/htimer/rte_htimer_mgr.h           | 516 +++++++++++++++
 lib/htimer/rte_htimer_msg.h           |  44 ++
 lib/htimer/rte_htimer_msg_ring.c      |  18 +
 lib/htimer/rte_htimer_msg_ring.h      |  55 ++
 lib/htimer/rte_htw.c                  | 445 +++++++++++++
 lib/htimer/rte_htw.h                  |  49 ++
 lib/htimer/version.map                |  17 +
 lib/meson.build                       |   1 +
 25 files changed, 5689 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_bitset.c
 create mode 100644 app/test/test_htimer_mgr.c
 create mode 100644 app/test/test_htimer_mgr_perf.c
 create mode 100644 app/test/test_htw.c
 create mode 100644 app/test/test_htw_perf.c
 create mode 100644 app/test/test_timer_htimer_htw_perf.c
 create mode 100644 lib/eal/common/rte_bitset.c
 create mode 100644 lib/eal/include/rte_bitset.h
 create mode 100644 lib/htimer/meson.build
 create mode 100644 lib/htimer/rte_htimer.h
 create mode 100644 lib/htimer/rte_htimer_mgr.c
 create mode 100644 lib/htimer/rte_htimer_mgr.h
 create mode 100644 lib/htimer/rte_htimer_msg.h
 create mode 100644 lib/htimer/rte_htimer_msg_ring.c
 create mode 100644 lib/htimer/rte_htimer_msg_ring.h
 create mode 100644 lib/htimer/rte_htw.c
 create mode 100644 lib/htimer/rte_htw.h
 create mode 100644 lib/htimer/version.map
Comments
On Wed, 15 Mar 2023 18:03:40 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> This patchset is an attempt to introduce a high-performance, highly
> scalable timer facility into DPDK.
>
> More specifically, the goals for the htimer library are:
>
> * Efficient handling of a handful up to hundreds of thousands of
>   concurrent timers.
> * Make adding and canceling timers low-overhead, constant-time
>   operations.
> * Provide a service functionally equivalent to that of
>   <rte_timer.h>. API/ABI backward compatibility is secondary.

Worthwhile goals, and the problem needs to be addressed.
But this patch never got accepted.

Please fix/improve/extend existing rte_timer instead.
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, 3 October 2024 20.37
>
> On Wed, 15 Mar 2023 18:03:40 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> > This patchset is an attempt to introduce a high-performance, highly
> > scalable timer facility into DPDK.
> >
> > More specifically, the goals for the htimer library are:
> >
> > * Efficient handling of a handful up to hundreds of thousands of
> >   concurrent timers.
> > * Make adding and canceling timers low-overhead, constant-time
> >   operations.
> > * Provide a service functionally equivalent to that of
> >   <rte_timer.h>. API/ABI backward compatibility is secondary.
>
> Worthwhile goals, and the problem needs to be addressed.
> But this patch never got accepted.

I think work on it was put on hold due to the requested changes requiring a significant development effort.
I too look forward to work on this being resumed. ;-)

>
> Please fix/improve/extend existing rte_timer instead.

The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.

Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.

When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
On 2024-10-03 23:32, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Thursday, 3 October 2024 20.37
>>
>> On Wed, 15 Mar 2023 18:03:40 +0100
>> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>>
>>> This patchset is an attempt to introduce a high-performance, highly
>>> scalable timer facility into DPDK.
>>>
>>> More specifically, the goals for the htimer library are:
>>>
>>> * Efficient handling of a handful up to hundreds of thousands of
>>>   concurrent timers.
>>> * Make adding and canceling timers low-overhead, constant-time
>>>   operations.
>>> * Provide a service functionally equivalent to that of
>>>   <rte_timer.h>. API/ABI backward compatibility is secondary.
>>
>> Worthwhile goals, and the problem needs to be addressed.
>> But this patch never got accepted.
>
> I think work on it was put on hold due to the requested changes requiring a significant development effort.
> I too look forward to work on this being resumed. ;-)
>
>>
>> Please fix/improve/extend existing rte_timer instead.
>
> The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.
>
> Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.
>
> When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
>

I think DPDK should have two public APIs in the timer area. One is just a bare-bones hierarchical timer wheel API, without callbacks, auto-created per-lcore instances, MT safety or any other of the <rte_timer.h> bells and whistles. It also doesn't make any assumptions about the time source (other than it being monotonic) or resolution.
The other is a new variant of <rte_timer.h>, using the core HTW library for its implementation (and being public, it may also expose this library in its header files, which may be required for efficient operation). The new <rte_timer.h> would provide the same kind of functionality as the old API, but with some quirks and bugs fixed, plus potentially some new functionality added. For example, it would be useful to allow non-preemption safe threads to add and remove timers (something rte_timer and its spinlocks don't allow).

I would consider both "fast path APIs".

In addition, there should probably also be a time source API.

Considering the lead time of relatively small contributions like the bitops extensions and the new bitset API (which still aren't in), I can't imagine how long it would take to get a semi-backward compatible rte_timer with a new implementation, plus a new timer wheel library, into DPDK.
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 6 October 2024 15.03
>
> On 2024-10-03 23:32, Morten Brørup wrote:
> >> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> >> Sent: Thursday, 3 October 2024 20.37
> >>
> >> On Wed, 15 Mar 2023 18:03:40 +0100
> >> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> >>
> >>> This patchset is an attempt to introduce a high-performance, highly
> >>> scalable timer facility into DPDK.
> >>>
> >>> More specifically, the goals for the htimer library are:
> >>>
> >>> * Efficient handling of a handful up to hundreds of thousands of
> >>>   concurrent timers.
> >>> * Make adding and canceling timers low-overhead, constant-time
> >>>   operations.
> >>> * Provide a service functionally equivalent to that of
> >>>   <rte_timer.h>. API/ABI backward compatibility is secondary.
> >>
> >> Worthwhile goals, and the problem needs to be addressed.
> >> But this patch never got accepted.
> >
> > I think work on it was put on hold due to the requested changes requiring a significant development effort.
> > I too look forward to work on this being resumed. ;-)
> >
> >>
> >> Please fix/improve/extend existing rte_timer instead.
> >
> > The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.
> >
> > Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.
> >
> > When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
> >
>
> I think DPDK should have two public APIs in the timer area.

Agree.
> One is a
> just a bare-bones hierarchical timer wheel API, without callbacks,
> auto-created per-lcore instances, MT safety or any other of the
> <rte_timer.h> bells and whistles. It also doesn't make any assumptions
> about the time source (other it being monotonic) or resolution.

The <rte_timer.h> library does not - and is never going to - provide sufficient performance for timer intensive applications, such as packet pacing and fast path TCP/QUIC/whatever congestion control. It is too "fat" for this.

We need a new library with a new API for that.
I agree with Mattias' description of the requirements for such a library.

>
> The other is a new variant of <rte_timer.h>, using the core HTW library
> for its implementation (and being public, it may also expose this
> library in its header files, which may be required for efficient
> operation). The new <rte_timer.h> would provide the same kind of
> functionality as the old API, but with some quirks and bugs fixed, plus
> potentially some new functionality added. For example, it would be
> useful to allow non-preemption safe threads to add and remove timers
> (something rte_timer and its spinlocks doesn't allow).

Agree.

Until that becomes part of DPDK, we will have to stick with what <rte_timer.h> currently offers.

>
> I would consider both "fast path APIs".
>
> In addition, there should probably also be a time source API.

A third library, orthogonal to the two other timer libraries.
But I see why you mention it: It could be somewhat related to the design and implementation of the <rte_timer.h> library.
But, let's please forget about a time source API for now.

>
> Considering the lead time of relatively small contributions like the
> bitops extensions and the new bitset API (which still aren't in), I
> can't imagine how long time it would take to get in a semi-backward
> compatible rte_timer with a new implementation, plus a new timer wheel
> library, into DPDK.

Well said!
Instead of aiming for an unreachable target, let's instead take this approach:
- Provide the new high-performance HTW library as a stand-alone library.
- Postpone improving the <rte_timer.h> library; it can be done any time in the future, if someone cares to do it. And it can use the HTW library or not, whichever is appropriate.

Doing both simultaneously would require a substantial effort, and would cause much backpressure from the community (due to the modified <rte_timer.h> API and implementation).

Although it might be beneficial for the design of the HTW library to consider how an improved <rte_timer.h> would use it, it is not the primary use case of the HTW library, so co-design is not a requirement here.
On 2024-10-06 15:43, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 6 October 2024 15.03
>>
>> On 2024-10-03 23:32, Morten Brørup wrote:
>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>> Sent: Thursday, 3 October 2024 20.37
>>>>
>>>> On Wed, 15 Mar 2023 18:03:40 +0100
>>>> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>>> This patchset is an attempt to introduce a high-performance, highly
>>>>> scalable timer facility into DPDK.
>>>>>
>>>>> More specifically, the goals for the htimer library are:
>>>>>
>>>>> * Efficient handling of a handful up to hundreds of thousands of
>>>>>   concurrent timers.
>>>>> * Make adding and canceling timers low-overhead, constant-time
>>>>>   operations.
>>>>> * Provide a service functionally equivalent to that of
>>>>>   <rte_timer.h>. API/ABI backward compatibility is secondary.
>>>>
>>>> Worthwhile goals, and the problem needs to be addressed.
>>>> But this patch never got accepted.
>>>
>>> I think work on it was put on hold due to the requested changes requiring a significant development effort.
>>> I too look forward to work on this being resumed. ;-)
>>>
>>>>
>>>> Please fix/improve/extend existing rte_timer instead.
>>>
>>> The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.
>>>
>>> Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.
>>>
>>> When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
>>>
>>
>> I think DPDK should have two public APIs in the timer area.
>
> Agree.
>
>> One is a
>> just a bare-bones hierarchical timer wheel API, without callbacks,
>> auto-created per-lcore instances, MT safety or any other of the
>> <rte_timer.h> bells and whistles. It also doesn't make any assumptions
>> about the time source (other it being monotonic) or resolution.
>
> The <rte_timer.h> library does not - and is never going to - provide sufficient performance for timer intensive applications, such as packet pacing and fast path TCP/QUIC/whatever congestion control. It is too "fat" for this.
>
> We need a new library with a new API for that.
> I agree with Mattias' description of the requirements for such a library.
>
>>
>> The other is a new variant of <rte_timer.h>, using the core HTW library
>> for its implementation (and being public, it may also expose this
>> library in its header files, which may be required for efficient
>> operation). The new <rte_timer.h> would provide the same kind of
>> functionality as the old API, but with some quirks and bugs fixed, plus
>> potentially some new functionality added. For example, it would be
>> useful to allow non-preemption safe threads to add and remove timers
>> (something rte_timer and its spinlocks doesn't allow).
>
> Agree.
>
> Until that becomes part of DPDK, we will have to stick with what <rte_timer.h> currently offers.
>
>>
>> I would consider both "fast path APIs".
>>
>> In addition, there should probably also be a time source API.
>
> A third library, orthogonal to the two other timer libraries.
> But I see why you mention it: It could be somewhat related to the design and implementation of the <rte_timer.h> library.
> But, let's please forget about a time source API for now.
>
>>
>> Considering the lead time of relatively small contributions like the
>> bitops extensions and the new bitset API (which still aren't in), I
>> can't imagine how long time it would take to get in a semi-backward
>> compatible rte_timer with a new implementation, plus a new timer wheel
>> library, into DPDK.
>
> Well said!
>
> Instead of aiming for an unreachable target, let's instead take this approach:
> - Provide the new high-performance HTW library as a stand-alone library.
> - Postpone improving the <rte_timer.h> library; it can be done any time in the future, if someone cares to do it. And it can use the HTW library or not, whichever is appropriate.
>
> Doing both simultaneously would require a substantial effort, and would cause much backpressure from the community (due to the modified <rte_timer.h> API and implementation).
>
> Although it might be beneficial for the design of the HTW library to consider how an improved <rte_timer.h> would use it, it is not the primary use case of the HTW library, so co-design is not a requirement here.
>

Postponing rte_timer improvements would also mean postponing most of the benefits of the new timer wheel, in my opinion. In most scenarios, I think you want to have all application modules sharing timer wheel instances, preferably without having to agree on a proprietary timer API. Here rte_timer shines.

Also, you want to get the HTW library *exactly* right for the rte_timer use case. Making it a public API would make changes to its API painful, to address any shortcomings you accidentally designed in.

To be on the safe side, you would need to have a new rte_timer implementation ready upon submitting a HTW library. That in turn would require a techboard ACK on the necessity of rte_timer API tweaks, otherwise all your work may be wasted.