From patchwork Fri Apr 12 20:20:36 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Honnappa Nagarahalli X-Patchwork-Id: 52731 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 5C7721B216; Fri, 12 Apr 2019 22:20:52 +0200 (CEST) Received: from foss.arm.com (foss.arm.com [217.140.101.70]) by dpdk.org (Postfix) with ESMTP id E6CFC1B213 for ; Fri, 12 Apr 2019 22:20:50 +0200 (CEST) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 1F83115AB; Fri, 12 Apr 2019 13:20:50 -0700 (PDT) Received: from qc2400f-1.austin.arm.com (qc2400f-1.austin.arm.com [10.118.13.209]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id A1E583F557; Fri, 12 Apr 2019 13:20:49 -0700 (PDT) From: Honnappa Nagarahalli To: konstantin.ananyev@intel.com, stephen@networkplumber.org, paulmck@linux.ibm.com, marko.kovacevic@intel.com, dev@dpdk.org Cc: honnappa.nagarahalli@arm.com, gavin.hu@arm.com, dharmik.thakkar@arm.com, malvika.gupta@arm.com Date: Fri, 12 Apr 2019 15:20:36 -0500 Message-Id: <20190412202039.46902-1-honnappa.nagarahalli@arm.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> References: <20181122033055.3431-1-honnappa.nagarahalli@arm.com> Subject: [dpdk-dev] [PATCH v5 0/3] lib/rcu: add RCU library supporting QSBR mechanism X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Lock-less data structures provide scalability and determinism. They enable use cases where locking may not be allowed (for ex: real-time applications). In the following paras, the term 'memory' refers to memory allocated by typical APIs like malloc or anything that is representative of memory, for ex: an index of a free element array. Since these data structures are lock less, the writers and readers are accessing the data structures concurrently. Hence, while removing an element from a data structure, the writers cannot return the memory to the allocator, without knowing that the readers are not referencing that element/memory anymore. Hence, it is required to separate the operation of removing an element into 2 steps: Delete: in this step, the writer removes the reference to the element from the data structure but does not return the associated memory to the allocator. This will ensure that new readers will not get a reference to the removed element. Removing the reference is an atomic operation. Free(Reclaim): in this step, the writer returns the memory to the memory allocator, only after knowing that all the readers have stopped referencing the deleted element. This library helps the writer determine when it is safe to free the memory. This library makes use of thread Quiescent State (QS). QS can be defined as 'any point in the thread execution where the thread does not hold a reference to shared memory'. It is upto the application to determine its quiescent state. Let us consider the following diagram: Time --------------------------------------------------> | | RT1 $++++****D1****+++***D2*|**+++|+++**D3*****++++$ | | RT2 $++++****D1****++|+**D2|***++++++**D3*****++++$ | | RT3 $++++****D1****+++***|D2***|++++++**D2*****++++$ | | |<--->| Del | Free | Cannot free memory during this period (Grace Period) RTx - Reader thread < and > - Start and end of while(1) loop ***Dx*** - Reader thread is accessing the shared data structure Dx. i.e. critical section. +++ - Reader thread is not accessing any shared data structure. i.e. non critical section or quiescent state. Del - Point in time when the reference to the entry is removed using atomic operation. Free - Point in time when the writer can free the entry. Grace Period - Time duration between Del and Free, during which memory cannot be freed. As shown, thread RT1 accesses data structures D1, D2 and D3. When it is accessing D2, if the writer has to remove an element from D2, the writer cannot free the memory associated with that element immediately. The writer can return the memory to the allocator only after the reader stops referencing D2. In other words, reader thread RT1 has to enter a quiescent state. Similarly, since thread RT3 is also accessing D2, writer has to wait till RT3 enters quiescent state as well. However, the writer does not need to wait for RT2 to enter quiescent state. Thread RT2 was not accessing D2 when the delete operation happened. So, RT2 will not get a reference to the deleted entry. It can be noted that, the critical sections for D2 and D3 are quiescent states for D1. i.e. for a given data structure Dx, any point in the thread execution that does not reference Dx is a quiescent state. Since memory is not freed immediately, there might be a need for provisioning of additional memory, depending on the application requirements. It is important to make sure that this library keeps the overhead of identifying the end of grace period and subsequent freeing of memory, to a minimum. The following paras explain how grace period and critical section affect this overhead. The writer has to poll the readers to identify the end of grace period. Polling introduces memory accesses and wastes CPU cycles. The memory is not available for reuse during grace period. Longer grace periods exasperate these conditions. The length of the critical section and the number of reader threads is proportional to the duration of the grace period. Keeping the critical sections smaller will keep the grace period smaller. However, keeping the critical sections smaller requires additional CPU cycles(due to additional reporting) in the readers. Hence, we need the characteristics of small grace period and large critical section. This library addresses this by allowing the writer to do other work without having to block till the readers report their quiescent state. For DPDK applications, the start and end of while(1) loop (where no references to shared data structures are kept) act as perfect quiescent states. This will combine all the shared data structure accesses into a single, large critical section which helps keep the overhead on the reader side to a minimum. DPDK supports pipeline model of packet processing and service cores. In these use cases, a given data structure may not be used by all the workers in the application. The writer does not have to wait for all the workers to report their quiescent state. To provide the required flexibility, this library has a concept of QS variable. The application can create one QS variable per data structure to help it track the end of grace period for each data structure. This helps keep the grace period to a minimum. The application has to allocate memory and initialize a QS variable. Application can call rte_rcu_qsbr_get_memsize to calculate the size of memory to allocate. This API takes maximum number of reader threads, using this variable, as a parameter. Currently, a maximum of 1024 threads are supported. Further, the application can initialize a QS variable using the API rte_rcu_qsbr_init. Each reader thread is assumed to have a unique thread ID. Currently, the management of the thread ID (for ex: allocation/free) is left to the application. The thread ID should be in the range of 0 to maximum number of threads provided while creating the QS variable. The application could also use lcore_id as the thread ID where applicable. rte_rcu_qsbr_thread_register API will register a reader thread to report its quiescent state. This can be called from a reader thread. A control plane thread can also call this on behalf of a reader thread. The reader thread must call rte_rcu_qsbr_thread_online API to start reporting its quiescent state. Some of the use cases might require the reader threads to make blocking API calls (for ex: while using eventdev APIs). The writer thread should not wait for such reader threads to enter quiescent state. The reader thread must call rte_rcu_qsbr_thread_offline API, before calling blocking APIs. It can call rte_rcu_qsbr_thread_online API once the blocking API call returns. The writer thread can trigger the reader threads to report their quiescent state by calling the API rte_rcu_qsbr_start. It is possible for multiple writer threads to query the quiescent state status simultaneously. Hence, rte_rcu_qsbr_start returns a token to each caller. The writer thread has to call rte_rcu_qsbr_check API with the token to get the current quiescent state status. Option to block till all the reader threads enter the quiescent state is provided. If this API indicates that all the reader threads have entered the quiescent state, the application can free the deleted entry. The APIs rte_rcu_qsbr_start and rte_rcu_qsbr_check are lock free. Hence, they can be called concurrently from multiple writers even while running as worker threads. The separation of triggering the reporting from querying the status provides the writer threads flexibility to do useful work instead of blocking for the reader threads to enter the quiescent state or go offline. This reduces the memory accesses due to continuous polling for the status. rte_rcu_qsbr_synchronize API combines the functionality of rte_rcu_qsbr_start and blocking rte_rcu_qsbr_check into a single API. This API triggers the reader threads to report their quiescent state and polls till all the readers enter the quiescent state or go offline. This API does not allow the writer to do useful work while waiting and also introduces additional memory accesses due to continuous polling. The reader thread must call rte_rcu_qsbr_thread_offline and rte_rcu_qsbr_thread_unregister APIs to remove itself from reporting its quiescent state. The rte_rcu_qsbr_check API will not wait for this reader thread to report the quiescent state status anymore. The reader threads should call rte_rcu_qsbr_update API to indicate that they entered a quiescent state. This API checks if a writer has triggered a quiescent state query and update the state accordingly. Patch v5: 1) Library changes a) Removed extra alignment in rte_rcu_qsbr_get_memsize API (Paul) b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock APIs (Paul) c) Clarified the need for 64b counters (Paul) 2) Test cases a) Added additional performance test cases to benchmark __rcu_qsbr_check_all b) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock calls in various test cases 3) Documentation a) Added rte_rcu_qsbr_lock/rte_rcu_qsbr_unlock usage description Patch v4: 1) Library changes a) Fixed the compilation issue on x86 (Konstantin) b) Rebased with latest master Patch v3: 1) Library changes a) Moved the registered thread ID array to the end of the structure (Konstantin) b) Removed the compile time constant RTE_RCU_MAX_THREADS c) Added code to keep track of registered number of threads Patch v2: 1) Library changes a) Corrected the RTE_ASSERT checks (Konstantin) b) Replaced RTE_ASSERT with 'if' checks for non-datapath APIs (Konstantin) c) Made rte_rcu_qsbr_thread_register/unregister non-datapath critical APIs d) Renamed rte_rcu_qsbr_update to rte_rcu_qsbr_quiescent (Ola) e) Used rte_smp_mb() in rte_rcu_qsbr_thread_online API for x86 (Konstantin) f) Removed the macro to access the thread QS counters (Konstantin) 2) Test cases a) Added additional test cases for removing RTE_ASSERT 3) Documentation a) Changed the figure to make it bigger (Marko) b) Spelling and format corrections (Marko) Patch v1: 1) Library changes a) Changed the maximum number of reader threads to 1024 b) Renamed rte_rcu_qsbr_register/unregister_thread to rte_rcu_qsbr_thread_register/unregister c) Added rte_rcu_qsbr_thread_online/offline API. These are optimized version of rte_rcu_qsbr_thread_register/unregister API. These also provide the flexibility for performance when the requested maximum number of threads is higher than the current number of threads. d) Corrected memory orderings in rte_rcu_qsbr_update e) Changed the signature of rte_rcu_qsbr_start API to return the token f) Changed the signature of rte_rcu_qsbr_start API to not take the expected number of QS states to wait. g) Added debug logs h) Added API and programmer guide documentation. RFC v3: 1) Library changes a) Rebased to latest master b) Added new API rte_rcu_qsbr_get_memsize c) Add support for memory allocation for QSBR variable (Konstantin) d) Fixed a bug in rte_rcu_qsbr_check (Konstantin) 2) Testcase changes a) Separated stress tests into a performance test case file b) Added performance statistics RFC v2: 1) Cover letter changes a) Explian the parameters that affect the overhead of using RCU and their effect b) Explain how this library addresses these effects to keep the overhead to minimum 2) Library changes a) Rename the library to avoid confusion (Matias, Bruce, Konstantin) b) Simplify the code/remove APIs to keep this library inline with other synchronisation mechanisms like locks (Konstantin) c) Change the design to support more than 64 threads (Konstantin) d) Fixed version map to remove static inline functions 3) Testcase changes a) Add boundary and additional functional test cases b) Add stress test cases (Paul E. McKenney) Dharmik Thakkar (1): test/rcu_qsbr: add API and functional tests Honnappa Nagarahalli (2): rcu: add RCU library supporting QSBR mechanism doc/rcu: add lib_rcu documentation MAINTAINERS | 5 + app/test/Makefile | 2 + app/test/autotest_data.py | 12 + app/test/meson.build | 5 + app/test/test_rcu_qsbr.c | 1014 +++++++++++++++++ app/test/test_rcu_qsbr_perf.c | 703 ++++++++++++ config/common_base | 6 + doc/api/doxy-api-index.md | 3 +- doc/api/doxy-api.conf.in | 1 + .../prog_guide/img/rcu_general_info.svg | 509 +++++++++ doc/guides/prog_guide/index.rst | 1 + doc/guides/prog_guide/rcu_lib.rst | 185 +++ lib/Makefile | 2 + lib/librte_rcu/Makefile | 23 + lib/librte_rcu/meson.build | 5 + lib/librte_rcu/rte_rcu_qsbr.c | 237 ++++ lib/librte_rcu/rte_rcu_qsbr.h | 645 +++++++++++ lib/librte_rcu/rte_rcu_version.map | 11 + lib/meson.build | 2 +- mk/rte.app.mk | 1 + 20 files changed, 3370 insertions(+), 2 deletions(-) create mode 100644 app/test/test_rcu_qsbr.c create mode 100644 app/test/test_rcu_qsbr_perf.c create mode 100644 doc/guides/prog_guide/img/rcu_general_info.svg create mode 100644 doc/guides/prog_guide/rcu_lib.rst create mode 100644 lib/librte_rcu/Makefile create mode 100644 lib/librte_rcu/meson.build create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h create mode 100644 lib/librte_rcu/rte_rcu_version.map