Message ID | 20211224225923.806498-1-dharmik.thakkar@arm.com (mailing list archive) |
---|---|
Headers |
From: Dharmik Thakkar <dharmik.thakkar@arm.com> Cc: dev@dpdk.org, nd@arm.com, honnappa.nagarahalli@arm.com, ruifeng.wang@arm.com Subject: [PATCH 0/1] mempool: implement index-based per core cache Date: Fri, 24 Dec 2021 16:59:22 -0600 Message-Id: <20211224225923.806498-1-dharmik.thakkar@arm.com> In-Reply-To: <20210930172735.2675627-1-dharmik.thakkar@arm.com> References: <20210930172735.2675627-1-dharmik.thakkar@arm.com> List-Id: DPDK patches and discussions <dev.dpdk.org> |
Series | mempool: implement index-based per core cache |
Message
Dharmik Thakkar
Dec. 24, 2021, 10:59 p.m. UTC
Current mempool per-core cache implementation stores pointers to mbufs.
On 64-bit architectures, each pointer consumes 8 B.
This patch replaces it with an index-based implementation,
wherein each buffer is addressed by (pool base address + index).
It reduces the amount of memory/cache required for the per-core cache.

L3Fwd performance testing reveals minor improvements in cache
performance (L1 and L2 misses reduced by 0.60%) with no change in
throughput.

Micro-benchmarking the patch using mempool_perf_test shows significant
improvement with the majority of the test cases.

Number of cores = 1:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.01
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=19.91
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-20.37 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.01 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-25.06 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.81 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=53.93
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=60.90
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.64
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.76
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-4.71 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.19 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=65.63
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=75.19
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.75
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=15.52
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.45
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.58

Number of cores = 2:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.21
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=21.89
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-21.21 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.05 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-26.09 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.49 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=56.28
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=67.69
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.45
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.84
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-5.27 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.09 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=76.11
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=86.06
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.86
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=16.55
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.01
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.51

From analyzing the results, it is clear that for n_get_bulk and n_put_bulk
sizes of 32 there is no performance regression. IMO, the other sizes are
not practical from a performance perspective, and the regression in those
cases can be safely ignored.

Dharmik Thakkar (1):
  mempool: implement index-based per core cache

 lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
 lib/mempool/rte_mempool_ops_default.c |   7 ++
 2 files changed, 119 insertions(+), 2 deletions(-)
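The addressing scheme the cover letter describes can be sketched as follows. This is a hypothetical illustration, not the actual patch code; the struct and function names are invented. The key point is that a per-core cache of 4-byte indices has half the footprint of a cache of 8-byte pointers on 64-bit architectures, and an object is recovered as pool base address + index:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 512

/* Per-core cache storing 8-byte pointers (current) vs. 4-byte indices
 * (patched): on 64-bit, the objs[] array shrinks by half. */
struct ptr_cache { uint32_t len; void *objs[CACHE_SIZE]; };
struct idx_cache { uint32_t len; uint32_t objs[CACHE_SIZE]; };

/* Recover an object from its index: pool base address + index. With a
 * plain byte offset, a 32-bit index limits the pool to 4 GB. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
	return (char *)pool_base + idx;
}
```

This also makes the 4 GB limit discussed in the comments below concrete: a `uint32_t` byte offset cannot address past 2^32 bytes from the base.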
Comments
> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> Sent: Friday, 24 December 2021 23.59
>
> Current mempool per core cache implementation stores pointers to mbufs
> On 64b architectures, each pointer consumes 8B
> This patch replaces it with index-based implementation,
> where in each buffer is addressed by (pool base address + index)
> It reduces the amount of memory/cache required for per core cache
>
> [...]

I still think this is very interesting. And your performance numbers are looking good.

However, it limits the size of a mempool to 4 GB. As previously discussed, the max mempool size can be increased by multiplying the index with a constant.

I would suggest using sizeof(uintptr_t) as the constant multiplier, so the mempool can hold objects of any size divisible by sizeof(uintptr_t). And it would be silly to use a mempool to hold objects smaller than sizeof(uintptr_t).

How does the performance look if you multiply the index by sizeof(uintptr_t)?

Med venlig hilsen / Kind regards,
-Morten Brørup
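The multiplier idea suggested above can be sketched like this. This is an illustrative assumption of how it might look, not code from the patch; the names are invented. With a multiplier of sizeof(uintptr_t) (8 on 64-bit), the 32-bit index addresses 2^32 * 8 B = 32 GB, at the cost of one multiply (or shift) on the conversion path and the requirement that object sizes be divisible by the multiplier:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant multiplier for the 32-bit index. */
#define OBJ_MULTIPLIER sizeof(uintptr_t)

/* Index -> object: one multiply/shift added to the dequeue path. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
	return (char *)pool_base + (uintptr_t)idx * OBJ_MULTIPLIER;
}

/* Object -> index: valid only if every object offset is a multiple
 * of OBJ_MULTIPLIER, hence the divisibility requirement above. */
static inline uint32_t
obj_to_index(const void *pool_base, const void *obj)
{
	return (uint32_t)(((const char *)obj - (const char *)pool_base)
			  / OBJ_MULTIPLIER);
}
```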
On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > Sent: Friday, 24 December 2021 23.59
> >
> > [...]
>
> I still think this is very interesting. And your performance numbers are
> looking good.
>
> However, it limits the size of a mempool to 4 GB. As previously
> discussed, the max mempool size can be increased by multiplying the
> index with a constant.
>
> I would suggest using sizeof(uintptr_t) as the constant multiplier, so
> the mempool can hold objects of any size divisible by sizeof(uintptr_t).
> And it would be silly to use a mempool to hold objects smaller than
> sizeof(uintptr_t).
>
> How does the performance look if you multiply the index by
> sizeof(uintptr_t)?
>
Each mempool entry is cache aligned, so we can use that if we want a
bigger multiplier.
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 12.16
>
> [...]
>
> Each mempool entry is cache aligned, so we can use that if we want a
> bigger multiplier.

Thanks for chiming in, Bruce.

Please also read this discussion about the multiplier:
http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 7 January 2022 12.16
> >
> > [...]
> >
> > Each mempool entry is cache aligned, so we can use that if we want a
> > bigger multiplier.
>
> Thanks for chiming in, Bruce.
>
> Please also read this discussion about the multiplier:
> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
>
I actually wondered after I had sent the email whether we had indeed an
option to disable the cache alignment or not! Thanks for pointing out that
we do. This brings a couple of additional thoughts:

* Using indexes for the cache should probably be a runtime flag rather
  than a build-time one.
* It would seem reasonable to me to disallow use of the indexed-cache
  flag and the non-cache-aligned flag simultaneously.
* On the off chance that that restriction is unacceptable, we can make
  things a little more complicated by doing a runtime computation of
  the "index-shiftwidth" to use.

Overall, I think defaulting to cacheline shiftwidth and disallowing
index-based addressing when using unaligned buffers is simplest and
easiest, unless we can come up with a valid usecase for needing more
than that.

/Bruce
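The "cacheline shiftwidth" default proposed above can be sketched as a compile-time shift. This is a hypothetical illustration under the assumption of 64-byte cache-line-aligned objects; the macro and function names are invented, not from any patch. Because every object offset is then a multiple of 64, the index can be shifted rather than multiplied, extending the addressable range to 2^32 * 64 B = 256 GB while keeping the fast path to a single shift-and-add:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache line size; on most platforms DPDK's
 * RTE_CACHE_LINE_SIZE is 64. */
#define CACHE_LINE_SIZE 64
#define INDEX_SHIFT 6			/* log2(CACHE_LINE_SIZE) */

/* Index -> object: single shift-and-add on the fast path. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
	return (char *)pool_base + ((uintptr_t)idx << INDEX_SHIFT);
}

/* Object -> index: only correct when every object starts on a
 * cache line boundary, which is why unaligned buffers would have
 * to be disallowed with this scheme. */
static inline uint32_t
obj_to_index(const void *pool_base, const void *obj)
{
	return (uint32_t)(((const char *)obj - (const char *)pool_base)
			  >> INDEX_SHIFT);
}
```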
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 14.51
>
> [...]
>
> * Using indexes for the cache should probably be a runtime flag rather
>   than a build-time one.
> * It would seem reasonable to me to disallow use of the indexed-cache
>   flag and the non-cache aligned flag simultaneously.
> * On the offchance that that restriction is unacceptable, then we can
>   make things a little more complicated by doing a runtime computation
>   of the "index-shiftwidth" to use.
>
> Overall, I think defaulting to cacheline shiftwidth and disallowing
> index-based addressing when using unaligned buffers is simplest and
> easiest unless we can come up with a valid usecase for needing more
> than that.
>
> /Bruce

This feature is a performance optimization.

With that in mind, it should not introduce function pointers or similar run-time checks in the fast path to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile-time constant, probably something between sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).

The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).

With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.

If the multiplier is small (i.e. 8 bytes), the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 bytes; but I don't think anyone would use a mempool to hold objects smaller than 8 bytes.)

If the multiplier is larger (i.e. the 64-byte cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.

Note: 32-bit platforms have no benefit from this patch: the pointer already only uses 4 bytes, so replacing the pointer with a 4-byte index makes no difference.

Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile-time option.

If this feature is a compile-time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so compilation of applications using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support the larger mempool.

Here is another thought: if only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use the index-based cache as default, making the opposite (i.e. the pointer-based cache) a compile-time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.

Although DPDK is moving away from compile-time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK.

If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
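The size limits quoted in the discussion above (4 GB, 32 GB, 256 GB) all follow from the same arithmetic: 2^32 distinct indices times the multiplier. A quick check of the figures (illustrative helper, not library code):

```c
#include <assert.h>
#include <stdint.h>

/* Maximum mempool size addressable by a 32-bit index scaled by a
 * constant multiplier: 2^32 * multiplier bytes. */
static inline uint64_t
max_pool_bytes(uint64_t multiplier)
{
	return (1ULL << 32) * multiplier;
}
```

With multiplier 1 (plain byte offset) this gives 4 GB, with 8 (sizeof(uintptr_t) on 64-bit) 32 GB, and with 64 (cache line size) 256 GB, matching the three tradeoffs listed above.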
On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 7 January 2022 14.51
> >
> > On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Friday, 7 January 2022 12.16
> > > >
> > > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > > > > > Sent: Friday, 24 December 2021 23.59
> > > > > >
> > > > > > The current mempool per core cache implementation stores pointers
> > > > > > to mbufs. On 64b architectures, each pointer consumes 8B. This
> > > > > > patch replaces it with an index-based implementation, wherein each
> > > > > > buffer is addressed by (pool base address + index). It reduces the
> > > > > > amount of memory/cache required for the per core cache.
> > > > > >
> > > > > > L3Fwd performance testing reveals minor improvements in the cache
> > > > > > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > > > > > throughput.
> > > > > >
> > > > > > Micro-benchmarking the patch using mempool_perf_test shows
> > > > > > significant improvement with the majority of the test cases.
> > > > >
> > > > > I still think this is very interesting. And your performance numbers
> > > > > are looking good.
> > > > >
> > > > > However, it limits the size of a mempool to 4 GB. As previously
> > > > > discussed, the max mempool size can be increased by multiplying the
> > > > > index with a constant.
> > > > >
> > > > > I would suggest using sizeof(uintptr_t) as the constant multiplier,
> > > > > so the mempool can hold objects of any size divisible by
> > > > > sizeof(uintptr_t). And it would be silly to use a mempool to hold
> > > > > objects smaller than sizeof(uintptr_t).
> > > > >
> > > > > How does the performance look if you multiply the index by
> > > > > sizeof(uintptr_t)?
> > > >
> > > > Each mempool entry is cache aligned, so we can use that if we want a
> > > > bigger multiplier.
> > >
> > > Thanks for chiming in, Bruce.
> > >
> > > Please also read this discussion about the multiplier:
> > > http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >
> > I actually wondered after I had sent the email whether we had indeed an
> > option to disable the cache alignment or not! Thanks for pointing out
> > that we do. This brings a couple of additional thoughts:
> >
> > * Using indexes for the cache should probably be a runtime flag rather
> >   than a build-time one.
> > * It would seem reasonable to me to disallow use of the indexed-cache
> >   flag and the non-cache-aligned flag simultaneously.
> > * On the offchance that that restriction is unacceptable, then we can
> >   make things a little more complicated by doing a runtime computation
> >   of the "index-shiftwidth" to use.
> >
> > Overall, I think defaulting to cacheline shiftwidth and disallowing
> > index-based addressing when using unaligned buffers is simplest and
> > easiest, unless we can come up with a valid usecase for needing more
> > than that.
> >
> > /Bruce
>
> This feature is a performance optimization.
>
> With that in mind, it should not introduce function pointers or similar run-time checks in the fast path to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile time constant, probably something between sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
>
> The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>
> With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.
>
> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that the mempool size is limited to 32 GB.
>
> [snip]

Agree with Morten's view on this.

> If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
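The size limits traded off in this thread follow directly from the 32-bit index: the maximum addressable pool is 2^32 times the multiplier. A quick sanity check of the figures quoted above (4 GB, 32 GB, 256 GB), purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/*
 * With a 32-bit index, the addressable pool size is 2^32 * multiplier:
 *   multiplier 1                  -> 4 GB (no multiplier, index = byte offset)
 *   multiplier sizeof(uintptr_t)  -> 32 GB on 64-bit
 *   multiplier 64 (cache line)    -> 256 GB
 */
static inline uint64_t
max_pool_bytes(uint64_t multiplier)
{
	return (UINT64_C(1) << 32) * multiplier;
}
```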
Hi,

Thank you for your valuable review comments and suggestions!

I will be sending out a v2 in which I have increased the size of the mempool to 32 GB by using division by sizeof(uintptr_t).
However, I am seeing ~5% performance degradation with mempool_perf_autotest (for bulk size of 32) with this change
when compared to the base performance.
Earlier, without this change, I was seeing an improvement of ~13% compared to the base performance. So, this is a significant degradation.
I would appreciate your review comments on v2.

Thank you!

> On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> [snip]
>
> Agree with Morten's view on this.
>
>> If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
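Since sizeof(uintptr_t) is a power of two, the division Dharmik mentions reduces to a shift; Bruce's "index-shiftwidth" idea amounts to choosing that shift once at pool creation. A hypothetical sketch of what that could look like (helper names are invented for illustration, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a runtime "index-shiftwidth": the multiplier is a
 * power of two, so the fast-path divide/multiply becomes a shift. The shift
 * would be computed once at pool creation from the object alignment, not
 * per operation. (__builtin_ctzl is a GCC/Clang builtin: count trailing
 * zeros, i.e. log2 of a power of two.)
 */
static inline unsigned int
index_shiftwidth(uintptr_t obj_align)
{
	return (unsigned int)__builtin_ctzl(obj_align);
}

static inline uint32_t
obj_to_index_shift(uintptr_t pool_base, const void *obj, unsigned int shift)
{
	return (uint32_t)(((uintptr_t)obj - pool_base) >> shift);
}

static inline void *
index_to_obj_shift(uintptr_t pool_base, uint32_t idx, unsigned int shift)
{
	return (void *)(pool_base + ((uintptr_t)idx << shift));
}
```

Passing the shift as a runtime parameter is exactly the extra fast-path cost Morten argues against; making it a compile-time constant lets the compiler fold it away, which may be relevant to the ~5% degradation reported for the division-based v2.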
On Thu, 13 Jan 2022 05:31:18 +0000 Dharmik Thakkar <Dharmik.Thakkar@arm.com> wrote:

> Hi,
>
> Thank you for your valuable review comments and suggestions!
>
> I will be sending out a v2 in which I have increased the size of the mempool to 32 GB by using division by sizeof(uintptr_t).
> However, I am seeing ~5% performance degradation with mempool_perf_autotest (for bulk size of 32) with this change
> when compared to the base performance.
> Earlier, without this change, I was seeing an improvement of ~13% compared to base performance. So, this is a significant degradation.
> I would appreciate your review comments on v2.
>
> Thank you!
>
> [snip]

NAK.

Having compile time stuff like this means one side or the other is not tested by the CI infrastructure.

There never was sufficient justification, and lots of objections. Dropping the patch.