[0/1] mempool: implement index-based per core cache

Message ID 20211224225923.806498-1-dharmik.thakkar@arm.com (mailing list archive)
Series: mempool: implement index-based per core cache

Message

Dharmik Thakkar Dec. 24, 2021, 10:59 p.m. UTC
  The current mempool per core cache implementation stores pointers to mbufs.
On 64-bit architectures, each pointer consumes 8 B.
This patch replaces it with an index-based implementation,
wherein each buffer is addressed by (pool base address + index).
It reduces the amount of memory/cache required for the per core cache.
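
In other words, the conversion is roughly the following (an illustrative
sketch assuming a single contiguous object region; these helper names are
not the patch's actual API):

#include <stdint.h>

/* Sketch only: all objects are assumed to live in one contiguous region
 * starting at pool_base, so a 32-bit byte offset addresses up to 4 GB. */
static inline uint32_t
obj_to_index(const void *pool_base, const void *obj)
{
        return (uint32_t)((uintptr_t)obj - (uintptr_t)pool_base);
}

static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
        return (void *)((uintptr_t)pool_base + idx);
}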

L3Fwd performance testing reveals minor improvements in cache
performance (L1 and L2 misses reduced by 0.60%),
with no change in throughput.

Micro-benchmarking the patch using mempool_perf_test shows
significant improvement in the majority of the test cases.

Number of cores = 1:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.01
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=19.91
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-20.37 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.01 (regression) 
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-25.06 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.81 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=53.93
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=60.90
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.64
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.76
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-4.71 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.19 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=65.63
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=75.19
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.75
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=15.52
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.45
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.58

Number of cores = 2:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.21
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=21.89
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-21.21 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.05 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-26.09 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.49 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=56.28
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=67.69
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.45
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.84
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-5.27 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.09 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=76.11
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=86.06
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.86
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=16.55
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.01
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.51


From analyzing the results, it is clear that for n_get_bulk and
n_put_bulk sizes of 32 there is no performance regression.
IMO, the other sizes are not practical from a performance perspective,
and the regression in those cases can be safely ignored.

Dharmik Thakkar (1):
  mempool: implement index-based per core cache

 lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
 lib/mempool/rte_mempool_ops_default.c |   7 ++
 2 files changed, 119 insertions(+), 2 deletions(-)
  

Comments

Morten Brørup Dec. 25, 2021, 12:16 a.m. UTC | #1
> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> Sent: Friday, 24 December 2021 23.59
> 
> Current mempool per core cache implementation stores pointers to mbufs
> On 64b architectures, each pointer consumes 8B
> This patch replaces it with index-based implementation,
> where in each buffer is addressed by (pool base address + index)
> It reduces the amount of memory/cache required for per core cache
> 
> L3Fwd performance testing reveals minor improvements in the cache
> performance (L1 and L2 misses reduced by 0.60%)
> with no change in throughput
> 
> Micro-benchmarking the patch using mempool_perf_test shows
> significant improvement with majority of the test cases
> 
> [benchmark results, analysis, and diffstat snipped]
> 

I still think this is very interesting. And your performance numbers are looking good.

However, it limits the size of a mempool to 4 GB. As previously discussed, the max mempool size can be increased by multiplying the index with a constant.

I would suggest using sizeof(uintptr_t) as the constant multiplier, so the mempool can hold objects of any size divisible by sizeof(uintptr_t). And it would be silly to use a mempool to hold objects smaller than sizeof(uintptr_t).

How does the performance look if you multiply the index by sizeof(uintptr_t)?
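
With such a multiplier, the conversion would become roughly the following
(hypothetical sketch, not actual patch code):

/* Hypothetical: the index counts sizeof(uintptr_t)-byte units, so a
 * 32-bit index spans 4 GB * sizeof(uintptr_t) = 32 GB on 64-bit. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
        return (void *)((uintptr_t)pool_base +
                        (uintptr_t)idx * sizeof(uintptr_t));
}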


Med venlig hilsen / Kind regards,
-Morten Brørup
  
Bruce Richardson Jan. 7, 2022, 11:15 a.m. UTC | #2
On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent: Friday, 24
> > December 2021 23.59
> > 
> > Current mempool per core cache implementation stores pointers to mbufs
> > On 64b architectures, each pointer consumes 8B This patch replaces it
> > with index-based implementation, where in each buffer is addressed by
> > (pool base address + index) It reduces the amount of memory/cache
> > required for per core cache
> > 
> > L3Fwd performance testing reveals minor improvements in the cache
> > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > throughput
> > 
> > Micro-benchmarking the patch using mempool_perf_test shows significant
> > improvement with majority of the test cases
> > 
> > [benchmark results, analysis, and diffstat snipped]
> 
> I still think this is very interesting. And your performance numbers are
> looking good.
> 
> However, it limits the size of a mempool to 4 GB. As previously
> discussed, the max mempool size can be increased by multiplying the index
> with a constant.
> 
> I would suggest using sizeof(uintptr_t) as the constant multiplier, so
> the mempool can hold objects of any size divisible by sizeof(uintptr_t).
> And it would be silly to use a mempool to hold objects smaller than
> sizeof(uintptr_t).
> 
> How does the performance look if you multiply the index by
> sizeof(uintptr_t)?
> 

Each mempool entry is cache aligned, so we can use that if we want a bigger
multiplier.
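
For example (sketch only, assuming a 64-byte cache line; the macro name is
illustrative):

/* Sketch: with cache-line-aligned objects, shifting by
 * log2(RTE_CACHE_LINE_SIZE) = 6 lets a 32-bit index span
 * 4 GB * 64 = 256 GB. */
#define INDEX_SHIFT 6 /* hypothetical: log2 of a 64-byte cache line */

static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
        return (void *)((uintptr_t)pool_base +
                        ((uintptr_t)idx << INDEX_SHIFT));
}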
  
Morten Brørup Jan. 7, 2022, 11:29 a.m. UTC | #3
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 12.16
> 
> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
> Friday, 24
> > > December 2021 23.59
> > >
> > > Current mempool per core cache implementation stores pointers to
> mbufs
> > > On 64b architectures, each pointer consumes 8B This patch replaces
> it
> > > with index-based implementation, where in each buffer is addressed
> by
> > > (pool base address + index) It reduces the amount of memory/cache
> > > required for per core cache
> > >
> > > L3Fwd performance testing reveals minor improvements in the cache
> > > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > > throughput
> > >
> > > Micro-benchmarking the patch using mempool_perf_test shows
> significant
> > > improvement with majority of the test cases
> > >
> > > [benchmark results, analysis, and diffstat snipped]
> >
> > I still think this is very interesting. And your performance numbers
> are
> > looking good.
> >
> > However, it limits the size of a mempool to 4 GB. As previously
> > discussed, the max mempool size can be increased by multiplying the
> index
> > with a constant.
> >
> > I would suggest using sizeof(uintptr_t) as the constant multiplier,
> so
> > the mempool can hold objects of any size divisible by
> sizeof(uintptr_t).
> > And it would be silly to use a mempool to hold objects smaller than
> > sizeof(uintptr_t).
> >
> > How does the performance look if you multiply the index by
> > sizeof(uintptr_t)?
> >
> 
> Each mempool entry is cache aligned, so we can use that if we want a
> bigger
> multiplier.

Thanks for chiming in, Bruce.

Please also read this discussion about the multiplier:
http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
  
Bruce Richardson Jan. 7, 2022, 1:50 p.m. UTC | #4
On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 7 January 2022 12.16
> > 
> > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
> > Friday, 24
> > > > December 2021 23.59
> > > >
> > > > Current mempool per core cache implementation stores pointers to
> > mbufs
> > > > On 64b architectures, each pointer consumes 8B This patch replaces
> > it
> > > > with index-based implementation, where in each buffer is addressed
> > by
> > > > (pool base address + index) It reduces the amount of memory/cache
> > > > required for per core cache
> > > >
> > > > L3Fwd performance testing reveals minor improvements in the cache
> > > > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > > > throughput
> > > >
> > > > Micro-benchmarking the patch using mempool_perf_test shows
> > significant
> > > > improvement with majority of the test cases
> > > >
> > > > [benchmark results, analysis, and diffstat snipped]
> > >
> > > I still think this is very interesting. And your performance numbers
> > are
> > > looking good.
> > >
> > > However, it limits the size of a mempool to 4 GB. As previously
> > > discussed, the max mempool size can be increased by multiplying the
> > index
> > > with a constant.
> > >
> > > I would suggest using sizeof(uintptr_t) as the constant multiplier,
> > so
> > > the mempool can hold objects of any size divisible by
> > sizeof(uintptr_t).
> > > And it would be silly to use a mempool to hold objects smaller than
> > > sizeof(uintptr_t).
> > >
> > > How does the performance look if you multiply the index by
> > > sizeof(uintptr_t)?
> > >
> > 
> > Each mempool entry is cache aligned, so we can use that if we want a
> > bigger
> > multiplier.
> 
> Thanks for chiming in, Bruce.
> 
> Please also read this discussion about the multiplier:
> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
>

I actually wondered after I had sent the email whether we had indeed an
option to disable the cache alignment or not! Thanks for pointing out that
we do. This brings a couple additional thoughts:

* Using indexes for the cache should probably be a runtime flag rather than
  a build-time one.
* It would seem reasonable to me to disallow use of the indexed-cache flag
  and the non-cache aligned flag simultaneously.
* On the off chance that that restriction is unacceptable, we can
  make things a little more complicated by doing a runtime computation of
  the "index-shiftwidth" to use.

Overall, I think defaulting to cacheline shiftwidth and disallowing
index-based addressing when using unaligned buffers is simplest and easiest,
unless we can come up with a valid use case for needing more than that.
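
For instance, the shiftwidth could be computed once at create time
(hypothetical sketch; __builtin_ctzl() is GCC/Clang):

#include <stddef.h>

/* Hypothetical: pick the largest power-of-two unit dividing the object
 * stride; an object's address is then base + ((uintptr_t)idx << shift).
 * Assumes a nonzero stride. */
static inline unsigned int
index_shiftwidth(size_t obj_stride)
{
        return (unsigned int)__builtin_ctzl(obj_stride);
}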

/Bruce
  
Morten Brørup Jan. 8, 2022, 9:37 a.m. UTC | #5
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 14.51
> 
> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 7 January 2022 12.16
> > >
> > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
> > > Friday, 24
> > > > > December 2021 23.59
> > > > >
> > > > > Current mempool per core cache implementation stores pointers
> to
> > > mbufs
> > > > > On 64b architectures, each pointer consumes 8B This patch
> replaces
> > > it
> > > > > with index-based implementation, where in each buffer is
> addressed
> > > by
> > > > > (pool base address + index) It reduces the amount of
> memory/cache
> > > > > required for per core cache
> > > > >
> > > > > L3Fwd performance testing reveals minor improvements in the
> cache
> > > > > performance (L1 and L2 misses reduced by 0.60%) with no change
> in
> > > > > throughput
> > > > >
> > > > > Micro-benchmarking the patch using mempool_perf_test shows
> > > significant
> > > > > improvement with majority of the test cases
> > > > >
> > > >
> > > > I still think this is very interesting. And your performance
> numbers
> > > are
> > > > looking good.
> > > >
> > > > However, it limits the size of a mempool to 4 GB. As previously
> > > > discussed, the max mempool size can be increased by multiplying
> the
> > > index
> > > > with a constant.
> > > >
> > > > I would suggest using sizeof(uintptr_t) as the constant
> multiplier,
> > > so
> > > > the mempool can hold objects of any size divisible by
> > > sizeof(uintptr_t).
> > > > And it would be silly to use a mempool to hold objects smaller
> than
> > > > sizeof(uintptr_t).
> > > >
> > > > How does the performance look if you multiply the index by
> > > > sizeof(uintptr_t)?
> > > >
> > >
> > > Each mempool entry is cache aligned, so we can use that if we want
> a
> > > bigger
> > > multiplier.
> >
> > Thanks for chiming in, Bruce.
> >
> > Please also read this discussion about the multiplier:
> > http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >
> 
> I actually wondered after I had sent the email whether we had indeed an
> option to disable the cache alignment or not! Thanks for pointing out
> that
> we do. This brings a couple additional thoughts:
> 
> * Using indexes for the cache should probably be a runtime flag rather
> than
>   a build-time one.
> * It would seem reasonable to me to disallow use of the indexed-cache
> flag
>   and the non-cache aligned flag simultaneously.
> * On the offchance that that restriction is unacceptable, then we can
>   make things a little more complicated by doing a runtime computation
> of
>   the "index-shiftwidth" to use.
> 
> Overall, I think defaulting to cacheline shiftwidth and disallowing
> index-based addressing when using unaligned buffers is simplest and
> easiest
> unless we can come up with a valid usecase for needing more than that.
> 
> /Bruce

This feature is a performance optimization.

With that in mind, it should not introduce function pointers or similar run-time checks in the fast path to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile time constant, probably something between sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).

The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).

With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.

If the multiplier is small (e.g. 8 bytes) the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 bytes; but I don't think anyone would use a mempool to hold objects smaller than 8 bytes.)

If the multiplier is larger (e.g. the 64-byte cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.

Note: 32-bit platforms gain no benefit from this patch: the pointer already uses only 4 bytes, so replacing the pointer with a 4-byte index makes no difference.


Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile time option.

If this feature is a compile time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so compilation of applications using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support the larger mempool.
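
Roughly like this (illustrative sketch; the RTE_MEMPOOL_INDEX_BASED_CACHE
option name and the helper itself are hypothetical, while RTE_ASSERT(),
RTE_CACHE_LINE_SIZE and the populate flag exist today):

#ifdef RTE_MEMPOOL_INDEX_BASED_CACHE
/* Hypothetical guard, e.g. called from rte_mempool_create(), where n is
 * the requested object count and total_elt_sz the per-object footprint. */
static void
mempool_index_cache_checks(unsigned int n, size_t total_elt_sz,
                           unsigned int flags)
{
        /* Index-based cache cannot coexist with unaligned objects. */
        RTE_ASSERT(!(flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ));
        /* A 32-bit index scaled by the cache line covers at most
         * 2^32 * RTE_CACHE_LINE_SIZE bytes (256 GB with 64-byte lines). */
        RTE_ASSERT((uint64_t)n * total_elt_sz <=
                   ((uint64_t)1 << 32) * RTE_CACHE_LINE_SIZE);
}
#endif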

Here is another thought: If only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use index-based cache as default, making the opposite (i.e. pointer-based cache) a compile time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.


Although DPDK is moving away from compile time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK. If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
  
Jerin Jacob Jan. 10, 2022, 6:38 a.m. UTC | #6
On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 7 January 2022 14.51
> >
> > On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Friday, 7 January 2022 12.16
> > > >
> > > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
> > > > Friday, 24
> > > > > > December 2021 23.59
> > > > > >
> > > > > > Current mempool per core cache implementation stores pointers
> > to
> > > > mbufs
> > > > > > On 64b architectures, each pointer consumes 8B This patch
> > replaces
> > > > it
> > > > > > with index-based implementation, where in each buffer is
> > addressed
> > > > by
> > > > > > (pool base address + index) It reduces the amount of
> > memory/cache
> > > > > > required for per core cache
> > > > > >
> > > > > > L3Fwd performance testing reveals minor improvements in the
> > cache
> > > > > > performance (L1 and L2 misses reduced by 0.60%) with no change
> > in
> > > > > > throughput
> > > > > >
> > > > > > Micro-benchmarking the patch using mempool_perf_test shows
> > > > significant
> > > > > > improvement with majority of the test cases
> > > > > >
> > > > >
> > > > > I still think this is very interesting. And your performance
> > numbers
> > > > are
> > > > > looking good.
> > > > >
> > > > > However, it limits the size of a mempool to 4 GB. As previously
> > > > > discussed, the max mempool size can be increased by multiplying
> > the
> > > > index
> > > > > with a constant.
> > > > >
> > > > > I would suggest using sizeof(uintptr_t) as the constant
> > multiplier,
> > > > so
> > > > > the mempool can hold objects of any size divisible by
> > > > sizeof(uintptr_t).
> > > > > And it would be silly to use a mempool to hold objects smaller
> > than
> > > > > sizeof(uintptr_t).
> > > > >
> > > > > How does the performance look if you multiply the index by
> > > > > sizeof(uintptr_t)?
> > > > >
> > > >
> > > > Each mempool entry is cache aligned, so we can use that if we want
> > a
> > > > bigger
> > > > multiplier.
> > >
> > > Thanks for chiming in, Bruce.
> > >
> > > Please also read this discussion about the multiplier:
> > > http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> > >
> >
> > I actually wondered after I had sent the email whether we had indeed an
> > option to disable the cache alignment or not! Thanks for pointing out
> > that
> > we do. This brings a couple additional thoughts:
> >
> > * Using indexes for the cache should probably be a runtime flag rather
> > than
> >   a build-time one.
> > * It would seem reasonable to me to disallow use of the indexed-cache
> > flag
> >   and the non-cache aligned flag simultaneously.
> > * On the offchance that that restriction is unacceptable, then we can
> >   make things a little more complicated by doing a runtime computation
> > of
> >   the "index-shiftwidth" to use.
> >
> > Overall, I think defaulting to cacheline shiftwidth and disallowing
> > index-based addressing when using unaligned buffers is simplest and
> > easiest
> > unless we can come up with a valid usecase for needing more than that.
> >
> > /Bruce
>
> This feature is a performance optimization.
>
> With that in mind, it should not introduce function pointers or similar run-time checks or in the fast path, to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile time constant, probably something between sizeof(uintptr_t) or RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).
>
> The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>
> With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.
>
> If the multiplier is small (i.e. 8 bytes) the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 byte; but I don't think anyone would use a mempool to hold objects smaller than 8 byte.)
>
> If the multiplier is larger (i.e. 64 bytes cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
>
> Note: 32 bit platforms have no benefit from this patch: The pointer already only uses 4 bytes, so replacing the pointer with a 4 byte index makes no difference.
>
>
> Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile time option.
>
> If this feature is a compile time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so compilation of applications using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support the larger mempool.
>
> Here is another thought: If only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use index-based cache as default, making the opposite (i.e. pointer-based cache) a compile time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.
>
>
> Although DPDK is moving away from compile time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK.

Agree with Morten's view on this.

>If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
>
  
Dharmik Thakkar Jan. 13, 2022, 5:31 a.m. UTC | #7
Hi,

Thank you for your valuable review comments and suggestions!

I will be sending out a v2 in which I have increased the size of the mempool to 32 GB by using division by sizeof(uintptr_t).
However, I am seeing ~5% performance degradation with mempool_perf_autotest (for a bulk size of 32) with this change
when compared to the base performance.
Earlier, without this change, I was seeing an improvement of ~13% compared to base performance. So, this is a significant degradation.
I would appreciate your review comments on v2.

Thank you!

> On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjacobk@gmail.com> wrote:
> 
> On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>> 
>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>> Sent: Friday, 7 January 2022 14.51
>>> 
>>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Friday, 7 January 2022 12.16
>>>>> 
>>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
>>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
>>>>> Friday, 24
>>>>>>> December 2021 23.59
>>>>>>> 
>>>>>>> Current mempool per core cache implementation stores pointers
>>> to
>>>>> mbufs
>>>>>>> On 64b architectures, each pointer consumes 8B This patch
>>> replaces
>>>>> it
>>>>>>> with index-based implementation, where in each buffer is
>>> addressed
>>>>> by
>>>>>>> (pool base address + index) It reduces the amount of
>>> memory/cache
>>>>>>> required for per core cache
>>>>>>> 
>>>>>>> L3Fwd performance testing reveals minor improvements in the
>>> cache
>>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change
>>> in
>>>>>>> throughput
>>>>>>> 
>>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
>>>>> significant
>>>>>>> improvement with majority of the test cases
>>>>>>> 
>>>>>> 
>>>>>> I still think this is very interesting. And your performance
>>> numbers
>>>>> are
>>>>>> looking good.
>>>>>> 
>>>>>> However, it limits the size of a mempool to 4 GB. As previously
>>>>>> discussed, the max mempool size can be increased by multiplying
>>> the
>>>>> index
>>>>>> with a constant.
>>>>>> 
>>>>>> I would suggest using sizeof(uintptr_t) as the constant
>>> multiplier,
>>>>> so
>>>>>> the mempool can hold objects of any size divisible by
>>>>> sizeof(uintptr_t).
>>>>>> And it would be silly to use a mempool to hold objects smaller
>>> than
>>>>>> sizeof(uintptr_t).
>>>>>> 
>>>>>> How does the performance look if you multiply the index by
>>>>>> sizeof(uintptr_t)?
>>>>>> 
>>>>> 
>>>>> Each mempool entry is cache aligned, so we can use that if we want
>>> a
>>>>> bigger
>>>>> multiplier.
>>>> 
>>>> Thanks for chiming in, Bruce.
>>>> 
>>>> Please also read this discussion about the multiplier:
>>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
>>>> 
>>> 
>>> I actually wondered after I had sent the email whether we had indeed an
>>> option to disable the cache alignment or not! Thanks for pointing out
>>> that
>>> we do. This brings a couple additional thoughts:
>>> 
>>> * Using indexes for the cache should probably be a runtime flag rather
>>> than
>>>  a build-time one.
>>> * It would seem reasonable to me to disallow use of the indexed-cache
>>> flag
>>>  and the non-cache aligned flag simultaneously.
>>> * On the offchance that that restriction is unacceptable, then we can
>>>  make things a little more complicated by doing a runtime computation
>>> of
>>>  the "index-shiftwidth" to use.
>>> 
>>> Overall, I think defaulting to cacheline shiftwidth and disallowing
>>> index-based addressing when using unaligned buffers is simplest and
>>> easiest
>>> unless we can come up with a valid usecase for needing more than that.
>>> 
>>> /Bruce
>> 
>> This feature is a performance optimization.
>> 
>> With that in mind, it should not introduce function pointers or similar run-time checks or in the fast path, to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile time constant, probably something between sizeof(uintptr_t) or RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).
>> 
>> The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>> 
>> With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.
>> 
>> If the multiplier is small (i.e. 8 bytes) the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 byte; but I don't think anyone would use a mempool to hold objects smaller than 8 byte.)
>> 
>> If the multiplier is larger (i.e. 64 bytes cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
>> 
>> Note: 32 bit platforms have no benefit from this patch: The pointer already only uses 4 bytes, so replacing the pointer with a 4 byte index makes no difference.
>> 
>> 
>> Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile time option.
>> 
>> If this feature is a compile time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so compilation of applications using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support the larger mempool.
>> 
>> Here is another thought: If only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use index-based cache as default, making the opposite (i.e. pointer-based cache) a compile time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.
>> 
>> 
>> Although DPDK is moving away from compile time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK.
> 
> Agree with Morten's view on this.
> 
>> If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
>>
  
Stephen Hemminger July 6, 2023, 5:43 p.m. UTC | #8
On Thu, 13 Jan 2022 05:31:18 +0000
Dharmik Thakkar <Dharmik.Thakkar@arm.com> wrote:

> Hi,
> 
> Thank you for your valuable review comments and suggestions!
> 
> I will be sending out a v2 in which I have increased the size of the mempool to 32GB by using division by sizeof(uintptr_t).
> However, I am seeing ~5% performance degradation with mempool_perf_autotest (for bulk size of 32) with this change
> when compared to the base performance.
> Earlier, without this change, I was seeing an improvement of ~13% compared to base performance. So, this is a significant degradation.
> I would appreciate your review comments on v2.
> 
> Thank you!
> 
> > On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjacobk@gmail.com> wrote:
> > 
> > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:  
> >>   
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Friday, 7 January 2022 14.51
> >>> 
> >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:  
> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>> Sent: Friday, 7 January 2022 12.16
> >>>>> 
> >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:  
> >>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:  
> >>>>> Friday, 24  
> >>>>>>> December 2021 23.59
> >>>>>>> 
> >>>>>>> Current mempool per core cache implementation stores pointers  
> >>> to  
> >>>>> mbufs  
> >>>>>>> On 64b architectures, each pointer consumes 8B This patch  
> >>> replaces  
> >>>>> it  
> >>>>>>> with index-based implementation, where in each buffer is  
> >>> addressed  
> >>>>> by  
> >>>>>>> (pool base address + index) It reduces the amount of  
> >>> memory/cache  
> >>>>>>> required for per core cache
> >>>>>>> 
> >>>>>>> L3Fwd performance testing reveals minor improvements in the  
> >>> cache  
> >>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change  
> >>> in  
> >>>>>>> throughput
> >>>>>>> 
> >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows  
> >>>>> significant  
> >>>>>>> improvement with majority of the test cases
> >>>>>>>   
> >>>>>> 
> >>>>>> I still think this is very interesting. And your performance  
> >>> numbers  
> >>>>> are  
> >>>>>> looking good.
> >>>>>> 
> >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> >>>>>> discussed, the max mempool size can be increased by multiplying  
> >>> the  
> >>>>> index  
> >>>>>> with a constant.
> >>>>>> 
> >>>>>> I would suggest using sizeof(uintptr_t) as the constant  
> >>> multiplier,  
> >>>>> so  
> >>>>>> the mempool can hold objects of any size divisible by  
> >>>>> sizeof(uintptr_t).  
> >>>>>> And it would be silly to use a mempool to hold objects smaller  
> >>> than  
> >>>>>> sizeof(uintptr_t).
> >>>>>> 
> >>>>>> How does the performance look if you multiply the index by
> >>>>>> sizeof(uintptr_t)?
> >>>>>>   
> >>>>> 
> >>>>> Each mempool entry is cache aligned, so we can use that if we want  
> >>> a  
> >>>>> bigger
> >>>>> multiplier.  
> >>>> 
> >>>> Thanks for chiming in, Bruce.
> >>>> 
> >>>> Please also read this discussion about the multiplier:
> >>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >>>>   
> >>> 
> >>> I actually wondered after I had sent the email whether we had indeed an
> >>> option to disable the cache alignment or not! Thanks for pointing out
> >>> that
> >>> we do. This brings a couple additional thoughts:
> >>> 
> >>> * Using indexes for the cache should probably be a runtime flag rather
> >>> than
> >>>  a build-time one.
> >>> * It would seem reasonable to me to disallow use of the indexed-cache
> >>> flag
> >>>  and the non-cache aligned flag simultaneously.
> >>> * On the offchance that that restriction is unacceptable, then we can
> >>>  make things a little more complicated by doing a runtime computation
> >>> of
> >>>  the "index-shiftwidth" to use.
> >>> 
> >>> Overall, I think defaulting to cacheline shiftwidth and disallowing
> >>> index-based addressing when using unaligned buffers is simplest and
> >>> easiest
> >>> unless we can come up with a valid usecase for needing more than that.
> >>> 
> >>> /Bruce  
> >> 
> >> This feature is a performance optimization.
> >> 
> >> With that in mind, it should not introduce function pointers or similar run-time checks or in the fast path, to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile time constant, probably something between sizeof(uintptr_t) or RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).
> >> 
> >> The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache line aligned to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> >> 
> >> With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.
> >> 
> >> If the multiplier is small (i.e. 8 bytes) the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 byte; but I don't think anyone would use a mempool to hold objects smaller than 8 byte.)
> >> 
> >> If the multiplier is larger (i.e. 64 bytes cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> >> 
> >> Note: 32 bit platforms have no benefit from this patch: The pointer already only uses 4 bytes, so replacing the pointer with a 4 byte index makes no difference.
> >> 
> >> 
> >> Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile time option.
> >> 
> >> If this feature is a compile time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so compilation of applications using the flag will fail. And rte_mempool_create() could RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support the larger mempool.
> >> 
> >> Here is another thought: If only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use index-based cache as default, making the opposite (i.e. pointer-based cache) a compile time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.
> >> 
> >> 
> >> Although DPDK is moving away from compile time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK.  
> > 
> > Agree with Morten's view on this.
> >   
> >> If anyone disagrees with the principle about a general exception for performance and memory optimizations, I would like to pass on the decision to the Techboard!
> >>   

NAK
Having compile-time stuff like this means one side or the other is not tested
by the CI infrastructure. There never was sufficient justification, and there were lots of objections.
Dropping the patch.