[RFC,1/2] config: add optimal burst size configuration

Message ID 20251126082414.91933-1-pbhagavatula@marvell.com (mailing list archive)
State New
Delegated to: Thomas Monjalon

Checks

Context        Check    Description
ci/checkpatch  success  coding style OK

Commit Message

Pavan Nikhilesh Bhagavatula Nov. 26, 2025, 8:24 a.m. UTC
From: Pavan Nikhilesh <pbhagavatula@marvell.com>

Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
optimal burst size.

Set default value to 64 for soc_cn10k and 32 generally.

Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
This improves performance by 5% on l2fwd; other examples showed
negligible difference on CN10K.

 config/arm/meson.build | 1 +
 config/meson.build     | 1 +
 2 files changed, 2 insertions(+)

--
2.50.1 (Apple Git-155)
  

Comments

Morten Brørup Nov. 26, 2025, 9:57 a.m. UTC | #1
> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
> 
> Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
> optimal burst size.
> 
> Set default value to 64 for soc_cn10k and 32 generally.
> 
> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
> This improves performance by 5% on l2fwd, other examples showed
> negligible difference on CN10K.
>

I support the concept of having a recommended mbuf burst size, targeting the majority of generic applications.
Making it CPU dependent seems like a good choice.

It should be named differently.
First of all, "optimal" depends on the use case; if targeting low latency, shorter bursts are better, so "OPTIMAL" should not be part of the name.
Second, I would guess that it only targets mbuf bursts, not also bursts of other operations (e.g. hash lookups), so "MBUF" should be part of the name.

Suggestion:
/* Recommended burst size for generic applications, striking a balance between throughput and latency. */
dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)

<feature creep>
/* Recommended burst size for generic applications targeting low latency. */
dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
</feature creep>

Having these standardized will also allow libraries and drivers to optimize for them, e.g. drivers should support burst sizes all the way down to RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that RTE_MBUF_BURST_SIZE_MIN is not lower than what the driver/hardware supports.
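
A minimal sketch of the static_assert() idea, for illustration only. RTE_MBUF_BURST_SIZE_MIN is the name proposed in this thread and is not an existing DPDK macro; EXAMPLE_DRV_MIN_BURST and example_drv_align_burst() are hypothetical names standing in for a driver's own limit and burst handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: proposed in this thread, not an existing DPDK macro. */
#ifndef RTE_MBUF_BURST_SIZE_MIN
#define RTE_MBUF_BURST_SIZE_MIN 4
#endif

/* Hypothetical: smallest burst this example driver's Rx path can handle. */
#define EXAMPLE_DRV_MIN_BURST 4

/* Fail the build if the configured minimum is below what the driver supports. */
static_assert(RTE_MBUF_BURST_SIZE_MIN >= EXAMPLE_DRV_MIN_BURST,
	      "RTE_MBUF_BURST_SIZE_MIN is lower than the driver supports");

/* Round a requested burst down to a multiple of the driver minimum. */
static inline uint16_t
example_drv_align_burst(uint16_t nb_pkts)
{
	return (uint16_t)(nb_pkts - nb_pkts % EXAMPLE_DRV_MIN_BURST);
}
```

With such a check in place, a mismatch between the build configuration and the driver shows up at compile time rather than as a silent slow path at run time.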

<more feature creep>
rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX", for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low latency applications.
This will let the libraries and drivers optimize for the specific burst size used by the application.
</more feature creep>
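
As a rough sketch of how the single-macro idea could look in a header: all three macro names are the ones proposed in this thread (none exist in DPDK today), the values 64 and 4 are the ones suggested above, and struct example_burst is a hypothetical application type.

```c
#include <assert.h>

/* Hypothetical values; in the proposal they would come from the build config. */
#ifndef RTE_MBUF_BURST_SIZE_MAX
#define RTE_MBUF_BURST_SIZE_MAX 64
#endif
#ifndef RTE_MBUF_BURST_SIZE_MIN
#define RTE_MBUF_BURST_SIZE_MIN 4
#endif

/* Default to the throughput-oriented size; a low-latency build overrides it. */
#ifndef RTE_MBUF_BURST_SIZE
#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX
#endif

/* Reject a build where the chosen size falls outside the recommended range. */
static_assert(RTE_MBUF_BURST_SIZE >= RTE_MBUF_BURST_SIZE_MIN &&
	      RTE_MBUF_BURST_SIZE <= RTE_MBUF_BURST_SIZE_MAX,
	      "burst size outside the recommended range");

/* An application would size its per-burst arrays from the single macro. */
struct example_burst {
	void *pkts[RTE_MBUF_BURST_SIZE];
	unsigned int count;
};
```

Because the size is a compile-time constant, libraries and drivers could unroll or specialize their loops for it, which is the optimization opportunity described above.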

<rambling>
Intuitively, I would assume that the optimal burst size essentially depends on the CPU's L1D cache size and the application's number of non-mbuf cache lines accessed per burst.
Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst touches 4 cache lines per packet:
2 cache lines for the mbuf
1 cache line for the packet data
1 cache line per packet for some table lookup/forwarding entry

Then the mbuf burst should be max 512/4 = 128.
But local variables also use memory during processing, so using a burst of 64 would leave room for that and some more.
</rambling>
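
The arithmetic above can be written out as a small helper. This is only a sketch of the back-of-the-envelope estimate, not a tuning method; the 64-byte cache line is an assumption implied by "32 KiB = 512 cache lines", and example_max_burst() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size: 32 KiB / 512 lines = 64 bytes. */
#define EXAMPLE_CACHE_LINE_SIZE 64u

/*
 * Largest burst whose per-packet cache lines all fit in L1D at once.
 * Returns 0 if lines_per_pkt is 0.
 */
static inline unsigned int
example_max_burst(size_t l1d_bytes, unsigned int lines_per_pkt)
{
	size_t lines = l1d_bytes / EXAMPLE_CACHE_LINE_SIZE;

	if (lines_per_pkt == 0)
		return 0;
	return (unsigned int)(lines / lines_per_pkt);
}
```

For the numbers in the example, example_max_burst(32 * 1024, 4) gives 128; picking 64 then leaves half of the L1D for local variables and other data.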

>  config/arm/meson.build | 1 +
>  config/meson.build     | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 523b0fc0ed50..fa64c07016b1 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -481,6 +481,7 @@ soc_cn10k = {
>          ['RTE_MAX_LCORE', 24],
>          ['RTE_MAX_NUMA_NODES', 1],
>          ['RTE_MEMPOOL_ALIGN', 128],
> +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>      ],
>      'part_number': '0xd49',
>      'extra_march_features': ['crypto'],
> diff --git a/config/meson.build b/config/meson.build
> index 0cb074ab95b7..95367ae88e2d 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>  endif
>  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
> +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
> 
>  compile_time_cpuflags = []
>  subdir(arch_subdir)
> --
> 2.50.1 (Apple Git-155)
  
Pavan Nikhilesh Bhagavatula Nov. 26, 2025, 10:58 a.m. UTC | #2
>> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>
>> Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
>> optimal burst size.
>>
>> Set default value to 64 for soc_cn10k and 32 generally.
>>
>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
>> ---
>> This improves performance by 5% on l2fwd, other examples showed
>> negligible difference on CN10K.
>>
>
>I support the concept of having a recommended mbuf burst size, targeting the majority of generic applications.
>Making it CPU dependent seems like a good choice.
>
>It should be named differently.
>First of all, "optimal" depends on the use case; if targeting low latency, shorter bursts are better, so "OPTIMAL" should not be part of the name.
>Second, I would guess that it only targets mbuf bursts, not also bursts of other operations (e.g. hash lookups), so "MBUF" should be part of the name.
>
>Suggestion:
>/* Recommended burst size for generic applications, striking a balance between throughput and latency. */
>dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)
>

Agreed, would the 

><feature creep>
>/* Recommended burst size for generic applications targeting low latency. */
>dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
></feature creep>
>
>Having these standardized will also allow libraries and drivers to optimize for them, e.g. drivers should support burst sizes all the way down to RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that RTE_MBUF_BURST_SIZE_MIN is not lower than what the driver/hardware supports.
>
><more feature creep>
>rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX", for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low latency applications.
>This will let the libraries and drivers optimize for the specific burst size used by the application.
></more feature creep>
>
><rambling>
>Intuitively, I would assume that the optimal burst size essentially depends on the CPU's L1D cache size and the application's number of non-mbuf cache lines accessed per burst.
>Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst touches 4 cache lines per packet:
>2 cache lines for the mbuf
>1 cache line for the packet data
>1 cache line per packet for some table lookup/forwarding entry
>
>Then the mbuf burst should be max 512/4 = 128.
>But local variables also use memory during processing, so using a burst of 64 would leave room for that and some more.
></rambling>
>
>>  config/arm/meson.build | 1 +
>>  config/meson.build     | 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/config/arm/meson.build b/config/arm/meson.build
>> index 523b0fc0ed50..fa64c07016b1 100644
>> --- a/config/arm/meson.build
>> +++ b/config/arm/meson.build
>> @@ -481,6 +481,7 @@ soc_cn10k = {
>>          ['RTE_MAX_LCORE', 24],
>>          ['RTE_MAX_NUMA_NODES', 1],
>>          ['RTE_MEMPOOL_ALIGN', 128],
>> +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>>      ],
>>      'part_number': '0xd49',
>>      'extra_march_features': ['crypto'],
>> diff --git a/config/meson.build b/config/meson.build
>> index 0cb074ab95b7..95367ae88e2d 100644
>> --- a/config/meson.build
>> +++ b/config/meson.build
>> @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>>      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>>  endif
>>  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
>> +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
>>
>>  compile_time_cpuflags = []
>>  subdir(arch_subdir)
>> --
>> 2.50.1 (Apple Git-155)
  
Pavan Nikhilesh Bhagavatula Nov. 26, 2025, 11 a.m. UTC | #3
>> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>
>> Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
>> optimal burst size.
>>
>> Set default value to 64 for soc_cn10k and 32 generally.
>>
>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
>> ---
>> This improves performance by 5% on l2fwd, other examples showed
>> negligible difference on CN10K.
>>
>
>I support the concept of having a recommended mbuf burst size, targeting the majority of generic applications.
>Making it CPU dependent seems like a good choice.
>
>It should be named differently.
>First of all, "optimal" depends on the use case; if targeting low latency, shorter bursts are better, so "OPTIMAL" should not be part of the name.
>Second, I would guess that it only targets mbuf bursts, not also bursts of other operations (e.g. hash lookups), so "MBUF" should be part of the name.
>
>Suggestion:
>/* Recommended burst size for generic applications, striking a balance between throughput and latency. */
>dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)
>

Agreed. Would the comment be enough to say that it is a recommendation and not an enforcement, or should that be added to the macro name?
I am sceptical of changing the burst size to 64, since most applications _today_ use 32; it might cause unintended regressions.

RTE_MBUF_BURST_SIZE_(REC)_PERF?

><feature creep>
>/* Recommended burst size for generic applications targeting low latency. */
>dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
></feature creep>

RTE_MBUF_BURST_SIZE_(REC)_LAT?

(I am bad at names)
>
>Having these standardized will also allow libraries and drivers to optimize for them, e.g. drivers should support burst sizes all the way down to RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that RTE_MBUF_BURST_SIZE_MIN is not lower than what the driver/hardware supports.
>
><more feature creep>
>rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX", for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low latency applications.
>This will let the libraries and drivers optimize for the specific burst size used by the application.
></more feature creep>

This is fine with me; we can wrap it in a meson option to avoid manually changing rte_config.h.

>
><rambling>
>Intuitively, I would assume that the optimal burst size essentially depends on the CPU's L1D cache size and the application's number of non-mbuf cache lines accessed per burst.
>Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst touches 4 cache lines per packet:
>2 cache lines for the mbuf
>1 cache line for the packet data
>1 cache line per packet for some table lookup/forwarding entry
>
>Then the mbuf burst should be max 512/4 = 128.
>But local variables also use memory during processing, so using a burst of 64 would leave room for that and some more.
></rambling>

We could probably read `/sys/devices/system/cpu/cpu0/cache/index0/size` in meson and calculate the number of cache lines and the burst size, but I don't think it's
that simple. For example, CN10K has a 64 KiB L1D cache, and any burst size above 64 causes a performance loss.
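
For reference, a sketch of what reading that sysfs attribute involves, written in C rather than meson. The helper names are hypothetical. The kernel reports the size as a string such as "64K"; index0 is commonly the L1 data cache, but that is not guaranteed, and the `level` and `type` files in the same directory say which cache an index describes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a sysfs cache size string such as "64K" into bytes; 0 on error. */
static inline unsigned long
example_parse_cache_size(const char *s)
{
	char *end;
	unsigned long v = strtoul(s, &end, 10);

	if (end == s)
		return 0;
	if (*end == 'K')
		return v * 1024;
	if (*end == 'M')
		return v * 1024 * 1024;
	return v;
}

/* Read the first line of a sysfs attribute and parse it; 0 if unreadable. */
static inline unsigned long
example_read_cache_size(const char *path)
{
	char buf[32];
	unsigned long bytes = 0;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return 0;
	if (fgets(buf, sizeof(buf), f) != NULL)
		bytes = example_parse_cache_size(buf);
	fclose(f);
	return bytes;
}
```

Under the same assumptions as the quoted estimate (64-byte lines, 4 lines per packet), a 64 KiB cache would suggest 1024 / 4 = 256, well above the limit of 64 observed on CN10K, so the cache size alone does not predict the best burst size. A build-time probe would also describe the build machine rather than the target, which matters for cross builds.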

Thanks,
Pavan

>
>>  config/arm/meson.build | 1 +
>>  config/meson.build     | 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/config/arm/meson.build b/config/arm/meson.build
>> index 523b0fc0ed50..fa64c07016b1 100644
>> --- a/config/arm/meson.build
>> +++ b/config/arm/meson.build
>> @@ -481,6 +481,7 @@ soc_cn10k = {
>>          ['RTE_MAX_LCORE', 24],
>>          ['RTE_MAX_NUMA_NODES', 1],
>>          ['RTE_MEMPOOL_ALIGN', 128],
>> +        ['RTE_OPTIMAL_BURST_SIZE', 64],
>>      ],
>>      'part_number': '0xd49',
>>      'extra_march_features': ['crypto'],
>> diff --git a/config/meson.build b/config/meson.build
>> index 0cb074ab95b7..95367ae88e2d 100644
>> --- a/config/meson.build
>> +++ b/config/meson.build
>> @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
>>      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
>>  endif
>>  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
>> +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
>>
>>  compile_time_cpuflags = []
>>  subdir(arch_subdir)
>> --
>> 2.50.1 (Apple Git-155)
  
Stephen Hemminger Nov. 27, 2025, 10:01 p.m. UTC | #4
On Wed, 26 Nov 2025 10:57:13 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > 
> > Add RTE_OPTIMAL_BURST_SIZE to allow platforms to configure the
> > optimal burst size.
> > 
> > Set default value to 64 for soc_cn10k and 32 generally.
> > 
> > Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> > ---
> > This improves performance by 5% on l2fwd, other examples showed
> > negligible difference on CN10K.
> >  
> 
> I support the concept of having a recommended mbuf burst size, targeting the majority of generic applications.
> Making it CPU dependent seems like a good choice.
> 
> It should be named differently.
> First of all, "optimal" depends on the use case; if targeting low latency, shorter bursts are better, so "OPTIMAL" should not be part of the name.
> Second, I would guess that it only targets mbuf bursts, not also bursts of other operations (e.g. hash lookups), so "MBUF" should be part of the name.
> 
> Suggestion:
> /* Recommended burst size for generic applications, striking a balance between throughput and latency. */
> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MAX' (or _DEFAULT), 64)
> 
> <feature creep>
> /* Recommended burst size for generic applications targeting low latency. */
> dpdk_conf.set('RTE_MBUF_BURST_SIZE_MIN', 4)
> </feature creep>
> 
> Having these standardized will also allow libraries and drivers to optimize for them, e.g. drivers should support burst sizes all the way down to RTE_MBUF_BURST_SIZE_MIN, and can static_assert() that RTE_MBUF_BURST_SIZE_MIN is not lower than what the driver/hardware supports.
> 
> <more feature creep>
> rte_config.h could have "#define RTE_MBUF_BURST_SIZE RTE_MBUF_BURST_SIZE_MAX", for the application developer to change to RTE_MBUF_BURST_SIZE_MIN for low latency applications.
> This will let the libraries and drivers optimize for the specific burst size used by the application.
> </more feature creep>
> 
> <rambling>
> Intuitively, I would assume that the optimal burst size essentially depends on the CPU's L1D cache size and the application's number of non-mbuf cache lines accessed per burst.
> Let's say a CPU core has 32 KiB cache (= 512 cache lines), and each burst touches 4 cache lines per packet:
> 2 cache lines for the mbuf
> 1 cache line for the packet data
> 1 cache line per packet for some table lookup/forwarding entry
> 
> Then the mbuf burst should be max 512/4 = 128.
> But local variables also use memory during processing, so using a burst of 64 would leave room for that and some more.
> </rambling>
> 
> >  config/arm/meson.build | 1 +
> >  config/meson.build     | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > index 523b0fc0ed50..fa64c07016b1 100644
> > --- a/config/arm/meson.build
> > +++ b/config/arm/meson.build
> > @@ -481,6 +481,7 @@ soc_cn10k = {
> >          ['RTE_MAX_LCORE', 24],
> >          ['RTE_MAX_NUMA_NODES', 1],
> >          ['RTE_MEMPOOL_ALIGN', 128],
> > +        ['RTE_OPTIMAL_BURST_SIZE', 64],
> >      ],
> >      'part_number': '0xd49',
> >      'extra_march_features': ['crypto'],
> > diff --git a/config/meson.build b/config/meson.build
> > index 0cb074ab95b7..95367ae88e2d 100644
> > --- a/config/meson.build
> > +++ b/config/meson.build
> > @@ -386,6 +386,7 @@ if get_option('mbuf_refcnt_atomic')
> >      dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
> >  endif
> >  dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
> > +dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)
> > 
> >  compile_time_cpuflags = []
> >  subdir(arch_subdir)
> > --
> > 2.50.1 (Apple Git-155)  

I understand the motivation, and it makes sense for a pure embedded system.
But then again, on an embedded system the application can just set its own burst size;
this config option only impacts the performance of testpmd and the examples. And the
performance of testpmd is mostly irrelevant; what matters is the real application.

Making it a DPDK config option is a problem for DPDK builds in distros.
The optimal burst size would also be driver dependent, among other things.

Perhaps this would be better off in the existing Rx/Tx descriptor hints.
Most of those device configs really need to be revisited,
since they were inherited from how old Intel drivers worked.
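
One reading of this suggestion, sketched with stand-in types so that it compiles without DPDK: assuming the hints meant here are the per-port defaults a driver reports in struct rte_eth_dev_info (default_rxportconf.burst_size and default_txportconf.burst_size), an application could prefer the driver's value and fall back to its own default. The struct, the fallback of 32, and the "0 means no preference" convention below are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for the burst size hint a driver reports per port.
 * Assumption: 0 means the driver expressed no preference.
 */
struct example_portconf {
	uint16_t burst_size;
};

/* Assumed application default, matching the 32 most applications use today. */
#define EXAMPLE_FALLBACK_BURST 32

/* Prefer the driver's hint; otherwise use the application default. */
static inline uint16_t
example_pick_burst(const struct example_portconf *hint)
{
	return hint->burst_size != 0 ? hint->burst_size : EXAMPLE_FALLBACK_BURST;
}
```

Choosing the value at run time per device keeps a single distro build usable across different hardware, which is the concern raised about a build-time option.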
  

Patch

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 523b0fc0ed50..fa64c07016b1 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -481,6 +481,7 @@  soc_cn10k = {
         ['RTE_MAX_LCORE', 24],
         ['RTE_MAX_NUMA_NODES', 1],
         ['RTE_MEMPOOL_ALIGN', 128],
+        ['RTE_OPTIMAL_BURST_SIZE', 64],
     ],
     'part_number': '0xd49',
     'extra_march_features': ['crypto'],
diff --git a/config/meson.build b/config/meson.build
index 0cb074ab95b7..95367ae88e2d 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -386,6 +386,7 @@  if get_option('mbuf_refcnt_atomic')
     dpdk_conf.set('RTE_MBUF_REFCNT_ATOMIC', true)
 endif
 dpdk_conf.set10('RTE_IOVA_IN_MBUF', get_option('enable_iova_as_pa'))
+dpdk_conf.set('RTE_OPTIMAL_BURST_SIZE', 32)

 compile_time_cpuflags = []
 subdir(arch_subdir)