[dpdk-dev,1/2] eal: add macro to mark variable mostly read only

Message ID 20180418153035.5972-1-pbhagavatula@caviumnetworks.com (mailing list archive)
State Not Applicable, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Pavan Nikhilesh April 18, 2018, 3:30 p.m. UTC
  Add macro to mark a variable to be mostly read only and place it in a
separate section.

Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
---

 Group together mostly read only data to avoid cacheline bouncing, also
 useful for auditing purposes.

 lib/librte_eal/common/include/rte_common.h | 5 +++++
 1 file changed, 5 insertions(+)

--
2.17.0
  

Comments

Ferruh Yigit April 18, 2018, 5:43 p.m. UTC | #1
On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> Add macro to mark a variable to be mostly read only and place it in a
> separate section.
> 
> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> ---
> 
>  Group together mostly read only data to avoid cacheline bouncing, also
>  useful for auditing purposes.
> 
>  lib/librte_eal/common/include/rte_common.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> index 6c5bc5a76..f2ff2e9e6 100644
> --- a/lib/librte_eal/common/include/rte_common.h
> +++ b/lib/librte_eal/common/include/rte_common.h
> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
>   */
>  #define __rte_noinline  __attribute__((noinline))
> 
> +/**
> + * Mark a variable to be mostly read only and place it in a separate section.
> + */
> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))

Hi Pavan,

Is the section ".read_mostly" treated specially [1] or is this just for grouping
symbols together (to reduce cacheline bouncing)?

[1]
If this is special section, can you please point counter part in the kernel?


> +
>  /*********** Macros for pointer arithmetic ********/
> 
>  /**
> --
> 2.17.0
>
  
Pavan Nikhilesh April 18, 2018, 5:55 p.m. UTC | #2
On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> > Add macro to mark a variable to be mostly read only and place it in a
> > separate section.
> >
> > Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> > ---
> >
> >  Group together mostly read only data to avoid cacheline bouncing, also
> >  useful for auditing purposes.
> >
> >  lib/librte_eal/common/include/rte_common.h | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> > index 6c5bc5a76..f2ff2e9e6 100644
> > --- a/lib/librte_eal/common/include/rte_common.h
> > +++ b/lib/librte_eal/common/include/rte_common.h
> > @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> >   */
> >  #define __rte_noinline  __attribute__((noinline))
> >
> > +/**
> > + * Mark a variable to be mostly read only and place it in a separate section.
> > + */
> > +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
>

Hi Ferruh,

> Hi Pavan,
>
> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> symbols together (to reduce cacheline bouncing)?

The section .read_mostly is not treated specially it's just for grouping
symbols.

>
> [1]
> If this is special section, can you please point counter part in the kernel?

The kernel has something similar[1] but they have a custom linker script to
arrange symbols.

[1] https://github.com/torvalds/linux/blob/a27fc14219f2e3c4a46ba9177b04d9b52c875532/arch/x86/include/asm/cache.h#L11
kernel commit id 54cb27a71f51d304342c79e62fd7667f2171062b

>
>
> > +
> >  /*********** Macros for pointer arithmetic ********/
> >
> >  /**
> > --
> > 2.17.0
> >
>
  
Ferruh Yigit April 18, 2018, 6:03 p.m. UTC | #3
On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
>> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
>>> Add macro to mark a variable to be mostly read only and place it in a
>>> separate section.
>>>
>>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
>>> ---
>>>
>>>  Group together mostly read only data to avoid cacheline bouncing, also
>>>  useful for auditing purposes.
>>>
>>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
>>> index 6c5bc5a76..f2ff2e9e6 100644
>>> --- a/lib/librte_eal/common/include/rte_common.h
>>> +++ b/lib/librte_eal/common/include/rte_common.h
>>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
>>>   */
>>>  #define __rte_noinline  __attribute__((noinline))
>>>
>>> +/**
>>> + * Mark a variable to be mostly read only and place it in a separate section.
>>> + */
>>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
>>
> 
> Hi Ferruh,
> 
>> Hi Pavan,
>>
>> Is the section ".read_mostly" treated specially [1] or is this just for grouping
>> symbols together (to reduce cacheline bouncing)?
> 
> The section .read_mostly is not treated specially it's just for grouping
> symbols.

I have encounter with a blog post claiming this is not working:

"
The problem with the above approach is that once all the __read_mostly variables
are grouped into one section, the remaining "non-read-mostly" variables end-up
together too. This increases the chances that two frequently used elements (in
the "non-read-mostly" region) will end-up competing for the same position (or
cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
cache. Thus frequent accesses will cause excessive cache thrashing on that
particular cache-line thereby degrading the overall system performance.
"

https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html

> 
>>
>> [1]
>> If this is special section, can you please point counter part in the kernel?
> 
> The kernel has something similar[1] but they have a custom linker script to
> arrange symbols.
> 
> [1] https://github.com/torvalds/linux/blob/a27fc14219f2e3c4a46ba9177b04d9b52c875532/arch/x86/include/asm/cache.h#L11
> kernel commit id 54cb27a71f51d304342c79e62fd7667f2171062b
> 
>>
>>
>>> +
>>>  /*********** Macros for pointer arithmetic ********/
>>>
>>>  /**
>>> --
>>> 2.17.0
>>>
>>
  
Pavan Nikhilesh April 19, 2018, 9:20 a.m. UTC | #4
On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote:
> On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> >>> Add macro to mark a variable to be mostly read only and place it in a
> >>> separate section.
> >>>
> >>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> >>> ---
> >>>
> >>>  Group together mostly read only data to avoid cacheline bouncing, also
> >>>  useful for auditing purposes.
> >>>
> >>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> >>> index 6c5bc5a76..f2ff2e9e6 100644
> >>> --- a/lib/librte_eal/common/include/rte_common.h
> >>> +++ b/lib/librte_eal/common/include/rte_common.h
> >>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> >>>   */
> >>>  #define __rte_noinline  __attribute__((noinline))
> >>>
> >>> +/**
> >>> + * Mark a variable to be mostly read only and place it in a separate section.
> >>> + */
> >>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
> >>
> >
> > Hi Ferruh,
> >
> >> Hi Pavan,
> >>
> >> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> >> symbols together (to reduce cacheline bouncing)?
> >
> > The section .read_mostly is not treated specially it's just for grouping
> > symbols.
>
> I have encounter with a blog post claiming this is not working:
>
> "
> The problem with the above approach is that once all the __read_mostly variables
> are grouped into one section, the remaining "non-read-mostly" variables end-up
> together too. This increases the chances that two frequently used elements (in
> the "non-read-mostly" region) will end-up competing for the same position (or
> cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
> cache. Thus frequent accesses will cause excessive cache thrashing on that
> particular cache-line thereby degrading the overall system performance.
> "
>
> https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html
>

The author is concerned about processors with less cache set-associativity,
almost all modern processors have >= 16 way set associativity. And the above
issue can happen even now when two frequently written global variables are
placed next to each other.

Currently, we don't have much control over how the global variables are
arranged and a single addition/deletion to the global variables causes change
in alignment and in some cases minor performance regression.
Tagging them as __read_mostly we can easily identify the alignment changes
across builds by comparing map files global variable section.

I have verified the patch-set on arm64 (16-way set-associative) and didn't
notice any performance regression.
Did you have a chance to verify if there is any performance regression?

> >
> >>
> >> [1]
> >> If this is special section, can you please point counter part in the kernel?
> >
> > The kernel has something similar[1] but they have a custom linker script to
> > arrange symbols.
> >
> > [1] https://github.com/torvalds/linux/blob/a27fc14219f2e3c4a46ba9177b04d9b52c875532/arch/x86/include/asm/cache.h#L11
> > kernel commit id 54cb27a71f51d304342c79e62fd7667f2171062b
> >
> >>
> >>
> >>> +
> >>>  /*********** Macros for pointer arithmetic ********/
> >>>
> >>>  /**
> >>> --
> >>> 2.17.0
> >>>
> >>
>
  
Bruce Richardson April 19, 2018, 12:09 p.m. UTC | #5
On Thu, Apr 19, 2018 at 02:50:52PM +0530, Pavan Nikhilesh wrote:
> On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote:
> > On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> > > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> > >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> > >>> Add macro to mark a variable to be mostly read only and place it in a
> > >>> separate section.
> > >>>
> > >>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> > >>> ---
> > >>>
> > >>>  Group together mostly read only data to avoid cacheline bouncing, also
> > >>>  useful for auditing purposes.
> > >>>
> > >>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
> > >>>  1 file changed, 5 insertions(+)
> > >>>
> > >>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> > >>> index 6c5bc5a76..f2ff2e9e6 100644
> > >>> --- a/lib/librte_eal/common/include/rte_common.h
> > >>> +++ b/lib/librte_eal/common/include/rte_common.h
> > >>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> > >>>   */
> > >>>  #define __rte_noinline  __attribute__((noinline))
> > >>>
> > >>> +/**
> > >>> + * Mark a variable to be mostly read only and place it in a separate section.
> > >>> + */
> > >>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
> > >>
> > >
> > > Hi Ferruh,
> > >
> > >> Hi Pavan,
> > >>
> > >> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> > >> symbols together (to reduce cacheline bouncing)?
> > >
> > > The section .read_mostly is not treated specially it's just for grouping
> > > symbols.
> >
> > I have encounter with a blog post claiming this is not working:
> >
> > "
> > The problem with the above approach is that once all the __read_mostly variables
> > are grouped into one section, the remaining "non-read-mostly" variables end-up
> > together too. This increases the chances that two frequently used elements (in
> > the "non-read-mostly" region) will end-up competing for the same position (or
> > cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
> > cache. Thus frequent accesses will cause excessive cache thrashing on that
> > particular cache-line thereby degrading the overall system performance.
> > "
> >
> > https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html
> >
> 
> The author is concerned about processors with less cache set-associativity,
> almost all modern processors have >= 16 way set associativity. And the above
> issue can happen even now when two frequently written global variables are
> placed next to each other.
> 
> Currently, we don't have much control over how the global variables are
> arranged and a single addition/deletion to the global variables causes change
> in alignment and in some cases minor performance regression.
> Tagging them as __read_mostly we can easily identify the alignment changes
> across builds by comparing map files global variable section.
> 
> I have verified the patch-set on arm64 (16-way set-associative) and didn't
> notice any performance regression.
> Did you have a chance to verify if there is any performance regression?
> 
Is there a performance improvement? It's seems a relatively strange change
to me, so I'd like to know that it really improves performance in test
cases.

/Bruce
  
Pavan Nikhilesh April 19, 2018, 3:18 p.m. UTC | #6
On Thu, Apr 19, 2018 at 01:09:58PM +0100, Bruce Richardson wrote:
> On Thu, Apr 19, 2018 at 02:50:52PM +0530, Pavan Nikhilesh wrote:
> > On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote:
> > > On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> > > > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> > > >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> > > >>> Add macro to mark a variable to be mostly read only and place it in a
> > > >>> separate section.
> > > >>>
> > > >>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> > > >>> ---
> > > >>>
> > > >>>  Group together mostly read only data to avoid cacheline bouncing, also
> > > >>>  useful for auditing purposes.
> > > >>>
> > > >>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
> > > >>>  1 file changed, 5 insertions(+)
> > > >>>
> > > >>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> > > >>> index 6c5bc5a76..f2ff2e9e6 100644
> > > >>> --- a/lib/librte_eal/common/include/rte_common.h
> > > >>> +++ b/lib/librte_eal/common/include/rte_common.h
> > > >>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> > > >>>   */
> > > >>>  #define __rte_noinline  __attribute__((noinline))
> > > >>>
> > > >>> +/**
> > > >>> + * Mark a variable to be mostly read only and place it in a separate section.
> > > >>> + */
> > > >>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
> > > >>
> > > >
> > > > Hi Ferruh,
> > > >
> > > >> Hi Pavan,
> > > >>
> > > >> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> > > >> symbols together (to reduce cacheline bouncing)?
> > > >
> > > > The section .read_mostly is not treated specially it's just for grouping
> > > > symbols.
> > >
> > > I have encounter with a blog post claiming this is not working:
> > >
> > > "
> > > The problem with the above approach is that once all the __read_mostly variables
> > > are grouped into one section, the remaining "non-read-mostly" variables end-up
> > > together too. This increases the chances that two frequently used elements (in
> > > the "non-read-mostly" region) will end-up competing for the same position (or
> > > cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
> > > cache. Thus frequent accesses will cause excessive cache thrashing on that
> > > particular cache-line thereby degrading the overall system performance.
> > > "
> > >
> > > https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html
> > >
> >
> > The author is concerned about processors with less cache set-associativity,
> > almost all modern processors have >= 16 way set associativity. And the above
> > issue can happen even now when two frequently written global variables are
> > placed next to each other.
> >
> > Currently, we don't have much control over how the global variables are
> > arranged and a single addition/deletion to the global variables causes change
> > in alignment and in some cases minor performance regression.
> > Tagging them as __read_mostly we can easily identify the alignment changes
> > across builds by comparing map files global variable section.
> >
> > I have verified the patch-set on arm64 (16-way set-associative) and didn't
> > notice any performance regression.
> > Did you have a chance to verify if there is any performance regression?
> >
> Is there a performance improvement? It's seems a relatively strange change
> to me, so I'd like to know that it really improves performance in test
> cases.

We had a performance regression of ~200k between 17.11 and 18.02 due enabling
dpaa/dpaa2 in default config this was due to new global variables being added
and changing the alignment.
Moving read mostly global variables (logtypes/device arrays) to a separate
section helps when tracking performance regression between builds.

>
> /Bruce
  
Bruce Richardson April 19, 2018, 3:37 p.m. UTC | #7
On Thu, Apr 19, 2018 at 08:48:33PM +0530, Pavan Nikhilesh wrote:
> On Thu, Apr 19, 2018 at 01:09:58PM +0100, Bruce Richardson wrote:
> > On Thu, Apr 19, 2018 at 02:50:52PM +0530, Pavan Nikhilesh wrote:
> > > On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote:
> > > > On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> > > > > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> > > > >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> > > > >>> Add macro to mark a variable to be mostly read only and place it in a
> > > > >>> separate section.
> > > > >>>
> > > > >>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> > > > >>> ---
> > > > >>>
> > > > >>>  Group together mostly read only data to avoid cacheline bouncing, also
> > > > >>>  useful for auditing purposes.
> > > > >>>
> > > > >>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
> > > > >>>  1 file changed, 5 insertions(+)
> > > > >>>
> > > > >>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> > > > >>> index 6c5bc5a76..f2ff2e9e6 100644
> > > > >>> --- a/lib/librte_eal/common/include/rte_common.h
> > > > >>> +++ b/lib/librte_eal/common/include/rte_common.h
> > > > >>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> > > > >>>   */
> > > > >>>  #define __rte_noinline  __attribute__((noinline))
> > > > >>>
> > > > >>> +/**
> > > > >>> + * Mark a variable to be mostly read only and place it in a separate section.
> > > > >>> + */
> > > > >>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
> > > > >>
> > > > >
> > > > > Hi Ferruh,
> > > > >
> > > > >> Hi Pavan,
> > > > >>
> > > > >> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> > > > >> symbols together (to reduce cacheline bouncing)?
> > > > >
> > > > > The section .read_mostly is not treated specially it's just for grouping
> > > > > symbols.
> > > >
> > > > I have encounter with a blog post claiming this is not working:
> > > >
> > > > "
> > > > The problem with the above approach is that once all the __read_mostly variables
> > > > are grouped into one section, the remaining "non-read-mostly" variables end-up
> > > > together too. This increases the chances that two frequently used elements (in
> > > > the "non-read-mostly" region) will end-up competing for the same position (or
> > > > cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
> > > > cache. Thus frequent accesses will cause excessive cache thrashing on that
> > > > particular cache-line thereby degrading the overall system performance.
> > > > "
> > > >
> > > > https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html
> > > >
> > >
> > > The author is concerned about processors with less cache set-associativity,
> > > almost all modern processors have >= 16 way set associativity. And the above
> > > issue can happen even now when two frequently written global variables are
> > > placed next to each other.
> > >
> > > Currently, we don't have much control over how the global variables are
> > > arranged and a single addition/deletion to the global variables causes change
> > > in alignment and in some cases minor performance regression.
> > > Tagging them as __read_mostly we can easily identify the alignment changes
> > > across builds by comparing map files global variable section.
> > >
> > > I have verified the patch-set on arm64 (16-way set-associative) and didn't
> > > notice any performance regression.
> > > Did you have a chance to verify if there is any performance regression?
> > >
> > Is there a performance improvement? It's seems a relatively strange change
> > to me, so I'd like to know that it really improves performance in test
> > cases.
> 
> We had a performance regression of ~200k between 17.11 and 18.02 due enabling
> dpaa/dpaa2 in default config this was due to new global variables being added
> and changing the alignment.
> Moving read mostly global variables (logtypes/device arrays) to a separate
> section helps when tracking performance regression between builds.
> 
So it's of use when debugging, rather than providing a performance boost in
and of itself, right?

If performance regressions are appearing, should we then see about marking
globals with __rte_cache_align to force them all onto difference
cachelines?
  
Pavan Nikhilesh April 19, 2018, 3:55 p.m. UTC | #8
On Thu, Apr 19, 2018 at 04:37:23PM +0100, Bruce Richardson wrote:
> On Thu, Apr 19, 2018 at 08:48:33PM +0530, Pavan Nikhilesh wrote:
> > On Thu, Apr 19, 2018 at 01:09:58PM +0100, Bruce Richardson wrote:
> > > On Thu, Apr 19, 2018 at 02:50:52PM +0530, Pavan Nikhilesh wrote:
> > > > On Wed, Apr 18, 2018 at 07:03:06PM +0100, Ferruh Yigit wrote:
> > > > > On 4/18/2018 6:55 PM, Pavan Nikhilesh wrote:
> > > > > > On Wed, Apr 18, 2018 at 06:43:11PM +0100, Ferruh Yigit wrote:
> > > > > >> On 4/18/2018 4:30 PM, Pavan Nikhilesh wrote:
> > > > > >>> Add macro to mark a variable to be mostly read only and place it in a
> > > > > >>> separate section.
> > > > > >>>
> > > > > >>> Signed-off-by: Pavan Nikhilesh <pbhagavatula@caviumnetworks.com>
> > > > > >>> ---
> > > > > >>>
> > > > > >>>  Group together mostly read only data to avoid cacheline bouncing, also
> > > > > >>>  useful for auditing purposes.
> > > > > >>>
> > > > > >>>  lib/librte_eal/common/include/rte_common.h | 5 +++++
> > > > > >>>  1 file changed, 5 insertions(+)
> > > > > >>>
> > > > > >>> diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
> > > > > >>> index 6c5bc5a76..f2ff2e9e6 100644
> > > > > >>> --- a/lib/librte_eal/common/include/rte_common.h
> > > > > >>> +++ b/lib/librte_eal/common/include/rte_common.h
> > > > > >>> @@ -114,6 +114,11 @@ static void __attribute__((constructor(prio), used)) func(void)
> > > > > >>>   */
> > > > > >>>  #define __rte_noinline  __attribute__((noinline))
> > > > > >>>
> > > > > >>> +/**
> > > > > >>> + * Mark a variable to be mostly read only and place it in a separate section.
> > > > > >>> + */
> > > > > >>> +#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
> > > > > >>
> > > > > >
> > > > > > Hi Ferruh,
> > > > > >
> > > > > >> Hi Pavan,
> > > > > >>
> > > > > >> Is the section ".read_mostly" treated specially [1] or is this just for grouping
> > > > > >> symbols together (to reduce cacheline bouncing)?
> > > > > >
> > > > > > The section .read_mostly is not treated specially it's just for grouping
> > > > > > symbols.
> > > > >
> > > > > I have encounter with a blog post claiming this is not working:
> > > > >
> > > > > "
> > > > > The problem with the above approach is that once all the __read_mostly variables
> > > > > are grouped into one section, the remaining "non-read-mostly" variables end-up
> > > > > together too. This increases the chances that two frequently used elements (in
> > > > > the "non-read-mostly" region) will end-up competing for the same position (or
> > > > > cache-line, the basic fixed-sized block for memory<-->cache transfers) in the
> > > > > cache. Thus frequent accesses will cause excessive cache thrashing on that
> > > > > particular cache-line thereby degrading the overall system performance.
> > > > > "
> > > > >
> > > > > https://thecodeartist.blogspot.com/2011/12/why-readmostly-does-not-work-as-it.html
> > > > >
> > > >
> > > > The author is concerned about processors with less cache set-associativity,
> > > > almost all modern processors have >= 16 way set associativity. And the above
> > > > issue can happen even now when two frequently written global variables are
> > > > placed next to each other.
> > > >
> > > > Currently, we don't have much control over how the global variables are
> > > > arranged and a single addition/deletion to the global variables causes change
> > > > in alignment and in some cases minor performance regression.
> > > > Tagging them as __read_mostly we can easily identify the alignment changes
> > > > across builds by comparing map files global variable section.
> > > >
> > > > I have verified the patch-set on arm64 (16-way set-associative) and didn't
> > > > notice any performance regression.
> > > > Did you have a chance to verify if there is any performance regression?
> > > >
> > > Is there a performance improvement? It's seems a relatively strange change
> > > to me, so I'd like to know that it really improves performance in test
> > > cases.
> >
> > We had a performance regression of ~200k between 17.11 and 18.02 due enabling
> > dpaa/dpaa2 in default config this was due to new global variables being added
> > and changing the alignment.
> > Moving read mostly global variables (logtypes/device arrays) to a separate
> > section helps when tracking performance regression between builds.
> >
> So it's of use when debugging, rather than providing a performance boost in
> and of itself, right?
>
> If performance regressions are appearing, should we then see about marking
> globals with __rte_cache_align to force them all onto difference
> cachelines?

I think that would be a bit of overkill considering the number of logtype
variables.

Currently there are

29 global variables not found in any map file
List of globals
("['fw_file', 'mode_8023ad_ports', 'ecore_mz_count', 'igb_filter_rss_list', "
 "'crc16_ccitt_pmull', 'igb_filter_flex_list', 'rte_crypto_devices', "
 "'igb_flow_list', 'qman_clk', 'bman_ccsr_map', 'dpaa_portal_key', "
 "'bman_ip_rev', 'cons_filter', 'rte_rawdevices', 'internal_config', "
 "'bman_pool_max', 'qman_version', 'qman_ccsr_map', 'rte_event_devices', "
 "'qman_ip_rev', 'netcfg_interface', 'igb_filter_ntuple_list', "
 "'igb_filter_ethertype_list', 'skeldev_init_once', 'igb_filter_syn_list', "
 "'crc32_eth_pmull', 'global_portals_used', 'virtio_hw_internal', "
 "'vhost_devices']")

These are stats without including the logtype variables across all drivers.
With logtype variables included there are 70+.
  

Patch

diff --git a/lib/librte_eal/common/include/rte_common.h b/lib/librte_eal/common/include/rte_common.h
index 6c5bc5a76..f2ff2e9e6 100644
--- a/lib/librte_eal/common/include/rte_common.h
+++ b/lib/librte_eal/common/include/rte_common.h
@@ -114,6 +114,11 @@  static void __attribute__((constructor(prio), used)) func(void)
  */
 #define __rte_noinline  __attribute__((noinline))

+/**
+ * Mark a variable to be mostly read only and place it in a separate section.
+ */
+#define __rte_read_mostly __attribute__((__section__(".read_mostly")))
+
 /*********** Macros for pointer arithmetic ********/

 /**