[v6,1/4] lib: add generic support for reading PMU events
Commit Message
Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.
This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.
Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
MAINTAINERS | 5 +
app/test/meson.build | 4 +
app/test/test_pmu.c | 42 +++
doc/api/doxy-api-index.md | 3 +-
doc/api/doxy-api.conf.in | 1 +
doc/guides/prog_guide/profile_app.rst | 8 +
doc/guides/rel_notes/release_23_03.rst | 7 +
lib/meson.build | 1 +
lib/pmu/meson.build | 13 +
lib/pmu/pmu_private.h | 29 ++
lib/pmu/rte_pmu.c | 436 +++++++++++++++++++++++++
lib/pmu/rte_pmu.h | 206 ++++++++++++
lib/pmu/version.map | 19 ++
13 files changed, 773 insertions(+), 1 deletion(-)
create mode 100644 app/test/test_pmu.c
create mode 100644 lib/pmu/meson.build
create mode 100644 lib/pmu/pmu_private.h
create mode 100644 lib/pmu/rte_pmu.c
create mode 100644 lib/pmu/rte_pmu.h
create mode 100644 lib/pmu/version.map
Comments
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Friday, 20 January 2023 00.39
>
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
>
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
If you insist on passing lcore_id around as a function parameter, the function description must mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a requirement, including all functions that use those functions.
Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use rte_lcore_id() instead.
On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
>
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
> MAINTAINERS | 5 +
> app/test/meson.build | 4 +
> app/test/test_pmu.c | 42 +++
> doc/api/doxy-api-index.md | 3 +-
> doc/api/doxy-api.conf.in | 1 +
> doc/guides/prog_guide/profile_app.rst | 8 +
> doc/guides/rel_notes/release_23_03.rst | 7 +
> lib/meson.build | 1 +
> lib/pmu/meson.build | 13 +
> lib/pmu/pmu_private.h | 29 ++
> lib/pmu/rte_pmu.c | 436 +++++++++++++++++++++++++
> lib/pmu/rte_pmu.h | 206 ++++++++++++
> lib/pmu/version.map | 19 ++
> 13 files changed, 773 insertions(+), 1 deletion(-)
> create mode 100644 app/test/test_pmu.c
> create mode 100644 lib/pmu/meson.build
> create mode 100644 lib/pmu/pmu_private.h
> create mode 100644 lib/pmu/rte_pmu.c
> create mode 100644 lib/pmu/rte_pmu.h
> create mode 100644 lib/pmu/version.map
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a0f416d2e..9f13eafd95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
> M: Pavan Nikhilesh <pbhagavatula@marvell.com>
> F: lib/node/
>
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>
> Test Applications
> -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..b2c2a618b1 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
> test_sources += ['test_metrics.c']
> fast_tests += [['metrics_autotest', true, true]]
> endif
> +if is_linux
> + test_sources += ['test_pmu.c']
> + fast_tests += [['pmu_autotest', true, true]]
> +endif
traditionally we don't conditionally include tests at the meson.build
level, instead we run all tests and have them skip when executed for
unsupported exec environments.
you can take a look at test_eventdev.c as an example for a test that is
skipped on windows, i'm sure it could be adapted to skip on freebsd if
you aren't supporting it.
>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Friday, January 20, 2023 7:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>> MAINTAINERS | 5 +
>> app/test/meson.build | 4 +
>> app/test/test_pmu.c | 42 +++
>> doc/api/doxy-api-index.md | 3 +-
>> doc/api/doxy-api.conf.in | 1 +
>> doc/guides/prog_guide/profile_app.rst | 8 +
>> doc/guides/rel_notes/release_23_03.rst | 7 +
>> lib/meson.build | 1 +
>> lib/pmu/meson.build | 13 +
>> lib/pmu/pmu_private.h | 29 ++
>> lib/pmu/rte_pmu.c | 436 +++++++++++++++++++++++++
>> lib/pmu/rte_pmu.h | 206 ++++++++++++
>> lib/pmu/version.map | 19 ++
>> 13 files changed, 773 insertions(+), 1 deletion(-)
>> create mode 100644 app/test/test_pmu.c
>> create mode 100644 lib/pmu/meson.build
>> create mode 100644 lib/pmu/pmu_private.h
>> create mode 100644 lib/pmu/rte_pmu.c
>> create mode 100644 lib/pmu/rte_pmu.h
>> create mode 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS index 9a0f416d2e..9f13eafd95
>> 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>> M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>> F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>> Test Applications
>> -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build index
>> f34d19e3c3..b2c2a618b1 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>> test_sources += ['test_metrics.c']
>> fast_tests += [['metrics_autotest', true, true]]
>> endif
>> +if is_linux
>> + test_sources += ['test_pmu.c']
>> + fast_tests += [['pmu_autotest', true, true]]
>> +endif
>
>traditionally we don't conditionally include tests at the meson.build level, instead we run all
>tests and have them skip when executed for unsupported exec environments.
>
>you can take a look at test_eventdev.c as an example for a test that is skipped on windows, i'm
>sure it could be adapted to skip on freebsd if you aren't supporting it.
Right, this looks better. Thanks for pointing this out.
>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Friday, January 20, 2023 10:47 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Friday, 20 January 2023 00.39
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>If you insist on passing lcore_id around as a function parameter, the function description must
>mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a
>requirement, including all functions that use those functions.
>
Not sure why you are insisting so much on removing that rte_lcore_id(). Yes, that macro evaluates
to an integer, but if you don't think about the internals it resembles a function call.
Then the natural pattern is to call it once and reuse the result where possible. Passing lcore_id around
implies that calls are per l-core; why would that confuse anyone reading that code?
Besides, all functions taking it are internal, hence you cannot call them elsewhere.
>Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use
>rte_lcore_id() instead.
>
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 10.40
>
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Friday, January 20, 2023 10:47 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Friday, 20 January 2023 00.39
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >If you insist on passing lcore_id around as a function parameter, the
> function description must
> >mention that the lcore_id parameter must be set to rte_lcore_id() for
> the functions where this is a
> >requirement, including all functions that use those functions.
Perhaps I'm stating this wrong, so let me try to rephrase:
As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.
Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.
[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
> >
>
> Not sure why are you insisting so much on removing that rte_lcore_id().
> Yes that macro evaluates
> to integer but if you don't think about internals this resembles a
> function call.
I agree with this argument. And for that reason, passing lcore_id around could be relevant.
I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.
>
> Then natural pattern is to call it once and reuse results if possible.
Yes, and I would usually agree to using this pattern.
> Passing lcore_id around
> implies that calls are per l-core, why would that confuse anyone
> reading that code?
This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!
E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().
[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>
> Besides, all functions taking it are internal stuff hence you cannot
> call it elsewhere.
OK. I agree that this reduces the risk of incorrect use.
Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.
And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.
It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
>
> >Alternatively, follow my previous suggestion: Omit the lcore_id
> function parameter, and use
> >rte_lcore_id() instead.
> >
>
On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Thursday, 26 January 2023 10.40
> >
> > >From: Morten Brørup <mb@smartsharesystems.com>
> > >Sent: Friday, January 20, 2023 10:47 AM
> > >
> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > >> Sent: Friday, 20 January 2023 00.39
> > >>
> > >> Add support for programming PMU counters and reading their values in
> > >> runtime bypassing kernel completely.
> > >>
> > >> This is especially useful in cases where CPU cores are isolated
> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > >> standard perf utility without sacrificing latency and performance.
> > >>
> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > >> ---
> > >
> > >If you insist on passing lcore_id around as a function parameter, the
> > function description must
> > >mention that the lcore_id parameter must be set to rte_lcore_id() for
> > the functions where this is a
> > >requirement, including all functions that use those functions.
>
> Perhaps I'm stating this wrong, so let me try to rephrase:
>
> As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.
>
> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.
>
> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>
> > >
> >
> > Not sure why are you insisting so much on removing that rte_lcore_id().
> > Yes that macro evaluates
> > to integer but if you don't think about internals this resembles a
> > function call.
>
> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>
> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.
>
> >
> > Then natural pattern is to call it once and reuse results if possible.
>
> Yes, and I would usually agree to using this pattern.
>
> > Passing lcore_id around
> > implies that calls are per l-core, why would that confuse anyone
> > reading that code?
>
> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!
>
> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().
>
> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>
> >
> > Besides, all functions taking it are internal stuff hence you cannot
> > call it elsewhere.
>
> OK. I agree that this reduces the risk of incorrect use.
>
> Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.
>
> And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.
>
> It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
>
I would tend to agree with you Morten. If the lcore_id parameter to the
function must be rte_lcore_id(), then I think it's error prone to have that
as an explicit parameter, and that the function should always get the core
id itself.
Other possible complication is - how does this work with threads that are
not pinned to a particular physical core? Do things work as expected in
that case?
/Bruce
>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 26, 2023 1:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 26 January 2023 10.40
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Friday, January 20, 2023 10:47 AM
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Friday, 20 January 2023 00.39
>> >>
>> >> Add support for programming PMU counters and reading their values
>> >> in runtime bypassing kernel completely.
>> >>
>> >> This is especially useful in cases where CPU cores are isolated
>> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> >> standard perf utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> ---
>> >
>> >If you insist on passing lcore_id around as a function parameter, the
>> function description must
>> >mention that the lcore_id parameter must be set to rte_lcore_id() for
>> the functions where this is a
>> >requirement, including all functions that use those functions.
>
>Perhaps I'm stating this wrong, so let me try to rephrase:
>
>As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>
>Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call
>those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter,
>which you at the bottom of this email [1] explained is not possible.
>
>[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>
>> >
>>
>> Not sure why are you insisting so much on removing that rte_lcore_id().
>> Yes that macro evaluates
>> to integer but if you don't think about internals this resembles a
>> function call.
>
>I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>
>I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>
>>
>> Then natural pattern is to call it once and reuse results if possible.
>
>Yes, and I would usually agree to using this pattern.
>
>> Passing lcore_id around
>> implies that calls are per l-core, why would that confuse anyone
>> reading that code?
>
>This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>
>E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>
Oh, now I got your point!
Okay then, if this is going to cause confusion because of misleading
self-documenting code, I'll change it.
>[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>
>>
>> Besides, all functions taking it are internal stuff hence you cannot
>> call it elsewhere.
>
>OK. I agree that this reduces the risk of incorrect use.
>
>Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>
>And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>
>It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>
>>
>> >Alternatively, follow my previous suggestion: Omit the lcore_id
>> function parameter, and use
>> >rte_lcore_id() instead.
>> >
>>
>-----Original Message-----
>From: Bruce Richardson <bruce.richardson@intel.com>
>Sent: Thursday, January 26, 2023 1:59 PM
>To: Morten Brørup <mb@smartsharesystems.com>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>;
>Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; roretzla@linux.microsoft.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
>> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > Sent: Thursday, 26 January 2023 10.40
>> >
>> > >From: Morten Brørup <mb@smartsharesystems.com>
>> > >Sent: Friday, January 20, 2023 10:47 AM
>> > >
>> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > >> Sent: Friday, 20 January 2023 00.39
>> > >>
>> > >> Add support for programming PMU counters and reading their values
>> > >> in runtime bypassing kernel completely.
>> > >>
>> > >> This is especially useful in cases where CPU cores are isolated
>> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> > >> standard perf utility without sacrificing latency and performance.
>> > >>
>> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> > >> ---
>> > >
>> > >If you insist on passing lcore_id around as a function parameter,
>> > >the
>> > function description must
>> > >mention that the lcore_id parameter must be set to rte_lcore_id()
>> > >for
>> > the functions where this is a
>> > >requirement, including all functions that use those functions.
>>
>> Perhaps I'm stating this wrong, so let me try to rephrase:
>>
>> As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>>
>> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to
>call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a
>parameter, which you at the bottom of this email [1] explained is not possible.
>>
>> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>>
>> > >
>> >
>> > Not sure why are you insisting so much on removing that rte_lcore_id().
>> > Yes that macro evaluates
>> > to integer but if you don't think about internals this resembles a
>> > function call.
>>
>> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>>
>> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>>
>> >
>> > Then natural pattern is to call it once and reuse results if possible.
>>
>> Yes, and I would usually agree to using this pattern.
>>
>> > Passing lcore_id around
>> > implies that calls are per l-core, why would that confuse anyone
>> > reading that code?
>>
>> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>>
>> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>>
>> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>>
>> >
>> > Besides, all functions taking it are internal stuff hence you cannot
>> > call it elsewhere.
>>
>> OK. I agree that this reduces the risk of incorrect use.
>>
>> Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>>
>> And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>>
>> It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>>
>I would tend to agree with you Morten. If the lcore_id parameter to the function must be
>rte_lcore_id(), then I think it's error prone to have that as an explicit parameter, and that the
>function should always get the core id itself.
>
>Other possible complication is - how does this work with threads that are not pinned to a
>particular physical core? Do things work as expected in that case?
>
It's assumed that once a set of counters is enabled on a particular l-core, that thread shouldn't be
migrating back and forth, for obvious reasons.
But once scheduled elsewhere, all should still work as expected.
>/Bruce
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 16.28
>
> >From: Bruce Richardson <bruce.richardson@intel.com>
> >Sent: Thursday, January 26, 2023 1:59 PM
> >
> >Other possible complication is - how does this work with threads that
> are not pinned to a
> >particular physical core? Do things work as expected in that case?
> >
>
> It's assumed that once a set of counters is enabled on a particular l-core,
> that thread shouldn't be migrating back and forth, for obvious reasons.
>
> But once scheduled elsewhere, all should still work as expected.
Just to elaborate what Tomasz stated here...
The patch contains this line (code comments are mine):
return syscall(SYS_perf_event_open, &attr, /*pid*/ 0, /*cpu*/ -1, group_fd, 0);
And man 2 perf_event_open [1] says:
pid == 0 and cpu == -1:
This measures the calling process/thread on any CPU.
So it should work just fine, even when a thread is not pinned to a particular physical core.
[1]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
M: Pavan Nikhilesh <pbhagavatula@marvell.com>
F: lib/node/
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
Test Applications
-----------------
@@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
test_sources += ['test_metrics.c']
fast_tests += [['metrics_autotest', true, true]]
endif
+if is_linux
+ test_sources += ['test_pmu.c']
+ fast_tests += [['pmu_autotest', true, true]]
+endif
if not is_windows and dpdk_conf.has('RTE_LIB_TELEMETRY')
test_sources += ['test_telemetry_json.c', 'test_telemetry_data.c']
fast_tests += [['telemetry_json_autotest', true, true]]
new file mode 100644
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+ int tries = 10, event = -1;
+ uint64_t val = 0;
+
+ if (rte_pmu_init() < 0)
+ return TEST_FAILED;
+
+ while (tries--)
+ val += rte_pmu_read(event);
+
+ rte_pmu_fini();
+
+ return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+ .suite_name = "pmu autotest",
+ .setup = NULL,
+ .teardown = NULL,
+ .unit_test_cases = {
+ TEST_CASE(test_pmu_read),
+ TEST_CASES_END()
+ }
+};
+
+static int
+test_pmu(void)
+{
+ return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
[log](@ref rte_log.h),
[errno](@ref rte_errno.h),
[trace](@ref rte_trace.h),
- [trace_point](@ref rte_trace_point.h)
+ [trace_point](@ref rte_trace_point.h),
+ [pmu](@ref rte_pmu.h)
- **misc**:
[EAL config](@ref rte_eal.h),
@@ -63,6 +63,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \
@TOPDIR@/lib/pci \
@TOPDIR@/lib/pdump \
@TOPDIR@/lib/pipeline \
+ @TOPDIR@/lib/pmu \
@TOPDIR@/lib/port \
@TOPDIR@/lib/power \
@TOPDIR@/lib/rawdev \
@@ -7,6 +7,14 @@ Profile Your Application
The following sections describe methods of profiling DPDK applications on
different architectures.
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
Profiling on x86
----------------
@@ -57,6 +57,13 @@ New Features
* **Added multi-process support for axgbe PMD.**
+* **Added PMU library.**
+
+ Added a new PMU (performance measurement unit) library which allows applications
+ to perform self monitoring activities without depending on external utilities like perf.
+ After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+ can be stored in CTF format for further analysis.
+
Removed Items
-------------
@@ -11,6 +11,7 @@
libraries = [
'kvargs', # eal depends on kvargs
'telemetry', # basic info querying
+ 'pmu',
'eal', # everything depends on eal
'ring',
'rcu', # rcu depends on ring
new file mode 100644
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+ subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
new file mode 100644
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ * 0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
new file mode 100644
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+ return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+ char *config = NULL;
+ char path[PATH_MAX];
+ int high, low, ret;
+ FILE *fp;
+
+ /* quiesce -Wmaybe-uninitialized warning */
+ *num = 0;
+ *mask = 0;
+
+ snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+ fp = fopen(path, "r");
+ if (fp == NULL)
+ return -errno;
+
+ errno = 0;
+ ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+ if (ret < 2) {
+ ret = -ENODATA;
+ goto out;
+ }
+ if (errno) {
+ ret = -errno;
+ goto out;
+ }
+
+ if (ret == 2)
+ high = low;
+
+ *mask = GENMASK_ULL(high, low);
+ /* Last digit should be [012]. If the last digit is missing, 0 is implied. */
+ *num = config[strlen(config) - 1];
+ *num = isdigit(*num) ? *num - '0' : 0;
+
+ ret = 0;
+out:
+ free(config);
+ fclose(fp);
+
+ return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+ char *token, *term;
+ int num, ret, val;
+ uint64_t mask;
+
+ config[0] = config[1] = config[2] = 0;
+
+ token = strtok(buf, ",");
+ while (token) {
+ errno = 0;
+ /* <term>=<value> */
+ ret = sscanf(token, "%m[^=]=%i", &term, &val);
+ if (ret < 1)
+ return -ENODATA;
+ if (errno)
+ return -errno;
+ if (ret == 1)
+ val = 1;
+
+ ret = get_term_format(term, &num, &mask);
+ free(term);
+ if (ret)
+ return ret;
+
+ config[num] |= FIELD_PREP(mask, val);
+ token = strtok(NULL, ",");
+ }
+
+ return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+ char path[PATH_MAX], buf[BUFSIZ];
+ FILE *fp;
+ int ret;
+
+ snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+ fp = fopen(path, "r");
+ if (fp == NULL)
+ return -errno;
+
+ ret = fread(buf, 1, sizeof(buf) - 1, fp);
+ if (ret == 0) {
+ fclose(fp);
+
+ return -EINVAL;
+ }
+ fclose(fp);
+ buf[ret] = '\0';
+
+ return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+ struct perf_event_attr attr = {
+ .size = sizeof(struct perf_event_attr),
+ .type = PERF_TYPE_RAW,
+ .exclude_kernel = 1,
+ .exclude_hv = 1,
+ .disabled = 1,
+ };
+
+ pmu_arch_fixup_config(config);
+
+ attr.config = config[0];
+ attr.config1 = config[1];
+ attr.config2 = config[2];
+
+ return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+ struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+ struct rte_pmu_event *event;
+ uint64_t config[3];
+ int num = 0, ret;
+
+ /* fds live in static storage and are zero-initialized; mark all slots as
+ * unused so cleanup does not mistake fd 0 for a valid event descriptor
+ */
+ for (num = 0; num < MAX_NUM_GROUP_EVENTS; num++)
+ group->fds[num] = -1;
+ num = 0;
+
+ /* group leader is opened first, with group_fd = -1 */
+
+ TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+ ret = get_event_config(event->name, config);
+ if (ret)
+ continue;
+
+ ret = do_perf_event_open(config, group->fds[0]);
+ if (ret == -1) {
+ ret = -errno;
+ goto out;
+ }
+
+ group->fds[event->index] = ret;
+ num++;
+ }
+
+ return 0;
+out:
+ for (--num; num >= 0; num--) {
+ close(group->fds[num]);
+ group->fds[num] = -1;
+ }
+
+ return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+ struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+ long page_size = sysconf(_SC_PAGE_SIZE);
+ unsigned int i;
+ void *addr;
+ int ret;
+
+ for (i = 0; i < rte_pmu.num_group_events; i++) {
+ addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+ if (addr == MAP_FAILED) {
+ ret = -errno;
+ goto out;
+ }
+
+ group->mmap_pages[i] = addr;
+ }
+
+ return 0;
+out:
+ for (; i; i--) {
+ munmap(group->mmap_pages[i - 1], page_size);
+ group->mmap_pages[i - 1] = NULL;
+ }
+
+ return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+ struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+ unsigned int i;
+
+ if (group->fds[0] != -1)
+ ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+ for (i = 0; i < rte_pmu.num_group_events; i++) {
+ if (group->mmap_pages[i]) {
+ munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+ group->mmap_pages[i] = NULL;
+ }
+
+ if (group->fds[i] != -1) {
+ close(group->fds[i]);
+ group->fds[i] = -1;
+ }
+ }
+
+ group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+ struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+ int ret;
+
+ if (rte_pmu.num_group_events == 0)
+ return -ENODEV;
+
+ ret = open_events(lcore_id);
+ if (ret)
+ goto out;
+
+ ret = mmap_events(lcore_id);
+ if (ret)
+ goto out;
+
+ if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+ ret = -errno;
+ goto out;
+ }
+
+ if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+ ret = -errno;
+ goto out;
+ }
+
+ return 0;
+
+out:
+ cleanup_events(lcore_id);
+
+ return ret;
+}
+
+static int
+scan_pmus(void)
+{
+ char path[PATH_MAX];
+ struct dirent *dent;
+ const char *name;
+ DIR *dirp;
+
+ dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+ if (dirp == NULL)
+ return -errno;
+
+ while ((dent = readdir(dirp))) {
+ name = dent->d_name;
+ if (name[0] == '.')
+ continue;
+
+ /* core PMU is either named "cpu" or exposes a "cpus" file */
+ if (!strcmp(name, "cpu"))
+ break;
+
+ snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+ if (access(path, F_OK) == 0)
+ break;
+ }
+
+ closedir(dirp);
+
+ if (dent) {
+ rte_pmu.name = strdup(name);
+ if (rte_pmu.name == NULL)
+ return -ENOMEM;
+ }
+
+ return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+ struct rte_pmu_event *event;
+ char path[PATH_MAX];
+
+ if (rte_pmu.name == NULL)
+ return -ENODEV;
+
+ if (rte_pmu.num_group_events >= MAX_NUM_GROUP_EVENTS)
+ return -ENOSPC;
+
+ snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+ if (access(path, R_OK))
+ return -ENODEV;
+
+ TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+ if (!strcmp(event->name, name))
+ return event->index;
+ }
+
+ event = calloc(1, sizeof(*event));
+ if (event == NULL)
+ return -ENOMEM;
+
+ event->name = strdup(name);
+ if (event->name == NULL) {
+ free(event);
+
+ return -ENOMEM;
+ }
+
+ event->index = rte_pmu.num_group_events++;
+ TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+ return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+ int ret;
+
+ /* Allow calling init from multiple contexts within a single thread. This simplifies
+ * resource management a bit, e.g. in case a fast-path tracepoint has already been enabled
+ * via the command line but the application performs its own init/fini sequence anyway.
+ */
+ if (rte_pmu.initialized) {
+ rte_pmu.initialized++;
+
+ return 0;
+ }
+
+ ret = scan_pmus();
+ if (ret)
+ goto out;
+
+ TAILQ_INIT(&rte_pmu.event_list);
+
+ ret = pmu_arch_init();
+ if (ret)
+ goto out;
+
+ rte_pmu.initialized = 1;
+
+ return 0;
+out:
+ free(rte_pmu.name);
+ rte_pmu.name = NULL;
+
+ return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+ struct rte_pmu_event *event, *tmp;
+ unsigned int i;
+
+ /* cleanup once init count drops to zero */
+ if (!rte_pmu.initialized || --rte_pmu.initialized)
+ return;
+
+ RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+ TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+ free(event->name);
+ free(event);
+ }
+
+ for (i = 0; i < RTE_MAX_LCORE; i++)
+ cleanup_events(i);
+
+ pmu_arch_fini();
+ free(rte_pmu.name);
+ rte_pmu.name = NULL;
+ rte_pmu.num_group_events = 0;
+}
new file mode 100644
@@ -0,0 +1,206 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines the generic API and types necessary to set up the PMU and
+ * read selected counters at runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_lcore.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+ struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+ int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+ bool enabled; /**< true if group was enabled on particular lcore */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+ char *name; /**< name of an event */
+ unsigned int index; /**< event index into fds/mmap_pages */
+ TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+ char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+ struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+ unsigned int num_group_events; /**< number of events in a group */
+ TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+ unsigned int initialized; /**< initialization counter */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ * Pointer to the mmapped user page.
+ * @return
+ * Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+ uint64_t width, offset;
+ uint32_t seq, index;
+ int64_t pmc;
+
+ for (;;) {
+ seq = pc->lock;
+ rte_compiler_barrier();
+ index = pc->index;
+ offset = pc->offset;
+ width = pc->pmc_width;
+
+ /* index set to 0 means that particular counter cannot be used */
+ if (likely(pc->cap_user_rdpmc && index)) {
+ pmc = rte_pmu_pmc_read(index - 1);
+ pmc <<= 64 - width;
+ pmc >>= 64 - width;
+ offset += pmc;
+ }
+
+ rte_compiler_barrier();
+
+ if (likely(pc->lock == seq))
+ return offset;
+ }
+
+ return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on a given lcore.
+ *
+ * @param lcore_id
+ * The identifier of the lcore.
+ * @return
+ * 0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ * 0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ * Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ * Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ * Index of an event to be read.
+ * @return
+ * Event value read from register. In case of errors or lack of support
+ * 0 is returned. In other words, a stream of zeros in a trace file
+ * indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct rte_pmu_event_group *group;
+ int ret;
+
+ if (unlikely(!rte_pmu.initialized))
+ return 0;
+
+ /* non-EAL threads carry no valid lcore id, hence no event group either */
+ if (unlikely(lcore_id >= RTE_MAX_LCORE))
+ return 0;
+
+ group = &rte_pmu.group[lcore_id];
+ if (unlikely(!group->enabled)) {
+ ret = rte_pmu_enable_group(lcore_id);
+ if (ret)
+ return 0;
+
+ group->enabled = true;
+ }
+
+ if (unlikely(index >= rte_pmu.num_group_events))
+ return 0;
+
+ return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
new file mode 100644
@@ -0,0 +1,19 @@
+DPDK_23 {
+ local: *;
+};
+
+EXPERIMENTAL {
+ global:
+
+ rte_pmu;
+ rte_pmu_add_event;
+ rte_pmu_fini;
+ rte_pmu_init;
+ rte_pmu_read;
+};
+
+INTERNAL {
+ global:
+
+ rte_pmu_enable_group;
+};