[v6,1/4] lib: add generic support for reading PMU events

Message ID 20230119233916.4029128-2-tduszynski@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series add support for self monitoring |

Checks

Context Check Description
ci/loongarch-compilation warning apply patch failure
ci/checkpatch success coding style OK
ci/iol-testing warning apply patch failure

Commit Message

Tomasz Duszynski Jan. 19, 2023, 11:39 p.m. UTC
  Add support for programming PMU counters and reading their values
in runtime bypassing kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full) i.e run dedicated tasks. In such cases one cannot use
standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   4 +
 app/test/test_pmu.c                    |  42 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 206 ++++++++++++
 lib/pmu/version.map                    |  19 ++
 13 files changed, 773 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map
  

Comments

Morten Brørup Jan. 20, 2023, 9:46 a.m. UTC | #1
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Friday, 20 January 2023 00.39
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---

If you insist on passing lcore_id around as a function parameter, the function description must mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a requirement, including all functions that use those functions.

Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use rte_lcore_id() instead.
  
Tyler Retzlaff Jan. 20, 2023, 6:29 p.m. UTC | #2
On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   4 +
>  app/test/test_pmu.c                    |  42 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   7 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>  lib/pmu/version.map                    |  19 ++
>  13 files changed, 773 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a0f416d2e..9f13eafd95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>  F: lib/node/
>  
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>  
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..b2c2a618b1 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>      test_sources += ['test_metrics.c']
>      fast_tests += [['metrics_autotest', true, true]]
>  endif
> +if is_linux
> +    test_sources += ['test_pmu.c']
> +    fast_tests += [['pmu_autotest', true, true]]
> +endif

traditionally we don't conditionally include tests at the meson.build
level, instead we run all tests and have them skip when executed for
unsupported exec environments.

you can take a look at test_eventdev.c as an example for a test that is
skipped on windows, i'm sure it could be adapted to skip on freebsd if
you aren't supporting it.
  
Tomasz Duszynski Jan. 26, 2023, 9:05 a.m. UTC | #3
>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Friday, January 20, 2023 7:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>>  MAINTAINERS                            |   5 +
>>  app/test/meson.build                   |   4 +
>>  app/test/test_pmu.c                    |  42 +++
>>  doc/api/doxy-api-index.md              |   3 +-
>>  doc/api/doxy-api.conf.in               |   1 +
>>  doc/guides/prog_guide/profile_app.rst  |   8 +
>>  doc/guides/rel_notes/release_23_03.rst |   7 +
>>  lib/meson.build                        |   1 +
>>  lib/pmu/meson.build                    |  13 +
>>  lib/pmu/pmu_private.h                  |  29 ++
>>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>>  lib/pmu/version.map                    |  19 ++
>>  13 files changed, 773 insertions(+), 1 deletion(-)  create mode
>> 100644 app/test/test_pmu.c  create mode 100644 lib/pmu/meson.build
>> create mode 100644 lib/pmu/pmu_private.h  create mode 100644
>> lib/pmu/rte_pmu.c  create mode 100644 lib/pmu/rte_pmu.h  create mode
>> 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS index 9a0f416d2e..9f13eafd95
>> 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>  F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>>  Test Applications
>>  -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build index
>> f34d19e3c3..b2c2a618b1 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>>      test_sources += ['test_metrics.c']
>>      fast_tests += [['metrics_autotest', true, true]]  endif
>> +if is_linux
>> +    test_sources += ['test_pmu.c']
>> +    fast_tests += [['pmu_autotest', true, true]] endif
>
>traditionally we don't conditionally include tests at the meson.build level, instead we run all
>tests and have them skip when executed for unsupported exec environments.
>
>you can take a look at test_eventdev.c as an example for a test that is skipped on windows, i'm
>sure it could be adapted to skip on freebsd if you aren't supporting it.

Right, this looks better. Thanks for pointing this out.
  
Tomasz Duszynski Jan. 26, 2023, 9:40 a.m. UTC | #4
>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Friday, January 20, 2023 10:47 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Friday, 20 January 2023 00.39
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>If you insist on passing lcore_id around as a function parameter, the function description must
>mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a
>requirement, including all functions that use those functions.
>

Not sure why are you insisting so much on removing that rte_lcore_id(). Yes that macro evaluates
to integer but if you don't think about internals this resembles a function call.

Then natural pattern is to call it once and reuse results if possible. Passing lcore_id around
implies that calls are per l-core, why would that confuse anyone reading that code?

Besides, all functions taking it are internal stuff hence you cannot call it elsewhere. 

>Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use
>rte_lcore_id() instead.
>
  
Morten Brørup Jan. 26, 2023, 12:29 p.m. UTC | #5
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 10.40
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Friday, January 20, 2023 10:47 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Friday, 20 January 2023 00.39
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >If you insist on passing lcore_id around as a function parameter, the
> function description must
> >mention that the lcore_id parameter must be set to rte_lcore_id() for
> the functions where this is a
> >requirement, including all functions that use those functions.

Perhaps I'm stating this wrong, so let me try to rephrase:

As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.

Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.

[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/

> >
> 
> Not sure why are you insisting so much on removing that rte_lcore_id().
> Yes that macro evaluates
> to integer but if you don't think about internals this resembles a
> function call.

I agree with this argument. And for that reason, passing lcore_id around could be relevant.

I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.

> 
> Then natural pattern is to call it once and reuse results if possible.

Yes, and I would usually agree to using this pattern.

> Passing lcore_id around
> implies that calls are per l-core, why would that confuse anyone
> reading that code?

This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!

E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().

[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315

> 
> Besides, all functions taking it are internal stuff hence you cannot
> call it elsewhere.

OK. I agree that this reduces the risk of incorrect use.

Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.

And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.

It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).

> 
> >Alternatively, follow my previous suggestion: Omit the lcore_id
> function parameter, and use
> >rte_lcore_id() instead.
> >
>
  
Bruce Richardson Jan. 26, 2023, 12:59 p.m. UTC | #6
On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Thursday, 26 January 2023 10.40
> > 
> > >From: Morten Brørup <mb@smartsharesystems.com>
> > >Sent: Friday, January 20, 2023 10:47 AM
> > >
> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > >> Sent: Friday, 20 January 2023 00.39
> > >>
> > >> Add support for programming PMU counters and reading their values in
> > >> runtime bypassing kernel completely.
> > >>
> > >> This is especially useful in cases where CPU cores are isolated
> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > >> standard perf utility without sacrificing latency and performance.
> > >>
> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > >> ---
> > >
> > >If you insist on passing lcore_id around as a function parameter, the
> > function description must
> > >mention that the lcore_id parameter must be set to rte_lcore_id() for
> > the functions where this is a
> > >requirement, including all functions that use those functions.
> 
> Perhaps I'm stating this wrong, so let me try to rephrase:
> 
> As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.
> 
> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.
> 
> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
> 
> > >
> > 
> > Not sure why are you insisting so much on removing that rte_lcore_id().
> > Yes that macro evaluates
> > to integer but if you don't think about internals this resembles a
> > function call.
> 
> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
> 
> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.
> 
> > 
> > Then natural pattern is to call it once and reuse results if possible.
> 
> Yes, and I would usually agree to using this pattern.
> 
> > Passing lcore_id around
> > implies that calls are per l-core, why would that confuse anyone
> > reading that code?
> 
> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!
> 
> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().
> 
> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
> 
> > 
> > Besides, all functions taking it are internal stuff hence you cannot
> > call it elsewhere.
> 
> OK. I agree that this reduces the risk of incorrect use.
> 
> Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.
> 
> And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.
> 
> It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
> 
I would tend to agree with you Morten. If the lcore_id parameter to the
function must be rte_lcore_id(), then I think it's error prone to have that
as an explicit parameter, and that the function should always get the core
id itself.

Other possible complication is - how does this work with threads that are
not pinned to a particular physical core? Do things work as expected in
that case?

/Bruce
  
Tomasz Duszynski Jan. 26, 2023, 3:17 p.m. UTC | #7
>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 26, 2023 1:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 26 January 2023 10.40
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Friday, January 20, 2023 10:47 AM
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Friday, 20 January 2023 00.39
>> >>
>> >> Add support for programming PMU counters and reading their values
>> >> in runtime bypassing kernel completely.
>> >>
>> >> This is especially useful in cases where CPU cores are isolated
>> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> >> standard perf utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> ---
>> >
>> >If you insist on passing lcore_id around as a function parameter, the
>> function description must
>> >mention that the lcore_id parameter must be set to rte_lcore_id() for
>> the functions where this is a
>> >requirement, including all functions that use those functions.
>
>Perhaps I'm stating this wrong, so let me try to rephrase:
>
>As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>
>Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call
>those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter,
>which you at the bottom of this email [1] explained is not possible.
>
>[1]: https://urldefense.proofpoint.com/v2/url?u=http-
>3A__inbox.dpdk.org_dev_DM4PR18MB4368461EC42603F77A7DC1BCD2E09-
>40DM4PR18MB4368.namprd18.prod.outlook.com_&d=DwIFAw&c=nKjWec2b6R0mOyPaz7xtfQ&r=PZNXgrbjdlXxVEEGYkxI
>xRndyEUwWU_ad5ce22YI6Is&m=QkMcmM2epUOCdRd6xI5o3d2nQaqruy0GvOQUgbn75cLlzobEVMwLBUiGXADiuvVz&s=5K9oM8
>e7u52C_0_5xtWIKl31aXRHhJDKoQTDQp5EHWY&e=
>
>> >
>>
>> Not sure why are you insisting so much on removing that rte_lcore_id().
>> Yes that macro evaluates
>> to integer but if you don't think about internals this resembles a
>> function call.
>
>I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>
>I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>
>>
>> Then natural pattern is to call it once and reuse results if possible.
>
>Yes, and I would usually agree to using this pattern.
>
>> Passing lcore_id around
>> implies that calls are per l-core, why would that confuse anyone
>> reading that code?
>
>This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>
>E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>

Oh, now I got your point!

Okay then, if this is going to cause confusion because of misleading
self-documenting code I'll change that.  

>[2]: https://urldefense.proofpoint.com/v2/url?u=https-
>3A__elixir.bootlin.com_dpdk_latest_source_lib_mempool_rte-5Fmempool.h-
>23L1315&d=DwIFAw&c=nKjWec2b6R0mOyPaz7xtfQ&r=PZNXgrbjdlXxVEEGYkxIxRndyEUwWU_ad5ce22YI6Is&m=QkMcmM2ep
>UOCdRd6xI5o3d2nQaqruy0GvOQUgbn75cLlzobEVMwLBUiGXADiuvVz&s=4pnL_TZcVhj476u19ybcn2Rbad6OTb3k2U-
>nhFvhZ0k&e=
>
>>
>> Besides, all functions taking it are internal stuff hence you cannot
>> call it elsewhere.
>
>OK. I agree that this reduces the risk of incorrect use.
>
>Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>
>And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>
>It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>
>>
>> >Alternatively, follow my previous suggestion: Omit the lcore_id
>> function parameter, and use
>> >rte_lcore_id() instead.
>> >
>>
  
Tomasz Duszynski Jan. 26, 2023, 3:28 p.m. UTC | #8
>-----Original Message-----
>From: Bruce Richardson <bruce.richardson@intel.com>
>Sent: Thursday, January 26, 2023 1:59 PM
>To: Morten Brørup <mb@smartsharesystems.com>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>;
>Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; roretzla@linux.microsoft.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
>> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > Sent: Thursday, 26 January 2023 10.40
>> >
>> > >From: Morten Brørup <mb@smartsharesystems.com>
>> > >Sent: Friday, January 20, 2023 10:47 AM
>> > >
>> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > >> Sent: Friday, 20 January 2023 00.39
>> > >>
>> > >> Add support for programming PMU counters and reading their values
>> > >> in runtime bypassing kernel completely.
>> > >>
>> > >> This is especially useful in cases where CPU cores are isolated
>> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> > >> standard perf utility without sacrificing latency and performance.
>> > >>
>> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> > >> ---
>> > >
>> > >If you insist on passing lcore_id around as a function parameter,
>> > >the
>> > function description must
>> > >mention that the lcore_id parameter must be set to rte_lcore_id()
>> > >for
>> > the functions where this is a
>> > >requirement, including all functions that use those functions.
>>
>> Perhaps I'm stating this wrong, so let me try to rephrase:
>>
>> As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>>
>> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to
>call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a
>parameter, which you at the bottom of this email [1] explained is not possible.
>>
>> [1]:
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__inbox.dpdk.org_dev
>> _DM4PR18MB4368461EC42603F77A7DC1BCD2E09-40DM4PR18MB4368.namprd18.prod.
>> outlook.com_&d=DwIDAw&c=nKjWec2b6R0mOyPaz7xtfQ&r=PZNXgrbjdlXxVEEGYkxIx
>> RndyEUwWU_ad5ce22YI6Is&m=wEvFmuH_S_EhAgRZQTC7z3pQ1Sr_cEsbFAXxgE2Fi2ESd
>> 4sMgg-tgVOVDepp-JYO&s=wU4g1LLV4EHyRYpj2inWOK8MDcUKq7txrZ7RXZhUM2I&e=
>>
>> > >
>> >
>> > Not sure why are you insisting so much on removing that rte_lcore_id().
>> > Yes that macro evaluates
>> > to integer but if you don't think about internals this resembles a
>> > function call.
>>
>> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>>
>> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>>
>> >
>> > Then natural pattern is to call it once and reuse results if possible.
>>
>> Yes, and I would usually agree to using this pattern.
>>
>> > Passing lcore_id around
>> > implies that calls are per l-core, why would that confuse anyone
>> > reading that code?
>>
>> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>>
>> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>>
>> [2]:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.co
>> m_dpdk_latest_source_lib_mempool_rte-5Fmempool.h-23L1315&d=DwIDAw&c=nK
>> jWec2b6R0mOyPaz7xtfQ&r=PZNXgrbjdlXxVEEGYkxIxRndyEUwWU_ad5ce22YI6Is&m=w
>> EvFmuH_S_EhAgRZQTC7z3pQ1Sr_cEsbFAXxgE2Fi2ESd4sMgg-tgVOVDepp-JYO&s=Ayyj
>> pEtATWUHfWnGMn5j2XDLMjgxxJTh5gQV0m77z5Q&e=
>>
>> >
>> > Besides, all functions taking it are internal stuff hence you cannot
>> > call it elsewhere.
>>
>> OK. I agree that this reduces the risk of incorrect use.
>>
>> Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>>
>> And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>>
>> It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>>
>I would tend to agree with you Morten. If the lcore_id parameter to the function must be
>rte_lcore_id(), then I think it's error prone to have that as an explicit parameter, and that the
>function should always get the core id itself.
>
>Other possible complication is - how does this work with threads that are not pinned to a
>particular physical core? Do things work as expected in that case?
>

It's assumed that once set of counters is enabled on particular l-core then this thread shouldn't be migrating 
back and for the obvious reasons. 

But, once scheduled elsewhere all should still work as expected. 

>/Bruce
  
Morten Brørup Feb. 2, 2023, 2:27 p.m. UTC | #9
> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 16.28
> 
> >From: Bruce Richardson <bruce.richardson@intel.com>
> >Sent: Thursday, January 26, 2023 1:59 PM
> >
> >Other possible complication is - how does this work with threads that
> are not pinned to a
> >particular physical core? Do things work as expected in that case?
> >
> 
> It's assumed that once set of counters is enabled on particular l-core
> then this thread shouldn't be migrating
> back and for the obvious reasons.
> 
> But, once scheduled elsewhere all should still work as expected.

Just to elaborate what Tomasz stated here...

The patch contains this line (code comments are mine):

return syscall(SYS_perf_event_open, &attr, /*pid*/ 0, /*cpu*/ -1, group_fd, 0);

And man 2 perf_event_open [1] says:

pid == 0 and cpu == -1:
	This measures the calling process/thread on any CPU.

So it should work just fine, even when a thread is not pinned to a particular physical core.

[1]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
  

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@  M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..b2c2a618b1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -360,6 +360,10 @@  if dpdk_conf.has('RTE_LIB_METRICS')
     test_sources += ['test_metrics.c']
     fast_tests += [['metrics_autotest', true, true]]
 endif
+if is_linux
+    test_sources += ['test_pmu.c']
+    fast_tests += [['pmu_autotest', true, true]]
+endif
 if not is_windows and dpdk_conf.has('RTE_LIB_TELEMETRY')
     test_sources += ['test_telemetry_json.c', 'test_telemetry_data.c']
     fast_tests += [['telemetry_json_autotest', true, true]]
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..7c3cf18ed9
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,42 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@  The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@  INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@  Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+Majority of architectures support some sort hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 92ec1e4b88..f43bd62376 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -57,6 +57,13 @@  New Features
 
 * **Added multi-process support for axgbe PMD.**
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 
 Removed Items
 -------------
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@ 
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..f8369b9dc7
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,436 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(lcore_id);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(lcore_id);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL)
+			return -ENOMEM;
+	}
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	unsigned int i;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	for (i = 0; i < rte_pmu.num_group_events; i++)
+		cleanup_events(i);
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..42c764fa9e
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,206 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_lcore.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	group = &rte_pmu.group[lcore_id];
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..e15e21156a
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,19 @@ 
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};