
[v4,0/4] Introduce Topology NUMA grouping for lcores

Message ID 20241105102849.1947-1-vipin.varghese@amd.com (mailing list archive)

Message

Varghese, Vipin Nov. 5, 2024, 10:28 a.m. UTC
This patch series introduces improvements for NUMA topology awareness in
relation to DPDK logical cores. The goal is to expose an API that allows
users to select optimal logical cores for any application. These logical
cores can be selected from various NUMA domains, such as CPU and I/O.

Change Summary:
 - Introduces the concept of NUMA domain partitioning based on CPU and
   I/O topology.
 - Adds support for grouping DPDK logical cores within the same Cache
   and I/O domain for improved locality.
 - Implements topology detection and core grouping logic that
   distinguishes between the following NUMA configurations:
    * CPU topology & I/O topology (e.g., AMD SoC EPYC, Intel Xeon SPR)
    * CPU+I/O topology (e.g., Ampere One with SLC, Intel Xeon SPR with SNC)
 - Enhances performance by minimizing lcore dispersion across tiles or
   compute packages with different L2/L3 cache or I/O domains.

Reason:
 - Applications using DPDK libraries rely on consistent memory access.
 - Lcores should sit in the same NUMA domain as the I/O they use.
 - Lcores should share the same cache.

Latency is minimized by using lcores that share the same NUMA topology.
Memory access is optimized by utilizing cores within the same NUMA
domain or tile. Cache coherence is preserved within the same shared cache
domain, reducing remote accesses from other tiles or compute packages via
snooping (a local hit in either L2 or L3 within the same NUMA domain).

Library dependency: hwloc

Topology Flags:
---------------
 - RTE_LCORE_DOMAIN_L1: group cores sharing the same L1 cache
 - RTE_LCORE_DOMAIN_SMT: same as RTE_LCORE_DOMAIN_L1
 - RTE_LCORE_DOMAIN_L2: group cores sharing the same L2 cache
 - RTE_LCORE_DOMAIN_L3: group cores sharing the same L3 cache
 - RTE_LCORE_DOMAIN_L4: group cores sharing the same L4 cache
 - RTE_LCORE_DOMAIN_IO: group cores sharing the same I/O domain

< Function: Purpose >
---------------------
 - rte_get_domain_count: get the domain count for a given topology flag
 - rte_lcore_count_from_domain: get the count of valid lcores in each domain
 - rte_get_lcore_in_domain: get a valid lcore id by index within a domain
 - rte_lcore_cpuset_in_domain: return the cpuset of the lcore at a given index
 - rte_lcore_is_main_in_domain: return true|false whether the main lcore is
   in the domain
 - rte_get_next_lcore_from_domain: next valid lcore within a domain
 - rte_get_next_lcore_from_next_domain: next valid lcore from the next domain
   (usage sketch below)
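
A minimal usage sketch of these functions, enumerating lcores per L3 domain
(argument lists are assumed from the descriptions above; patch 1/4 carries
the actual declarations):

```
/* Sketch only: function names come from this cover letter; exact
 * signatures are assumed and may differ from patch 1/4. */
#include <stdio.h>
#include <rte_lcore.h>

static void
dump_l3_domains(void)
{
	unsigned int dom, idx;
	unsigned int ndom = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

	for (dom = 0; dom < ndom; dom++) {
		unsigned int ncores =
			rte_lcore_count_from_domain(RTE_LCORE_DOMAIN_L3, dom);

		printf("L3 domain %u holds %u lcores\n", dom, ncores);
		for (idx = 0; idx < ncores; idx++)
			printf("  lcore %u\n",
			       rte_get_lcore_in_domain(RTE_LCORE_DOMAIN_L3,
						       dom, idx));
	}
}
```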

Note:
 1. Topology refers to the NUMA grouping.
 2. A domain is one of the sub-groups within a specific topology.

Topology example: L1, L2, L3, L4, IO
Domain example: IO-A, IO-B

< MACRO: Purpose >
------------------
 - RTE_LCORE_FOREACH_DOMAIN: iterate over lcores from all domains
 - RTE_LCORE_FOREACH_WORKER_DOMAIN: iterate over worker lcores from all domains
 - RTE_LCORE_FORN_NEXT_DOMAIN: iterate over domains, selecting the n'th lcore
   in each
 - RTE_LCORE_FORN_WORKER_NEXT_DOMAIN: iterate over domains, selecting the n'th
   worker lcore in each (usage sketch below)
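
A sketch of launching workers per L3 domain with the proposed macros; the
macro arguments (an lcore id variable plus a topology flag) are assumed here,
see patch 1/4 for the real definition:

```
#include <rte_launch.h>
#include <rte_lcore.h>

static int
worker_main(void *arg)
{
	(void)arg;
	/* workers launched in domain order sit next to lcores sharing
	 * the same L3 cache */
	return 0;
}

static void
launch_workers_per_l3(void)
{
	unsigned int lcore_id;

	/* assumed usage: iterate worker lcores grouped by L3 domain */
	RTE_LCORE_FOREACH_WORKER_DOMAIN(lcore_id, RTE_LCORE_DOMAIN_L3)
		rte_eal_remote_launch(worker_main, NULL, lcore_id);
}
```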

Future work (after merge):
--------------------------
 - dma-perf per IO NUMA
 - eventdev per L3 NUMA
 - pipeline per SMT|L3 NUMA
 - distributor per L3 for Port-Queue
 - l2fwd-power per SMT
 - testpmd option for IO NUMA per port

Platform tested on:
-------------------
 - INTEL(R) XEON(R) PLATINUM 8562Y+ (supports IO NUMA 1 & 2)
 - AMD EPYC 8534P (supports IO NUMA 1 & 2)
 - AMD EPYC 9554 (supports IO NUMA 1, 2, 4)

Logs:
-----
1. INTEL(R) XEON(R) PLATINUM 8562Y+:
 - SNC=1
        Domain (IO): at index (0) there are 48 core, with (0) at index 0
 - SNC=2
        Domain (IO): at index (0) there are 24 core, with (0) at index 0
        Domain (IO): at index (1) there are 24 core, with (12) at index 0

2. AMD EPYC 8534P:
 - NPS=1:
        Domain (IO): at index (0) there are 128 core, with (0) at index 0
 - NPS=2:
        Domain (IO): at index (0) there are 64 core, with (0) at index 0
        Domain (IO): at index (1) there are 64 core, with (32) at index 0

Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>

Vipin Varghese (4):
  eal/lcore: add topology based functions
  test/lcore: enable tests for topology
  doc: add topology grouping details
  examples: update with lcore topology API

 app/test/test_lcores.c                        | 528 +++++++++++++
 config/meson.build                            |  18 +
 .../prog_guide/env_abstraction_layer.rst      |  22 +
 examples/helloworld/main.c                    | 154 +++-
 examples/l2fwd/main.c                         |  56 +-
 examples/skeleton/basicfwd.c                  |  22 +
 lib/eal/common/eal_common_lcore.c             | 714 ++++++++++++++++++
 lib/eal/common/eal_private.h                  |  58 ++
 lib/eal/freebsd/eal.c                         |  10 +
 lib/eal/include/rte_lcore.h                   | 209 +++++
 lib/eal/linux/eal.c                           |  11 +
 lib/eal/meson.build                           |   4 +
 lib/eal/version.map                           |  11 +
 lib/eal/windows/eal.c                         |  12 +
 14 files changed, 1819 insertions(+), 10 deletions(-)
  

Comments

Varghese, Vipin Feb. 13, 2025, 3:09 a.m. UTC | #1
[AMD Official Use Only - AMD Internal Distribution Only]

Adding Thomas and Ajit to the loop.

Hi Ajit, we have been using this patch series to identify the topology and obtain the L3 cache id for populating the steering tag for the Device Specific Model & MSI-X driven AF_XDP with the experimental STAG firmware on Thor.

Hence the current use of the topology library helps in
1. workload placement in the same cache or I/O domain
2. populating the id for MSI-X or the Device Specific Model for steering tags.

Thomas and Ajit, can we get some help to get this into mainline too?



  
Thomas Monjalon Feb. 13, 2025, 8:34 a.m. UTC | #2
13/02/2025 04:09, Varghese, Vipin:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> Adding Thomas and Ajit to the loop.
> 
> Hi Ajit, we have been using the patch series for identifying the topology and getting l3 cache id for populating the steering tag for Device Specific Model & MSI-x driven af-xdp for the experimental STAG firmware on Thor.
> 
> Hence current use of topology library helps in
> 1. workload placement in same Cache or IO domain
> 2. populating id for MSIx or Device specific model for steering tags.
> 
> Thomas and Ajith can we get some help to get this mainline too?

Yes, sorry the review discussions did not start.
It has been forgotten.

You could rebase a v2 to make it more visible.

Minor note: the changelog should be after --- in the commit log.
  
Morten Brørup Feb. 13, 2025, 9:18 a.m. UTC | #3
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 13 February 2025 09.34
> 
> 13/02/2025 04:09, Varghese, Vipin:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Adding Thomas and Ajit to the loop.
> >
> > Hi Ajit, we have been using the patch series for identifying the
> topology and getting l3 cache id for populating the steering tag for
> Device Specific Model & MSI-x driven af-xdp for the experimental STAG
> firmware on Thor.

Excellent. A real life example use case helps the review process a lot!

> >
> > Hence current use of topology library helps in
> > 1. workload placement in same Cache or IO domain
> > 2. populating id for MSIx or Device specific model for steering tags.
> >
> > Thomas and Ajith can we get some help to get this mainline too?
> 
> Yes, sorry the review discussions did not start.
> It has been forgotten.

I think the topology/domain API in the EAL should be co-designed with the steering tag API in the ethdev library, so the design can be reviewed/discussed in its entirety.

To help the review discussion, please consider describing the following:
Which APIs are for slow path, and which are for fast path?
Which APIs are "must have", i.e. core to making it work at all, and which APIs are "nice to have", i.e. support APIs to ease use of the new features?

I haven't looked at the hwloc library's API; but I guess these new EAL functions are closely related. Is it a thin wrapper around the hwloc library, or is it very different?
  
Varghese, Vipin March 3, 2025, 8:59 a.m. UTC | #4
[Public]

Hi Thomas

snipped

> >
> > Thomas and Ajith can we get some help to get this mainline too?
>
> Yes, sorry the review discussions did not start.
> It has been forgotten.
>
> You could rebase a v2 to make it more visible.
Sure will do this week.

>
> Minor note: the changelog should be after --- in the commit log.
>
  
Varghese, Vipin March 3, 2025, 9:06 a.m. UTC | #5
[Public]

Hi Morten,

snipped


>
>
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, 13 February 2025 09.34
> >
> > 13/02/2025 04:09, Varghese, Vipin:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > Adding Thomas and Ajit to the loop.
> > >
> > > Hi Ajit, we have been using the patch series for identifying the
> > topology and getting l3 cache id for populating the steering tag for
> > Device Specific Model & MSI-x driven af-xdp for the experimental STAG
> > firmware on Thor.
>
> Excellent. A real life example use case helps the review process a lot!

The steering tag is one example of its use; as shared in the current patch series, we make use of these in other examples too.
Eventdev, pkt-distributor and graph nodes are also in the works to exploit L2|L3 cache locality.

>
> > >
> > > Hence current use of topology library helps in 1. workload placement
> > > in same Cache or IO domain 2. populating id for MSIx or Device
> > > specific model for steering tags.
> > >
> > > Thomas and Ajith can we get some help to get this mainline too?
> >
> > Yes, sorry the review discussions did not start.
> > It has been forgotten.
>
> I think the topology/domain API in the EAL should be co-designed with the steering
> tag API in the ethdev library, so the design can be reviewed/discussed in its entirety.

As shared in the discussion, we have been exploring a simplified approach for steering tags, namely:

1. pci-dev args (crude way)
2. flow API for RX (experimental way)

Based on the platform (in the case of AMD EPYC, these are translated to `L3 id + 1`).

We do agree the rte_ethdev library can use the topology API. The current topology APIs are designed to be independent of steering tags, as the other examples make use of them as well.

>
> To help the review discussion, please consider describing the following:
> Which APIs are for slow path, and which are for fast path?
> Which APIs are "must have", i.e. core to making it work at all, and which APIs are
> "nice to have", i.e. support APIs to ease use of the new features?

Yes, will try to do the same in the updated version. For the slow-path and fast-path API split I might need some help, as I was under the impression the current behavior is the same as rte_lcore (invoked at setup and before remote launch). But will check again.

>
> I haven't looked at the hwloc library's API; but I guess these new EAL functions are
> closely related. Is it a thin wrapper around the hwloc library, or is it very different?
This is a very thin wrapper on top of the hwloc library, but with DPDK RTE_MAX_LCORE & RTE_NUMA boundary checks and population.
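
For illustration only, a minimal sketch of what such a thin wrapper looks like when grouping CPUs per L3 cache directly through hwloc (assumes hwloc >= 2.0); the actual EAL code additionally clamps the result to RTE_MAX_LCORE and the configured lcore map:

```
#include <stdio.h>
#include <hwloc.h>

static void
dump_l3_groups(void)
{
	hwloc_topology_t topo;
	int i, nl3;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	nl3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
	for (i = 0; i < nl3; i++) {
		hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
		unsigned int cpu;

		printf("L3 #%d:", i);
		/* walk the OS CPU ids covered by this L3 cache object */
		hwloc_bitmap_foreach_begin(cpu, l3->cpuset)
			printf(" %u", cpu);
		hwloc_bitmap_foreach_end();
		printf("\n");
	}
	hwloc_topology_destroy(topo);
}
```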
  
Morten Brørup March 4, 2025, 10:08 a.m. UTC | #6
> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
> Sent: Monday, 3 March 2025 10.06
> 
> [Public]
> 
> Hi Morten,
> 
> snipped
> 
> >
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 13 February 2025 09.34
> > >
> > > 13/02/2025 04:09, Varghese, Vipin:
> > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > >
> > > > Adding Thomas and Ajit to the loop.
> > > >
> > > > Hi Ajit, we have been using the patch series for identifying the
> > > topology and getting l3 cache id for populating the steering tag
> for
> > > Device Specific Model & MSI-x driven af-xdp for the experimental
> STAG
> > > firmware on Thor.
> >
> > Excellent. A real life example use case helps the review process a
> lot!
> 
> Steering tag is one of the examples or uses, as shared in the current
> patch series  we make use of these for other examples too.
> Eventdev, pkt-distributor and graph nodes are also in works to exploit
> L2|L3 cache local coherency too.
> 
> >
> > > >
> > > > Hence current use of topology library helps in 1. workload
> placement
> > > > in same Cache or IO domain 2. populating id for MSIx or Device
> > > > specific model for steering tags.
> > > >
> > > > Thomas and Ajith can we get some help to get this mainline too?
> > >
> > > Yes, sorry the review discussions did not start.
> > > It has been forgotten.
> >
> > I think the topology/domain API in the EAL should be co-designed with
> the steering
> > tag API in the ethdev library, so the design can be
> reviewed/discussed in its entirety.
> 
> As shared in the discussion, we have been exploring simplified approach
> for steering tags, namely
> 
> 1. pci-dev args (crude way)
> 2. flow api for RX (experimental way)
> 
> Based on the platform (in case of AMD EPYC, these are translated to `L3
> id + 1`)
> 
> We do agree rte_ethdev library can use topology API. Current topology
> API are designed to be made independent from steering tags, as other
> examples do make use of the same.
> 
> >
> > To help the review discussion, please consider describing the
> following:
> > Which APIs are for slow path, and which are for fast path?
> > Which APIs are "must have", i.e. core to making it work at all, and
> which APIs are
> > "nice to have", i.e. support APIs to ease use of the new features?
> 
> Yes, will try to do the same in updated version. For Slow and Fast path
> API I might need some help, as I was under the impression current
> behavior is same rte_lcore (invoked at setup and before remote launch).
> But will check again.

Probably they are all used for configuration only, and thus all slow path; but if there are any fast path APIs, they should be highlighted as such.

> 
> >
> > I haven't looked at the hwloc library's API; but I guess these new
> EAL functions are
> > closely related. Is it a thin wrapper around the hwloc library, or is
> it very different?
> This is very thin wrapper on top of hwloc library only. But with DPDK
> RTE_MAX_LCORE & RTE_NUMA boundary check and population.

OK. The hwloc library is portable across Linux, BSD and Windows, which is great!

Please also describe the benefits of using this DPDK library, compared to directly using the hwloc library.
  
Mattias Rönnblom March 5, 2025, 7:43 a.m. UTC | #7
On 2025-03-04 11:08, Morten Brørup wrote:
>> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
>> Sent: Monday, 3 March 2025 10.06
>>
>> [Public]
>>
>> Hi Morten,
>>
>> snipped
>>
>>>
>>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>> Sent: Thursday, 13 February 2025 09.34
>>>>
>>>> 13/02/2025 04:09, Varghese, Vipin:
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>> Adding Thomas and Ajit to the loop.
>>>>>
>>>>> Hi Ajit, we have been using the patch series for identifying the
>>>> topology and getting l3 cache id for populating the steering tag
>> for
>>>> Device Specific Model & MSI-x driven af-xdp for the experimental
>> STAG
>>>> firmware on Thor.
>>>
>>> Excellent. A real life example use case helps the review process a
>> lot!
>>
>> Steering tag is one of the examples or uses, as shared in the current
>> patch series  we make use of these for other examples too.
>> Eventdev, pkt-distributor and graph nodes are also in works to exploit
>> L2|L3 cache local coherency too.
>>
>>>
>>>>>
>>>>> Hence current use of topology library helps in 1. workload
>> placement
>>>>> in same Cache or IO domain 2. populating id for MSIx or Device
>>>>> specific model for steering tags.
>>>>>
>>>>> Thomas and Ajith can we get some help to get this mainline too?
>>>>
>>>> Yes, sorry the review discussions did not start.
>>>> It has been forgotten.
>>>
>>> I think the topology/domain API in the EAL should be co-designed with
>> the steering
>>> tag API in the ethdev library, so the design can be
>> reviewed/discussed in its entirety.
>>
>> As shared in the discussion, we have been exploring simplified approach
>> for steering tags, namely
>>
>> 1. pci-dev args (crude way)
>> 2. flow api for RX (experimental way)
>>
>> Based on the platform (in case of AMD EPYC, these are translated to `L3
>> id + 1`)
>>
>> We do agree rte_ethdev library can use topology API. Current topology
>> API are designed to be made independent from steering tags, as other
>> examples do make use of the same.
>>
>>>
>>> To help the review discussion, please consider describing the
>> following:
>>> Which APIs are for slow path, and which are for fast path?
>>> Which APIs are "must have", i.e. core to making it work at all, and
>> which APIs are
>>> "nice to have", i.e. support APIs to ease use of the new features?
>>
>> Yes, will try to do the same in updated version. For Slow and Fast path
>> API I might need some help, as I was under the impression current
>> behavior is same rte_lcore (invoked at setup and before remote launch).
>> But will check again.
> 
> Probably they are all used for configuration only, and thus all slow path; but if there are any fast path APIs, they should be highlighted as such.
> 

Preferably, software work schedulers like DSW should be able to read 
topology information during run-time/steady-state operation. If topology 
APIs are slow or non-MT-safe, they will need to build up their own data 
structures for such information (which is not a crazy idea, but leads to
duplication).

I didn't follow the hwloc discussions, so I may lack some context for 
this discussion.

>>
>>>
>>> I haven't looked at the hwloc library's API; but I guess these new
>> EAL functions are
>>> closely related. Is it a thin wrapper around the hwloc library, or is
>> it very different?
>> This is very thin wrapper on top of hwloc library only. But with DPDK
>> RTE_MAX_LCORE & RTE_NUMA boundary check and population.
> 
> OK. The hwloc library is portable across Linux, BSD and Windows, which is great!
> 
> Please also describe the benefits of using this DPDK library, compared to directly using the hwloc library.
>
  
Jan Viktorin March 17, 2025, 1:46 p.m. UTC | #8
Hello Vipin and others,

please, will there be any progress or update on this series?

I successfully tested those changes on our Intel and AMD machines and
would like to use them in production soon.

The API is a little bit unintuitive, at least for me, but I
successfully integrated it into our software.

I am missing a clear relation to the NUMA socket approach used in DPDK.
E.g. I would like to be able to easily walk over a list of lcores from
a specific NUMA node grouped by L3 domain. Yes, there is the
RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket
IDs?

Also, I do not clearly understand what is the purpose of using domain
selector like:

  RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2

or even:

  RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2

the documentation does not explain this. I could not spot any kind of
grouping that would help me in any way. Some "best practices" examples
would be nice to have to understand the intentions better.

I found a little catch when running DPDK with more lcores than there
are physical or SMT CPU cores. This happens when using e.g. an option
like --lcores=(0-15)@(0-1). The results from the topology API would not
match the lcores because hwloc is not aware of the lcores concept. This
might be mentioned somewhere.

Anyway, I really appreciate this work and would like to see it upstream.
Especially for AMD machines, some framework like this is a must.

Kind regards,
Jan

  
Varghese, Vipin April 9, 2025, 10:08 a.m. UTC | #9
[AMD Official Use Only - AMD Internal Distribution Only]

Snipped

>
> Hello Vipin and others,
>
> please, will there be any progress or update on this series?

Apologies, we gave a small update on Slack and missed this out here. Let me try to address your questions below.

>
> I successfully tested those changes on our Intel and AMD machines and would like
> to use it in production soon.
>
> The API is a little bit unintuitive, at least for me, but I successfully integrated into our
> software.
>
> I am missing a clear relation to the NUMA socket approach used in DPDK.
> E.g. I would like to be able to easily walk over a list of lcores from a specific NUMA
> node grouped by L3 domain. Yes, there is the RTE_LCORE_DOMAIN_IO, but would
> it always match the appropriate socket IDs?

Yes, we at AMD were internally debating the same. But since the lcore API already exposes `rte_lcore_to_socket_id`, adding yet another variation or argument seemed to lack luster.
Hence we debated internally: when using the new API, why not check whether it is the desired physical socket or sub-socket NUMA domain?

Hence, we did not add the option.

>
> Also, I do not clearly understand what is the purpose of using domain selector like:
>
>   RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2
>
> or even:
>
>   RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2

I believe we have mentioned in the documents to choose one; if multiple flags are combined, only one will be picked up based on the code flow.

The real use of these is to select physical cores under the same cache or I/O domain.
Example: certain SoCs have 4 cores sharing an L2, which makes pipeline processing more convenient (less data movement). In such cases, select lcores within the same L2 topology, for example as in the sketch below.
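
A sketch of that selection, assuming rte_get_next_lcore_from_domain() mirrors rte_get_next_lcore() with an added topology flag (the real signature is in patch 1/4); NB_PIPELINE_STAGES and assign_stage() are hypothetical application helpers:

```
#include <rte_lcore.h>

#define NB_PIPELINE_STAGES 4	/* hypothetical application constant */

/* hypothetical application helper that binds a pipeline stage to an lcore */
static void assign_stage(unsigned int stage, unsigned int lcore);

static void
map_pipeline_to_one_l2(void)
{
	unsigned int stage = 0;
	unsigned int lcore;

	for (lcore = rte_get_next_lcore_from_domain(-1, 1, 0, RTE_LCORE_DOMAIN_L2);
	     lcore < RTE_MAX_LCORE && stage < NB_PIPELINE_STAGES;
	     lcore = rte_get_next_lcore_from_domain(lcore, 1, 0, RTE_LCORE_DOMAIN_L2)) {
		/* consecutive lcores returned here share the same L2, so
		 * adjacent pipeline stages exchange data through local cache */
		assign_stage(stage++, lcore);
	}
}
```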

>
> the documentation does not explain this. I could not spot any kind of grouping that
> would help me in any way. Some "best practices" examples would be nice to have to
> understand the intentions better.

From https://patches.dpdk.org/project/dpdk/cover/20241105102849.1947-1-vipin.varghese@amd.com/

```
Reason:
 - Applications using DPDK libraries relies on consistent memory access.
 - Lcores being closer to same NUMA domain as IO.
 - Lcores sharing same cache.

Latency is minimized by using lcores that share the same NUMA topology.
Memory access is optimized by utilizing cores within the same NUMA
domain or tile. Cache coherence is preserved within the same shared cache
domain, reducing the remote access from tile|compute package via snooping
(local hit in either L2 or L3 within same NUMA domain).
```

>
> I found a little catch when running DPDK with more lcores than there are physical or
> SMT CPU cores. This happens when using e.g. an option like --lcores=(0-15)@(0-1).
> The results from the topology API would not match the lcores because hwloc is not
> aware of the lcores concept. This might be mentioned somewhere.

Yes, this is expected, as one can map any CPU cores to DPDK lcores with `lcore-map`.
We mentioned this in RFC v4, but missed carrying it over when upgrading to RFC v5.

>
> Anyway, I really appreciate this work and would like to see it upstream.
> Especially for AMD machines, some framework like this is a must.
>
> Kind regards,
> Jan
>

We are planning to remove the RFC tag and share the final version for the upcoming DPDK release shortly.
  
Varghese, Vipin June 3, 2025, 6:03 a.m. UTC | #10
[Public]

Hi All,

Sharing the next version of the `rte_topology_` API patch, targeted for the upcoming release.
As extras, it adds support for L2 and L3 cache IDs for cache-line stashing and Code Data Prioritization too.

Snipped
