Message ID: 20241105102849.1947-1-vipin.varghese@amd.com (mailing list archive)
From: Vipin Varghese <vipin.varghese@amd.com>
To: dev@dpdk.org, roretzla@linux.microsoft.com, bruce.richardson@intel.com, john.mcnamara@intel.com, dmitry.kozliuk@gmail.com
Cc: pbhagavatula@marvell.com, jerinj@marvell.com, ruifeng.wang@arm.com, mattias.ronnblom@ericsson.com, anatoly.burakov@intel.com, stephen@networkplumber.org, ferruh.yigit@amd.com, honnappa.nagarahalli@arm.com, wathsala.vithanage@arm.com, konstantin.ananyev@huawei.com, mb@smartsharesystems.com
Subject: [PATCH v4 0/4] Introduce Topology NUMA grouping for lcores
Date: Tue, 5 Nov 2024 15:58:45 +0530
Series: Introduce Topology NUMA grouping for lcores
Message
Varghese, Vipin
Nov. 5, 2024, 10:28 a.m. UTC
This patch series introduces improvements for NUMA topology awareness in
relation to DPDK logical cores. The goal is to expose an API that allows
users to select optimal logical cores for any application. These logical
cores can be selected from various NUMA domains, such as CPU and I/O.
Change Summary:
- Introduces the concept of NUMA domain partitioning based on CPU and
I/O topology.
- Adds support for grouping DPDK logical cores within the same Cache
and I/O domain for improved locality.
- Implements topology detection and core grouping logic that
distinguishes between the following NUMA configurations:
* CPU topology & I/O topology (e.g., AMD SoC EPYC, Intel Xeon SPR)
* CPU+I/O topology (e.g., Ampere One with SLC, Intel Xeon SPR with SNC)
- Enhances performance by minimizing lcore dispersion across tiles or compute
packages with different L2/L3 cache or IO domains.
Reason:
- Applications using DPDK libraries rely on consistent memory access.
- Lcores should be close to the same NUMA domain as the I/O.
- Lcores should share the same cache.
Latency is minimized by using lcores that share the same NUMA topology.
Memory access is optimized by utilizing cores within the same NUMA
domain or tile. Cache coherence is preserved within the same shared cache
domain, reducing remote accesses from other tiles or compute packages via
snooping (a local hit in either L2 or L3 within the same NUMA domain).
Library dependency: hwloc
Topology Flags:
---------------
- RTE_LCORE_DOMAIN_L1: group cores sharing the same L1 cache
- RTE_LCORE_DOMAIN_SMT: same as RTE_LCORE_DOMAIN_L1
- RTE_LCORE_DOMAIN_L2: group cores sharing the same L2 cache
- RTE_LCORE_DOMAIN_L3: group cores sharing the same L3 cache
- RTE_LCORE_DOMAIN_L4: group cores sharing the same L4 cache
- RTE_LCORE_DOMAIN_IO: group cores sharing the same IO domain
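As an illustration only, a minimal sketch that prints how many domains exist at each level; it assumes rte_get_domain_count() (listed in the function list below) accepts one of these flags and returns the number of domains found at that level:

```
#include <stdio.h>
#include <rte_lcore.h>

/* Sketch only: the return type and single-flag argument of
 * rte_get_domain_count() are assumptions based on its one-line
 * description in this cover letter.
 */
static void
print_topology_summary(void)
{
	printf("L2 domains: %u\n", rte_get_domain_count(RTE_LCORE_DOMAIN_L2));
	printf("L3 domains: %u\n", rte_get_domain_count(RTE_LCORE_DOMAIN_L3));
	printf("IO domains: %u\n", rte_get_domain_count(RTE_LCORE_DOMAIN_IO));
}
```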
< Function: Purpose >
---------------------
- rte_get_domain_count: get the domain count based on a topology flag
- rte_lcore_count_from_domain: get the valid lcore count under each domain
- rte_get_lcore_in_domain: get a valid lcore id based on index
- rte_lcore_cpuset_in_domain: return a valid cpuset based on index
- rte_lcore_is_main_in_domain: return true|false if the main lcore is present in the domain
- rte_get_next_lcore_from_domain: next valid lcore within a domain
- rte_get_next_lcore_from_next_domain: next valid lcore from the next domain
Note:
1. A topology is a NUMA grouping.
2. A domain is one of the sub-groups within a specific topology.
Topology examples: L1, L2, L3, L4, IO
Domain examples: IO-A, IO-B
< MACRO: Purpose >
------------------
- RTE_LCORE_FOREACH_DOMAIN: iterate lcores from all domains
- RTE_LCORE_FOREACH_WORKER_DOMAIN: iterate worker lcores from all domains
- RTE_LCORE_FORN_NEXT_DOMAIN: iterate over domains, selecting the n'th lcore in each
- RTE_LCORE_FORN_WORKER_NEXT_DOMAIN: iterate over domains, selecting the n'th worker lcore in each
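A minimal sketch of the iteration macros, assuming RTE_LCORE_FOREACH_DOMAIN takes an lcore iterator variable and a topology flag, mirroring the existing RTE_LCORE_FOREACH(); the real definition lives in the patches themselves:

```
#include <stdio.h>
#include <rte_lcore.h>

/* Sketch only: the macro parameters are an assumption; only the flag and
 * macro names come from this cover letter.
 */
static void
dump_l3_grouped_lcores(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_DOMAIN(lcore_id, RTE_LCORE_DOMAIN_L3)
		printf("lcore %u (socket %u) visited in L3-domain order\n",
		       lcore_id, rte_lcore_to_socket_id(lcore_id));
}
```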
Future work (after merge):
--------------------------
- dma-perf per IO NUMA
- eventdev per L3 NUMA
- pipeline per SMT|L3 NUMA
- distributor per L3 for Port-Queue
- l2fwd-power per SMT
- testpmd option for IO NUMA per port
Platform tested on:
-------------------
- INTEL(R) XEON(R) PLATINUM 8562Y+ (supports IO NUMA 1 & 2)
- AMD EPYC 8534P (supports IO NUMA 1 & 2)
- AMD EPYC 9554 (supports IO NUMA 1, 2, 4)
Logs:
-----
1. INTEL(R) XEON(R) PLATINUM 8562Y+:
- SNC=1
Domain (IO): at index (0) there are 48 core, with (0) at index 0
- SNC=2
Domain (IO): at index (0) there are 24 core, with (0) at index 0
Domain (IO): at index (1) there are 24 core, with (12) at index 0
2. AMD EPYC 8534P:
- NPS=1:
Domain (IO): at index (0) there are 128 core, with (0) at index 0
- NPS=2:
Domain (IO): at index (0) there are 64 core, with (0) at index 0
Domain (IO): at index (1) there are 64 core, with (32) at index 0
Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
Vipin Varghese (4):
eal/lcore: add topology based functions
test/lcore: enable tests for topology
doc: add topology grouping details
examples: update with lcore topology API
app/test/test_lcores.c | 528 +++++++++++++
config/meson.build | 18 +
.../prog_guide/env_abstraction_layer.rst | 22 +
examples/helloworld/main.c | 154 +++-
examples/l2fwd/main.c | 56 +-
examples/skeleton/basicfwd.c | 22 +
lib/eal/common/eal_common_lcore.c | 714 ++++++++++++++++++
lib/eal/common/eal_private.h | 58 ++
lib/eal/freebsd/eal.c | 10 +
lib/eal/include/rte_lcore.h | 209 +++++
lib/eal/linux/eal.c | 11 +
lib/eal/meson.build | 4 +
lib/eal/version.map | 11 +
lib/eal/windows/eal.c | 12 +
14 files changed, 1819 insertions(+), 10 deletions(-)
Comments
[AMD Official Use Only - AMD Internal Distribution Only]

Adding Thomas and Ajit to the loop.

Hi Ajit, we have been using the patch series for identifying the topology and getting the L3 cache id for populating the steering tag for the Device Specific Model & MSI-x driven af-xdp for the experimental STAG firmware on Thor.

Hence the current use of the topology library helps in
1. workload placement in the same Cache or IO domain
2. populating the id for MSIx or the Device Specific Model for steering tags.

Thomas and Ajit, can we get some help to get this mainline too?

> -----Original Message-----
> From: Vipin Varghese <vipin.varghese@amd.com>
> Sent: Tuesday, November 5, 2024 3:59 PM
> To: dev@dpdk.org; roretzla@linux.microsoft.com; bruce.richardson@intel.com; john.mcnamara@intel.com; dmitry.kozliuk@gmail.com
> Subject: [PATCH v4 0/4] Introduce Topology NUMA grouping for lcores
>
> snipped
13/02/2025 04:09, Varghese, Vipin:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Adding Thomas and Ajit to the loop.
>
> Hi Ajit, we have been using the patch series for identifying the topology and getting the L3 cache id for populating the steering tag for the Device Specific Model & MSI-x driven af-xdp for the experimental STAG firmware on Thor.
>
> Hence the current use of the topology library helps in
> 1. workload placement in the same Cache or IO domain
> 2. populating the id for MSIx or the Device Specific Model for steering tags.
>
> Thomas and Ajit, can we get some help to get this mainline too?

Yes, sorry the review discussions did not start.
It has been forgotten.

You could rebase a v2 to make it more visible.

Minor note: the changelog should be after --- in the commit log.
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 13 February 2025 09.34
>
> 13/02/2025 04:09, Varghese, Vipin:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Adding Thomas and Ajit to the loop.
> >
> > Hi Ajit, we have been using the patch series for identifying the topology and getting the L3 cache id for populating the steering tag for the Device Specific Model & MSI-x driven af-xdp for the experimental STAG firmware on Thor.

Excellent. A real life example use case helps the review process a lot!

> > Hence the current use of the topology library helps in
> > 1. workload placement in the same Cache or IO domain
> > 2. populating the id for MSIx or the Device Specific Model for steering tags.
> >
> > Thomas and Ajit, can we get some help to get this mainline too?
>
> Yes, sorry the review discussions did not start.
> It has been forgotten.

I think the topology/domain API in the EAL should be co-designed with the steering tag API in the ethdev library, so the design can be reviewed/discussed in its entirety.

To help the review discussion, please consider describing the following:
Which APIs are for slow path, and which are for fast path?
Which APIs are "must have", i.e. core to making it work at all, and which APIs are "nice to have", i.e. support APIs to ease use of the new features?

I haven't looked at the hwloc library's API; but I guess these new EAL functions are closely related. Is it a thin wrapper around the hwloc library, or is it very different?
[Public]

Hi Thomas,

snipped

> > Thomas and Ajit, can we get some help to get this mainline too?
>
> Yes, sorry the review discussions did not start.
> It has been forgotten.
>
> You could rebase a v2 to make it more visible.

Sure, will do this week.

> Minor note: the changelog should be after --- in the commit log.
[Public]

Hi Morten,

snipped

> > > Hi Ajit, we have been using the patch series for identifying the topology and getting the L3 cache id for populating the steering tag for the Device Specific Model & MSI-x driven af-xdp for the experimental STAG firmware on Thor.
>
> Excellent. A real life example use case helps the review process a lot!

The steering tag is one of the uses; as shared in the current patch series, we make use of these for other examples too.
Eventdev, pkt-distributor and graph nodes are also in the works to exploit L2|L3 cache local coherency.

> > > Thomas and Ajit, can we get some help to get this mainline too?
> >
> > Yes, sorry the review discussions did not start.
> > It has been forgotten.
>
> I think the topology/domain API in the EAL should be co-designed with the steering tag API in the ethdev library, so the design can be reviewed/discussed in its entirety.

As shared in the discussion, we have been exploring a simplified approach for steering tags, namely

1. pci-dev args (crude way)
2. flow api for RX (experimental way)

Based on the platform (in the case of AMD EPYC, these are translated to `L3 id + 1`).

We do agree the rte_ethdev library can use the topology API. The current topology API is designed to be independent of steering tags, as the other examples make use of it as well.

> To help the review discussion, please consider describing the following:
> Which APIs are for slow path, and which are for fast path?
> Which APIs are "must have", i.e. core to making it work at all, and which APIs are "nice to have", i.e. support APIs to ease use of the new features?

Yes, will try to do the same in the updated version. For the slow and fast path APIs I might need some help, as I was under the impression the current behavior is the same as rte_lcore (invoked at setup and before remote launch). But will check again.

> I haven't looked at the hwloc library's API; but I guess these new EAL functions are closely related. Is it a thin wrapper around the hwloc library, or is it very different?

This is a very thin wrapper on top of the hwloc library only, but with DPDK RTE_MAX_LCORE & RTE_NUMA boundary checks and population.
> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
> Sent: Monday, 3 March 2025 10.06
>
> [Public]
>
> Hi Morten,
>
> snipped
>
> > To help the review discussion, please consider describing the following:
> > Which APIs are for slow path, and which are for fast path?
> > Which APIs are "must have", i.e. core to making it work at all, and which APIs are "nice to have", i.e. support APIs to ease use of the new features?
>
> Yes, will try to do the same in the updated version. For the slow and fast path APIs I might need some help, as I was under the impression the current behavior is the same as rte_lcore (invoked at setup and before remote launch). But will check again.

Probably they are all used for configuration only, and thus all slow path; but if there are any fast path APIs, they should be highlighted as such.

> > I haven't looked at the hwloc library's API; but I guess these new EAL functions are closely related. Is it a thin wrapper around the hwloc library, or is it very different?
>
> This is a very thin wrapper on top of the hwloc library only, but with DPDK RTE_MAX_LCORE & RTE_NUMA boundary checks and population.

OK. The hwloc library is portable across Linux, BSD and Windows, which is great!

Please also describe the benefits of using this DPDK library, compared to directly using the hwloc library.
On 2025-03-04 11:08, Morten Brørup wrote:
>> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
>> Sent: Monday, 3 March 2025 10.06
>>
>> snipped
>>
>> Yes, will try to do the same in the updated version. For the slow and fast path APIs I might need some help, as I was under the impression the current behavior is the same as rte_lcore (invoked at setup and before remote launch). But will check again.
>
> Probably they are all used for configuration only, and thus all slow path; but if there are any fast path APIs, they should be highlighted as such.

Preferably, software work schedulers like DSW should be able to read topology information during run-time/steady-state operation.

If topology APIs are slow or non-MT-safe, they will need to build up their own data structures for such information (which is not a crazy idea, but leads to duplication).

I didn't follow the hwloc discussions, so I may lack some context for this discussion.

>> This is a very thin wrapper on top of the hwloc library only, but with DPDK RTE_MAX_LCORE & RTE_NUMA boundary checks and population.
>
> OK. The hwloc library is portable across Linux, BSD and Windows, which is great!
>
> Please also describe the benefits of using this DPDK library, compared to directly using the hwloc library.
Hello Vipin and others,

please, will there be any progress or update on this series?

I successfully tested those changes on our Intel and AMD machines and would like to use it in production soon.

The API is a little bit unintuitive, at least for me, but I successfully integrated it into our software.

I am missing a clear relation to the NUMA socket approach used in DPDK. E.g. I would like to be able to easily walk over a list of lcores from a specific NUMA node grouped by L3 domain. Yes, there is the RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket IDs?

Also, I do not clearly understand what is the purpose of using a domain selector like:

  RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2

or even:

  RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2

The documentation does not explain this. I could not spot any kind of grouping that would help me in any way. Some "best practices" examples would be nice to have to understand the intentions better.

I found a little catch when running DPDK with more lcores than there are physical or SMT CPU cores. This happens when using e.g. an option like --lcores=(0-15)@(0-1). The results from the topology API would not match the lcores because hwloc is not aware of the lcores concept. This might be mentioned somewhere.

Anyway, I really appreciate this work and would like to see it upstream. Especially for AMD machines, some framework like this is a must.

Kind regards,
Jan

On Tue, 5 Nov 2024 15:58:45 +0530
Vipin Varghese <vipin.varghese@amd.com> wrote:

> This patch introduces improvements for NUMA topology awareness in relation to DPDK logical cores.
>
> snipped
[AMD Official Use Only - AMD Internal Distribution Only]

Snipped

> Hello Vipin and others,
>
> please, will there be any progress or update on this series?

Apologies, we did a small update in Slack and missed this out here. Let me try to address your questions below.

> I successfully tested those changes on our Intel and AMD machines and would like to use it in production soon.
>
> The API is a little bit unintuitive, at least for me, but I successfully integrated it into our software.
>
> I am missing a clear relation to the NUMA socket approach used in DPDK. E.g. I would like to be able to easily walk over a list of lcores from a specific NUMA node grouped by L3 domain. Yes, there is the RTE_LCORE_DOMAIN_IO, but would it always match the appropriate socket IDs?

Yes, we at AMD were internally debating the same. But since the lcore API already has `rte_lcore_to_socket_id`, adding yet another variation or argument lacked luster.
Hence we were internally debating: when using the new API, why not check whether it is the desired physical socket or sub-socket NUMA domain?

Hence, we did not add the option.

> Also, I do not clearly understand what is the purpose of using a domain selector like:
>
> RTE_LCORE_DOMAIN_L1 | RTE_LCORE_DOMAIN_L2
>
> or even:
>
> RTE_LCORE_DOMAIN_L3 | RTE_LCORE_DOMAIN_L2

I believe we have mentioned in the documents to choose one; if multiple flags are combined, based on the code flow only one will be picked up.

The real use of these is to select physical cores under the same cache or IO domain. Example: a certain SoC has 4 cores sharing L2, which makes pipeline processing more convenient (less data movement). In such cases, select lcores within the same L2 topology.

> The documentation does not explain this. I could not spot any kind of grouping that would help me in any way. Some "best practices" examples would be nice to have to understand the intentions better.

From https://patches.dpdk.org/project/dpdk/cover/20241105102849.1947-1-vipin.varghese@amd.com/

```
Reason:
- Applications using DPDK libraries relies on consistent memory access.
- Lcores being closer to same NUMA domain as IO.
- Lcores sharing same cache.

Latency is minimized by using lcores that share the same NUMA topology.
Memory access is optimized by utilizing cores within the same NUMA domain or tile. Cache coherence is preserved within the same shared cache domain, reducing the remote access from tile|compute package via snooping (local hit in either L2 or L3 within same NUMA domain).
```

> I found a little catch when running DPDK with more lcores than there are physical or SMT CPU cores. This happens when using e.g. an option like --lcores=(0-15)@(0-1). The results from the topology API would not match the lcores because hwloc is not aware of the lcores concept. This might be mentioned somewhere.

Yes, this is expected, as one can map any CPU cores to DPDK lcores with `lcore-map`.
We did mention this in RFC v4, but when upgrading to RFC v5 we missed mentioning it again.

> Anyway, I really appreciate this work and would like to see it upstream.
> Especially for AMD machines, some framework like this is a must.
>
> Kind regards,
> Jan

We are planning to remove the RFC tag and share the final version for the upcoming DPDK release shortly.
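To make the socket relation discussed above concrete, here is a minimal sketch that walks the lcores of one NUMA node grouped by L3 domain by combining the proposed domain calls with the existing rte_lcore_to_socket_id(); the domain function signatures are assumptions based on the names in the cover letter, not the actual patches:

```
#include <stdio.h>
#include <rte_lcore.h>

/* Sketch only: rte_get_domain_count(), rte_lcore_count_from_domain() and
 * rte_get_lcore_in_domain() argument orders are assumed;
 * rte_lcore_to_socket_id() is the existing EAL call used to filter on the
 * desired physical socket / NUMA node.
 */
static void
list_l3_groups_on_socket(unsigned int socket)
{
	unsigned int domains = rte_get_domain_count(RTE_LCORE_DOMAIN_L3);

	for (unsigned int d = 0; d < domains; d++) {
		unsigned int count =
			rte_lcore_count_from_domain(RTE_LCORE_DOMAIN_L3, d);

		for (unsigned int i = 0; i < count; i++) {
			unsigned int lcore =
				rte_get_lcore_in_domain(RTE_LCORE_DOMAIN_L3, d, i);

			if (lcore < RTE_MAX_LCORE &&
			    rte_lcore_to_socket_id(lcore) == socket)
				printf("socket %u, L3 domain %u: lcore %u\n",
				       socket, d, lcore);
		}
	}
}
```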
[Public]

Hi All,

Sharing the next version of the `rte_topology_` API patch, targeted for the upcoming release.
Extras: adding support for the Cache-ID of L2 and L3 for Cache Line Stashing and Code Data Prioritization too.

Snipped

> > Anyway, I really appreciate this work and would like to see it upstream.
> > Especially for AMD machines, some framework like this is a must.
> >
> > Kind regards,
> > Jan
>
> We are planning to remove the RFC tag and share the final version for the upcoming DPDK release shortly.