Message ID | 20240712154645.80622-1-daniel.gregory@bytedance.com (mailing list archive) |
---|---|
Headers |
From: Daniel Gregory <daniel.gregory@bytedance.com>
To: Stanislaw Kardach <stanislaw.kardach@gmail.com>
Cc: dev@dpdk.org, Punit Agrawal <punit.agrawal@bytedance.com>, Liang Ma <liangma@liangbit.com>, Pengcheng Wang <wangpengcheng.pp@bytedance.com>, Chunsong Feng <fengchunsong@bytedance.com>, Daniel Gregory <daniel.gregory@bytedance.com>, Stephen Hemminger <stephen@networkplumber.org>
Subject: [PATCH v2 0/9] riscv: implement accelerated crc using zbc
Date: Fri, 12 Jul 2024 16:46:36 +0100
Message-Id: <20240712154645.80622-1-daniel.gregory@bytedance.com>
In-Reply-To: <20240618174133.33457-1-daniel.gregory@bytedance.com>
References: <20240618174133.33457-1-daniel.gregory@bytedance.com>
List-Id: DPDK patches and discussions <dev.dpdk.org> |
Series | riscv: implement accelerated crc using zbc |
Message
Daniel Gregory
July 12, 2024, 3:46 p.m. UTC
The RISC-V Zbc extension adds instructions for carry-less multiplication
we can use to implement CRC in hardware. This patch set contains two new
implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new
build-time and run-time detection of the RISC-V extensions present in
the compiler and on the target system.

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

|Buffer    | CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
|Size (MB) | Table    | Hardware | Table    | Hardware |
|----------|----------|----------|----------|----------|
|        1 |   155168 |    11610 |    73026 |    18385 |
|        2 |   311203 |    22998 |   145586 |    35886 |
|        3 |   466744 |    34370 |   218536 |    53939 |
|        4 |   621843 |    45536 |   291574 |    71944 |
|        5 |   777908 |    56989 |   364152 |    89706 |
|        6 |   932736 |    68023 |   437016 |   107726 |
|        7 |  1088756 |    79236 |   510197 |   125426 |
|        8 |  1243794 |    90467 |   583231 |   143614 |

These results suggest a speed-up of lib/net by thirteen times, and of
lib/hash by four times.
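The speed-up figures can be checked directly against the table. The following is a small sketch that averages the per-row Table/Hardware ratios; the array and function names are illustrative and do not come from the patch set, and the numbers are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Cycle counts copied from the table above, for buffers of 1..8 MB. */
static const double net_table[8] = { 155168, 311203, 466744, 621843,
				     777908, 932736, 1088756, 1243794 };
static const double net_hw[8] = { 11610, 22998, 34370, 45536,
				  56989, 68023, 79236, 90467 };
static const double hash_table[8] = { 73026, 145586, 218536, 291574,
				      364152, 437016, 510197, 583231 };
static const double hash_hw[8] = { 18385, 35886, 53939, 71944,
				   89706, 107726, 125426, 143614 };

/* Mean of the per-row (table cycles / hardware cycles) ratios. */
static double mean_speedup(const double *table, const double *hw, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += table[i] / hw[i];
	return sum / (double)n;
}

/* Print both averages; intended to be called from a test or main(). */
static void print_speedups(void)
{
	printf("lib/net  speed-up: %.2fx\n",
	       mean_speedup(net_table, net_hw, 8));
	printf("lib/hash speed-up: %.2fx\n",
	       mean_speedup(hash_table, hash_hw, 8));
}
```

Every lib/net row has a ratio between 13 and 14, and every lib/hash row is close to 4, which is consistent with the "thirteen times" and "four times" figures.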

I have also run the hash_functions_autotest benchmark in dpdk_test,
which measures the performance of the lib/hash implementation on small
buffers, getting the following times:

| Key Length | Time (ticks/op)     |
| (bytes)    | Table    | Hardware |
|------------|----------|----------|
|          1 |     0.47 |     0.85 |
|          2 |     0.57 |     0.87 |
|          4 |     0.99 |     0.88 |
|          8 |     1.35 |     0.88 |
|          9 |     1.20 |     1.09 |
|         13 |     1.76 |     1.35 |
|         16 |     1.87 |     1.02 |
|         32 |     2.96 |     0.98 |
|         37 |     3.35 |     1.45 |
|         40 |     3.49 |     1.12 |
|         48 |     4.02 |     1.25 |
|         64 |     5.08 |     1.54 |

v2:
- replace compile flag with build-time (riscv extension macros) and
  run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on
  small unaligned buffers
- refactor net/ implementation to move variable declarations to top of
  functions
- enable the optimisation in a couple of other places where optimised
  crc is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv
  examples/l3fwd-power: use accelerated crc on riscv
  hash/cuckoo: use accelerated crc on riscv
  member: use accelerated crc on riscv

 MAINTAINERS                                  |   2 +
 app/test/test_crc.c                          |   9 +
 app/test/test_hash.c                         |   7 +
 config/riscv/meson.build                     |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc          |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst |   5 +
 examples/l3fwd-power/main.c                  |   2 +-
 examples/l3fwd/l3fwd_em.c                    |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h         |   2 +
 lib/eal/riscv/rte_cpuflags.c                 | 112 +++++++---
 lib/hash/meson.build                         |   1 +
 lib/hash/rte_crc_riscv64.h                   |  89 ++++++++
 lib/hash/rte_cuckoo_hash.c                   |   3 +
 lib/hash/rte_hash_crc.c                      |  13 +-
 lib/hash/rte_hash_crc.h                      |   6 +-
 lib/ip_frag/ip_frag_internal.c               |   6 +-
 lib/member/rte_member.h                      |   2 +-
 lib/net/meson.build                          |   4 +
 lib/net/net_crc.h                            |  11 +
 lib/net/net_crc_zbc.c                        | 191 ++++++++++++++++++
 lib/net/rte_net_crc.c                        |  40 ++++
 lib/net/rte_net_crc.h                        |   2 +
 22 files changed, 529 insertions(+), 41 deletions(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc
 create mode 100644 lib/hash/rte_crc_riscv64.h
 create mode 100644 lib/net/net_crc_zbc.c
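
For readers unfamiliar with the Barrett reduction the cover letter refers to, here is a minimal, portable sketch of the idea. It is not the code from this series, and it rests on several simplifying assumptions: the carry-less multiply is emulated in plain C rather than issued as a Zbc instruction, the CRC uses the MSB-first (non-reflected) form of the Ethernet polynomial 0x04C11DB7 instead of the bit-reflected CRCs the DPDK functions compute, and the buffer is consumed 32 bits at a time with no folding. All function names (clmul64, barrett_mu, barrett_reduce, crc32_bitwise, crc32_clmul) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-32 generator polynomial without its x^32 term, MSB-first form. */
#define POLY_LOW 0x04C11DB7u

/* Portable stand-in for a carry-less multiply instruction: the low 64 bits
 * of the product of a and b, with XOR in place of addition. */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a << i;
	return r;
}

/* Barrett constant mu = floor(x^64 / P(x)), by polynomial long division. */
static uint64_t barrett_mu(void)
{
	const uint64_t p = (1ULL << 32) | POLY_LOW;
	uint64_t rem = 1; /* the x^64 term of the dividend */
	uint64_t q = 0;

	for (int i = 0; i < 64; i++) { /* bring down 64 zero coefficients */
		rem <<= 1;
		q <<= 1;
		if (rem >> 32) {
			rem ^= p;
			q |= 1;
		}
	}
	assert(q >> 32 == 1); /* the quotient has degree exactly 32 */
	return q;
}

/* Return (a * x^32) mod P(x) using two carry-less multiplies. */
static uint32_t barrett_reduce(uint32_t a, uint64_t mu)
{
	uint64_t q = clmul64(a, mu) >> 32;     /* quotient of a*x^32 by P */
	return (uint32_t)clmul64(q, POLY_LOW); /* low 32 bits of q*P */
}

/* Reference implementation: one bit at a time, MSB first. */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= (uint32_t)buf[i] << 24;
		for (int b = 0; b < 8; b++)
			crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY_LOW
						  : crc << 1;
	}
	return crc;
}

/* Same CRC, consuming one big-endian 32-bit word per Barrett reduction. */
static uint32_t crc32_clmul(uint32_t crc, const uint8_t *buf, size_t len)
{
	const uint64_t mu = barrett_mu();

	while (len >= 4) {
		uint32_t w = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
			     ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

		crc = barrett_reduce(crc ^ w, mu);
		buf += 4;
		len -= 4;
	}
	return crc32_bitwise(crc, buf, len); /* 0 to 3 trailing bytes */
}
```

Calling crc32_clmul() and crc32_bitwise() with the same initial value and buffer should give the same result. On hardware with Zbc, clmul64() would presumably be replaced by the clmul family of instructions, and the reflected bit order used by CRC32C and CRC32-ETH changes the constants and shifts.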
Comments
On Fri, Jul 12, 2024 at 5:47 PM Daniel Gregory
<daniel.gregory@bytedance.com> wrote:
>
> The RISC-V Zbc extension adds instructions for carry-less multiplication
> we can use to implement CRC in hardware. This patch set contains two new
> implementations:
>
> - one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
>   implement the four rte_hash_crc_* functions
> - one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
>   the buffer until it is small enough for a Barrett reduction to
>   implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler
>
> My approach is largely based on Intel's "Fast CRC Computation Using
> PCLMULQDQ Instruction" white paper
> https://www.researchgate.net/publication/263424619_Fast_CRC_computation
> and a post about "Optimizing CRC32 for small payload sizes on x86"
> https://mary.rs/lab/crc32/
>
> Whether these new implementations are enabled is controlled by new
> build-time and run-time detection of the RISC-V extensions present in
> the compiler and on the target system.
>
> I have carried out some performance comparisons between the generic
> table implementations and the new hardware implementations. Listed below
> is the number of cycles it takes to compute the CRC hash for buffers of
> various sizes (as reported by rte_get_timer_cycles()). These results
> were collected on a Kendryte K230 and averaged over 20 samples:
>
> |Buffer    | CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
> |Size (MB) | Table    | Hardware | Table    | Hardware |
> |----------|----------|----------|----------|----------|
> |        1 |   155168 |    11610 |    73026 |    18385 |
> |        2 |   311203 |    22998 |   145586 |    35886 |
> |        3 |   466744 |    34370 |   218536 |    53939 |
> |        4 |   621843 |    45536 |   291574 |    71944 |
> |        5 |   777908 |    56989 |   364152 |    89706 |
> |        6 |   932736 |    68023 |   437016 |   107726 |
> |        7 |  1088756 |    79236 |   510197 |   125426 |
> |        8 |  1243794 |    90467 |   583231 |   143614 |
>
> These results suggest a speed-up of lib/net by thirteen times, and of
> lib/hash by four times.
>
> I have also run the hash_functions_autotest benchmark in dpdk_test,
> which measures the performance of the lib/hash implementation on small
> buffers, getting the following times:
>
> | Key Length | Time (ticks/op)     |
> | (bytes)    | Table    | Hardware |
> |------------|----------|----------|
> |          1 |     0.47 |     0.85 |
> |          2 |     0.57 |     0.87 |
> |          4 |     0.99 |     0.88 |
> |          8 |     1.35 |     0.88 |
> |          9 |     1.20 |     1.09 |
> |         13 |     1.76 |     1.35 |
> |         16 |     1.87 |     1.02 |
> |         32 |     2.96 |     0.98 |
> |         37 |     3.35 |     1.45 |
> |         40 |     3.49 |     1.12 |
> |         48 |     4.02 |     1.25 |
> |         64 |     5.08 |     1.54 |

Thanks for the submission.
This series comes late for v24.07 and there was no review, so it is
deferred to v24.11.

Cc: Sachin for info.