[v2,0/9] riscv: implement accelerated crc using zbc

Message ID 20240712154645.80622-1-daniel.gregory@bytedance.com (mailing list archive)

Daniel Gregory July 12, 2024, 3:46 p.m. UTC
The RISC-V Zbc extension adds instructions for carry-less multiplication
that we can use to accelerate CRC computation in hardware. This patch set
contains two new implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler
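
To illustrate the Barrett reduction step named in the list above, here is a
minimal, portable sketch. It is not the code from the patches: it uses a
software loop (clmul64) in place of the Zbc instructions, works in the
non-reflected (MSB-first) bit order for readability rather than the reflected
order used by CRC32C and CRC32-ETH, and all function names are hypothetical.
The constant mu is computed at start-up instead of being hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Degree-32 generator polynomial 0x04C11DB7 with the x^32 term made explicit. */
#define POLY 0x104C11DB7ULL

/* Software stand-in for a carry-less multiply; keeps the low 64 product bits. */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a << i;
	return r;
}

/* mu = floor(x^64 / POLY), found by polynomial long division over GF(2). */
static uint64_t barrett_mu(void)
{
	uint64_t mu = 0, rem = 1; /* rem starts as the leading x^64 term */

	for (int i = 63; i >= 0; i--) {
		rem <<= 1;            /* bring down the next (zero) coefficient */
		if (rem >> 32) {      /* degree reached 32: subtract POLY */
			rem ^= POLY;
			mu |= 1ULL << i;
		}
	}
	return mu;
}

/* Return (a * x^32) mod POLY using two carry-less multiplications. */
static uint32_t barrett_reduce(uint32_t a, uint64_t mu)
{
	uint64_t q = clmul64(a, mu) >> 32; /* q = floor(a * x^32 / POLY) */

	return (uint32_t)clmul64(q, POLY); /* low 32 bits of q * POLY */
}

/* Reference: bit-at-a-time, MSB-first CRC with no final XOR. */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= (uint32_t)buf[i] << 24;
		for (int b = 0; b < 8; b++)
			crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
	}
	return crc;
}

/* Same CRC, consuming four bytes per Barrett reduction; tail done bitwise. */
static uint32_t crc32_barrett(uint32_t crc, const uint8_t *buf, size_t len)
{
	const uint64_t mu = barrett_mu();

	while (len >= 4) {
		uint32_t w = (uint32_t)buf[0] << 24 | (uint32_t)buf[1] << 16 |
			     (uint32_t)buf[2] << 8 | (uint32_t)buf[3];

		crc = barrett_reduce(crc ^ w, mu);
		buf += 4;
		len -= 4;
	}
	return crc32_bitwise(crc, buf, len);
}
```

For any buffer and initial value, crc32_barrett() and crc32_bitwise() should
return the same result; the Barrett version replaces 32 shift-and-XOR steps
per word with two multiplications.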

My approach is largely based on Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new
build-time and run-time detection of the RISC-V extensions present in
the compiler and on the target system.
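
The two detection stages described above could look roughly like the sketch
below. It assumes the compiler defines the __riscv_zbc extension test macro
when Zbc is enabled, and that the kernel headers provide the hwprobe syscall
with a RISCV_HWPROBE_EXT_ZBC bit; on any other platform both checks simply
report false. The function names are hypothetical and this is not the code
from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only pull in the hwprobe definitions where they can exist. */
#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#if __has_include(<asm/hwprobe.h>)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#define HAVE_HWPROBE 1
#endif
#endif

/* Build-time check: was the compiler told that Zbc is available? */
static bool compiler_has_zbc(void)
{
#ifdef __riscv_zbc
	return true;
#else
	return false;
#endif
}

/* Run-time check: ask the kernel whether all online harts implement Zbc. */
static bool cpu_has_zbc(void)
{
#if defined(HAVE_HWPROBE) && defined(RISCV_HWPROBE_EXT_ZBC) && \
	defined(__NR_riscv_hwprobe)
	struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

	/* cpusetsize 0 and a NULL cpu set mean "all online CPUs". */
	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
		return false;
	return (pair.value & RISCV_HWPROBE_EXT_ZBC) != 0;
#else
	return false;
#endif
}

/* The accelerated path is only safe when both checks pass. */
static bool use_zbc_crc(void)
{
	return compiler_has_zbc() && cpu_has_zbc();
}
```

A caller would test use_zbc_crc() once at initialisation and fall back to the
table implementation when it returns false.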

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

|Buffer    | CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
|Size (MB) | Table    | Hardware | Table    | Hardware |
|----------|----------|----------|----------|----------|
|        1 |   155168 |    11610 |    73026 |    18385 |
|        2 |   311203 |    22998 |   145586 |    35886 |
|        3 |   466744 |    34370 |   218536 |    53939 |
|        4 |   621843 |    45536 |   291574 |    71944 |
|        5 |   777908 |    56989 |   364152 |    89706 |
|        6 |   932736 |    68023 |   437016 |   107726 |
|        7 |  1088756 |    79236 |   510197 |   125426 |
|        8 |  1243794 |    90467 |   583231 |   143614 |

These results suggest a speed-up of roughly 13x for lib/net and roughly
4x for lib/hash.
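
The speed-up factors can be checked directly from the table above. The sketch
below copies the cycle counts and averages the table/hardware ratio per
buffer size; the struct and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table: cycles for the table and hardware implementations. */
struct sample {
	double table;
	double hw;
};

/* CRC32-ETH (lib/net) columns, 1 MB to 8 MB. */
static const struct sample net_crc[] = {
	{155168, 11610}, {311203, 22998}, {466744, 34370}, {621843, 45536},
	{777908, 56989}, {932736, 68023}, {1088756, 79236}, {1243794, 90467},
};

/* CRC32C (lib/hash) columns, 1 MB to 8 MB. */
static const struct sample hash_crc[] = {
	{73026, 18385}, {145586, 35886}, {218536, 53939}, {291574, 71944},
	{364152, 89706}, {437016, 107726}, {510197, 125426}, {583231, 143614},
};

/* Mean of the per-row ratio table / hardware. */
static double mean_speedup(const struct sample *s, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += s[i].table / s[i].hw;
	return sum / (double)n;
}
```

Calling mean_speedup(net_crc, 8) gives a value between 13 and 14, and
mean_speedup(hash_crc, 8) gives a value close to 4, in line with the figures
quoted above.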

I have also run the hash_functions_autotest benchmark in dpdk_test,
which measures the performance of the lib/hash implementation on small
buffers, getting the following times:

| Key Length | Time (ticks/op)     |
| (bytes)    | Table    | Hardware |
|------------|----------|----------|
|          1 |     0.47 |     0.85 |
|          2 |     0.57 |     0.87 |
|          4 |     0.99 |     0.88 |
|          8 |     1.35 |     0.88 |
|          9 |     1.20 |     1.09 |
|         13 |     1.76 |     1.35 |
|         16 |     1.87 |     1.02 |
|         32 |     2.96 |     0.98 |
|         37 |     3.35 |     1.45 |
|         40 |     3.49 |     1.12 |
|         48 |     4.02 |     1.25 |
|         64 |     5.08 |     1.54 |

v2:
- replace compile flag with build-time (riscv extension macros) and
  run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on
  small unaligned buffers
- refactor net/ implementation to move variable declarations to top
  of functions
- enable the optimisation in a couple of other places where optimised
  crc is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv
  examples/l3fwd-power: use accelerated crc on riscv
  hash/cuckoo: use accelerated crc on riscv
  member: use accelerated crc on riscv

 MAINTAINERS                                   |   2 +
 app/test/test_crc.c                           |   9 +
 app/test/test_hash.c                          |   7 +
 config/riscv/meson.build                      |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc           |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |   5 +
 examples/l3fwd-power/main.c                   |   2 +-
 examples/l3fwd/l3fwd_em.c                     |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h          |   2 +
 lib/eal/riscv/rte_cpuflags.c                  | 112 +++++++---
 lib/hash/meson.build                          |   1 +
 lib/hash/rte_crc_riscv64.h                    |  89 ++++++++
 lib/hash/rte_cuckoo_hash.c                    |   3 +
 lib/hash/rte_hash_crc.c                       |  13 +-
 lib/hash/rte_hash_crc.h                       |   6 +-
 lib/ip_frag/ip_frag_internal.c                |   6 +-
 lib/member/rte_member.h                       |   2 +-
 lib/net/meson.build                           |   4 +
 lib/net/net_crc.h                             |  11 +
 lib/net/net_crc_zbc.c                         | 191 ++++++++++++++++++
 lib/net/rte_net_crc.c                         |  40 ++++
 lib/net/rte_net_crc.h                         |   2 +
 22 files changed, 529 insertions(+), 41 deletions(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc
 create mode 100644 lib/hash/rte_crc_riscv64.h
 create mode 100644 lib/net/net_crc_zbc.c
  

Comments

David Marchand July 12, 2024, 5:19 p.m. UTC | #1
Thanks for the submission.
This series comes late for v24.07 and there was no review, so it is
deferred to v24.11.

Cc: Sachin for info.