From patchwork Tue Feb 7 16:00:05 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi
X-Patchwork-Id: 123312
X-Patchwork-Delegate: thomas@monjalon.net
From: Srikanth Yalavarthi
To: Srikanth Yalavarthi
Subject: [PATCH v6 1/4] mldev: add headers for internal ML functions
Date: Tue, 7 Feb 2023 08:00:05 -0800
Message-ID: <20230207160008.30182-2-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
List-Id: DPDK patches and discussions

Added header files for internal ML utility routines to convert IO type and
format to string, IO type to size and routines to convert data types.

Signed-off-by: Srikanth Yalavarthi
---
Depends-on: series-26858 ("Implementation of mldev test application")

v6:
* Updated release notes and series dependencies

v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v3:
* Skip installation of internal common/ml headers

v2:
* Moved implementation out of patch. Only headers are included.
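
Reviewer note: the quantize and dequantize declarations in this patch take a scale factor and a flat buffer. The sketch below is a minimal, standalone illustration of that contract for the INT8 case, assuming the semantics "multiply by scale, round to nearest, saturate" on the way in and "multiply by scale" on the way out. The names sketch_round_saturate, sketch_float32_to_int8 and sketch_int8_to_float32 are hypothetical and are not part of the series; mapping NaN to 0 is an assumption of the sketch only.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round to nearest (ties away from zero), then saturate to [lo, hi]. */
static int32_t
sketch_round_saturate(float x, int32_t lo, int32_t hi)
{
	int32_t t;
	float frac;

	if (x != x) /* NaN: mapped to 0 in this sketch (an assumption) */
		return 0;
	if (x <= (float)lo)
		return lo;
	if (x >= (float)hi)
		return hi;

	t = (int32_t)x; /* truncates toward zero; in range because lo < x < hi */
	frac = x - (float)t;
	if (frac >= 0.5f)
		t++;
	else if (frac <= -0.5f)
		t--;

	return t;
}

/* Hypothetical stand-in for a float32 to INT8 quantization routine. */
static int
sketch_float32_to_int8(float scale, uint64_t nb_elements, const float *input, int8_t *output)
{
	uint64_t i;

	if (scale == 0 || nb_elements == 0 || input == NULL || output == NULL)
		return -EINVAL;

	for (i = 0; i < nb_elements; i++)
		output[i] = (int8_t)sketch_round_saturate(input[i] * scale, INT8_MIN, INT8_MAX);

	return 0;
}

/* Hypothetical stand-in for an INT8 to float32 dequantization routine. */
static int
sketch_int8_to_float32(float scale, uint64_t nb_elements, const int8_t *input, float *output)
{
	uint64_t i;

	if (scale == 0 || nb_elements == 0 || input == NULL || output == NULL)
		return -EINVAL;

	for (i = 0; i < nb_elements; i++)
		output[i] = scale * (float)input[i];

	return 0;
}
```

With these semantics a caller that quantizes with scale s would dequantize with 1/s; values whose scaled magnitude exceeds the INT8 range are clipped and cannot be recovered.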
 doc/guides/rel_notes/release_23_03.rst |   5 +
 lib/mldev/meson.build                  |   2 +
 lib/mldev/mldev_utils.c                |   5 +
 lib/mldev/mldev_utils.h                | 345 +++++++++++++++++++++++++
 4 files changed, 357 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

--
2.17.1

diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index cd1ac98abe..425323241e 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -95,6 +95,11 @@ New Features
   * Test case for inferences from multiple models in ordered mode.
   * Test case for inferences from multiple models.in interleaving mode.
 
+* **Added common driver functions for machine learning device library.**
+
+  * Added functions to translate IO type and format to string.
+  * Added functions to quantize and dequantize inference IO data.
+
 
 Removed Items
 -------------
diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )
 
 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(
 
 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )
 
 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs are for use by ML drivers; user applications should not use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size of an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] format
+ *	Enumeration of ML IO format.
+ * @param[out] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *	Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */

From patchwork Tue Feb 7 16:00:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi
X-Patchwork-Id: 123313
X-Patchwork-Delegate: thomas@monjalon.net
From: Srikanth Yalavarthi
To: Srikanth Yalavarthi
Subject: [PATCH v6 2/4] mldev: implement ML IO type handling functions
Date: Tue, 7 Feb 2023 08:00:06 -0800
Message-ID: <20230207160008.30182-3-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
List-Id: DPDK patches and discussions

Implemented ML utility functions to convert IO data type to name, IO format
to name and routine to get the size of an IO data type in bytes.
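
A caller would typically combine the size lookup with an element count to size a buffer, and use the name lookup for logging. The following is a minimal standalone sketch of that usage, not the code in this patch: the enum sketch_io_type and the sketch_* functions are hypothetical stand-ins that mirror only three of the IO types, and snprintf is used here in place of rte_strlcpy so the sketch builds without DPDK.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a few values of the ML IO type enumeration. */
enum sketch_io_type {
	SKETCH_IO_TYPE_UNKNOWN = 0,
	SKETCH_IO_TYPE_INT8,
	SKETCH_IO_TYPE_FP16,
	SKETCH_IO_TYPE_FP32,
};

/* Size of one element in bytes, or -EINVAL for an unrecognised type. */
static int
sketch_io_type_size_get(enum sketch_io_type type)
{
	switch (type) {
	case SKETCH_IO_TYPE_INT8:
		return 1;
	case SKETCH_IO_TYPE_FP16:
		return 2;
	case SKETCH_IO_TYPE_FP32:
		return 4;
	default:
		return -EINVAL;
	}
}

/* Copy the type name into str, truncating and NUL-terminating when len > 0. */
static void
sketch_io_type_to_str(enum sketch_io_type type, char *str, int len)
{
	const char *name;

	switch (type) {
	case SKETCH_IO_TYPE_UNKNOWN:
		name = "unknown";
		break;
	case SKETCH_IO_TYPE_INT8:
		name = "int8";
		break;
	case SKETCH_IO_TYPE_FP16:
		name = "float16";
		break;
	case SKETCH_IO_TYPE_FP32:
		name = "float32";
		break;
	default:
		name = "invalid";
	}

	if (str != NULL && len > 0)
		snprintf(str, (size_t)len, "%s", name);
}

/* Bytes needed for nb_elements values of the given type (overflow is not checked here). */
static int64_t
sketch_io_buffer_size(enum sketch_io_type type, uint64_t nb_elements)
{
	int size = sketch_io_type_size_get(type);

	if (size < 0)
		return size;

	return (int64_t)(nb_elements * (uint64_t)size);
}
```

Returning a negative errno from the size lookup lets the buffer-size helper pass the error through unchanged, which is why the unknown type is rejected rather than given a size of zero.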

Signed-off-by: Srikanth Yalavarthi
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)

--
2.17.1

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */
 
+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index d2b30a991a..9d06659493 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -48,4 +48,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };

From patchwork Tue Feb 7 16:00:07 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi
X-Patchwork-Id: 123315
X-Patchwork-Delegate: thomas@monjalon.net
From: Srikanth Yalavarthi
To: Srikanth Yalavarthi
Subject: [PATCH v6 3/4] mldev: add scalar type conversion functions
Date: Tue, 7 Feb 2023 08:00:07 -0800
Message-ID: <20230207160008.30182-4-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
List-Id: DPDK patches and discussions

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 lib/mldev/meson.build          |   1 +
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++++++++
 lib/mldev/version.map          |  12 +
 3 files changed, 733 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_scalar.c

--
2.17.1

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 452b83a480..fce9c0ebee 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -5,6 +5,7 @@ sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
         'mldev_utils.c',
+        'mldev_utils_scalar.c',
 )
 
 headers = files(
diff --git a/lib/mldev/mldev_utils_scalar.c b/lib/mldev/mldev_utils_scalar.c
new file mode 100644
index 0000000000..40320ed3ef
--- /dev/null
+++ b/lib/mldev/mldev_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "mldev_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa) \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa) \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa) \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */ +static uint16_t +__float32_to_float16_scalar_rtn(float x) +{ + union float32 f32; /* float32 input */ + uint32_t f32_s; /* float32 sign */ + uint32_t f32_e; /* float32 exponent */ + uint32_t f32_m; /* float32 mantissa */ + uint16_t f16_s; /* float16 sign */ + uint16_t f16_e; /* float16 exponent */ + uint16_t f16_m; /* float16 mantissa */ + uint32_t tbits; /* number of truncated bits */ + uint32_t tmsb; /* MSB position of truncated bits */ + uint32_t m_32; /* temporary float32 mantissa */ + uint16_t m_16; /* temporary float16 mantissa */ + uint16_t u16; /* float16 output */ + int be_16; /* float16 biased exponent, signed */ + + f32.f = x; + f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S; + f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E; + f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M; + + f16_s = f32_s; + f16_e = 0; + f16_m = 0; + + switch (f32_e) { + case (0): /* float32: zero or subnormal number */ + f16_e = 0; + if (f32_m == 0) /* zero */ + f16_m = 0; + else /* subnormal number, convert to zero */ + f16_m = 0; + break; + case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */ + f16_e = FP16_MASK_E >> FP16_LSB_E; + if (f32_m == 0) { /* infinity */ + f16_m = 0; + } else { /* nan, propagate mantissa and set MSB of mantissa to 1 */ + f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M); + f16_m |= BIT(FP16_MSB_M); + } + break; + default: /* float32: normal number */ + /* compute biased exponent for float16 */ + be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E; + + /* overflow, be_16 = [31-INF], set to infinity */ + if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) { + f16_e = FP16_MASK_E >> FP16_LSB_E; + f16_m = 0; + } else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) { + /* normal float16, be_16 = [1:30]*/ + f16_e = be_16; + m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E); + tmsb = FP32_MSB_M - FP16_MSB_M - 1; + if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) { + /* round: non-zero truncated bits except MSB */ + m_16++; + + /* overflow into exponent */ + if 
(((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1) + f16_e++; + } else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) { + /* tie: only MSB of truncated bits is set, round up if LSB of m_16 is set */ + if ((m_16 & 0x1) == 0x1) { + m_16++; + + /* overflow into exponent */ + if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1) + f16_e++; + } + } + f16_m = m_16 & FP16_MASK_M; + } else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) { + /* underflow: zero / subnormal, be_16 = [-9:0] */ + f16_e = 0; + + /* add implicit leading one */ + m_32 = f32_m | BIT(FP32_LSB_E); + tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1; + m_16 = m_32 >> tbits; + + /* if non-leading truncated bits are set */ + if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) { + m_16++; + + /* overflow into exponent */ + if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1) + f16_e++; + } else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) { + /* tie: only leading truncated bit is set, round up if LSB of m_16 is set */ + if ((m_16 & 0x1) == 0x1) { + m_16++; + + /* overflow into exponent */ + if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1) + f16_e++; + } + } + f16_m = m_16 & FP16_MASK_M; + } else if (be_16 == -(int)(FP16_MSB_M + 1)) { + /* underflow: zero or smallest subnormal, be_16 = [-10] */ + f16_e = 0; + if (f32_m != 0) + f16_m = 1; + else + f16_m = 0; + } else { + /* underflow: zero, be_16 = [-INF:-11] */ + f16_e = 0; + f16_m = 0; + } + + break; + } + + u16 = FP16_PACK(f16_s, f16_e, f16_m); + + return u16; +} + +__rte_weak int +rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + uint16_t *output_buffer; + uint64_t i; + + if ((nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (float *)input; + output_buffer = (uint16_t *)output; + + for (i = 0; i < nb_elements; i++) { + *output_buffer = __float32_to_float16_scalar_rtn(*input_buffer); + + input_buffer = input_buffer + 1; + output_buffer = output_buffer + 1; + } + + return 0; +} + +/* Convert a half precision floating point
number (float16) into a single precision + * floating point number (float32). + */ +static float +__float16_to_float32_scalar_rtx(uint16_t f16) +{ + union float32 f32; /* float32 output */ + uint16_t f16_s; /* float16 sign */ + uint16_t f16_e; /* float16 exponent */ + uint16_t f16_m; /* float16 mantissa */ + uint32_t f32_s; /* float32 sign */ + uint32_t f32_e; /* float32 exponent */ + uint32_t f32_m; /* float32 mantissa*/ + uint8_t shift; /* number of bits to be shifted */ + uint32_t clz; /* count of leading zeroes */ + int e_16; /* float16 exponent unbiased */ + + f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S; + f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E; + f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M; + + f32_s = f16_s; + switch (f16_e) { + case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */ + f32_e = FP32_MASK_E >> FP32_LSB_E; + if (f16_m == 0x0) { /* infinity */ + f32_m = f16_m; + } else { /* nan, propagate mantissa, set MSB of mantissa to 1 */ + f32_m = f16_m; + shift = FP32_MSB_M - FP16_MSB_M; + f32_m = (f32_m << shift) & FP32_MASK_M; + f32_m |= BIT(FP32_MSB_M); + } + break; + case 0: /* float16: zero or sub-normal */ + f32_m = f16_m; + if (f16_m == 0) { /* zero signed */ + f32_e = 0; + } else { /* subnormal numbers */ + clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E; + e_16 = (int)f16_e - clz; + f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E; + + shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1; + f32_m = (f32_m << shift) & FP32_MASK_M; + } + break; + default: /* normal numbers */ + f32_m = f16_m; + e_16 = (int)f16_e; + f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E; + + shift = (FP32_MSB_M - FP16_MSB_M); + f32_m = (f32_m << shift) & FP32_MASK_M; + } + + f32.u = FP32_PACK(f32_s, f32_e, f32_m); + + return f32.f; +} + +__rte_weak int +rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output) +{ + uint16_t *input_buffer; + float *output_buffer; + uint64_t i; + + if ((nb_elements == 0) || (input == NULL) || (output == NULL)) + 
return -EINVAL; + + input_buffer = (uint16_t *)input; + output_buffer = (float *)output; + + for (i = 0; i < nb_elements; i++) { + *output_buffer = __float16_to_float32_scalar_rtx(*input_buffer); + + input_buffer = input_buffer + 1; + output_buffer = output_buffer + 1; + } + + return 0; +} + +/* Convert a single precision floating point number (float32) into a + * brain float number (bfloat16) using round to nearest rounding mode. + */ +static uint16_t +__float32_to_bfloat16_scalar_rtn(float x) +{ + union float32 f32; /* float32 input */ + uint32_t f32_s; /* float32 sign */ + uint32_t f32_e; /* float32 exponent */ + uint32_t f32_m; /* float32 mantissa */ + uint16_t b16_s; /* bfloat16 sign */ + uint16_t b16_e; /* bfloat16 exponent */ + uint16_t b16_m; /* bfloat16 mantissa */ + uint32_t tbits; /* number of truncated bits */ + uint16_t u16; /* bfloat16 output */ + + f32.f = x; + f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S; + f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E; + f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M; + + b16_s = f32_s; + b16_e = 0; + b16_m = 0; + + switch (f32_e) { + case (0): /* float32: zero or subnormal number */ + b16_e = 0; + if (f32_m == 0) /* zero */ + b16_m = 0; + else /* subnormal float32 number, normal bfloat16 */ + goto bf16_normal; + break; + case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */ + b16_e = BF16_MASK_E >> BF16_LSB_E; + if (f32_m == 0) { /* infinity */ + b16_m = 0; + } else { /* nan, propagate mantissa and set MSB of mantissa to 1 */ + b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M); + b16_m |= BIT(BF16_MSB_M); + } + break; + default: /* float32: normal number, normal bfloat16 */ + goto bf16_normal; + } + + goto bf16_pack; + +bf16_normal: + b16_e = f32_e; + tbits = FP32_MSB_M - BF16_MSB_M; + b16_m = f32_m >> tbits; + + /* if non-leading truncated bits are set */ + if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) { + b16_m++; + + /* if overflow into exponent */ + if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1) +
b16_e++; + } else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) { + /* if only leading truncated bit is set */ + if ((b16_m & 0x1) == 0x1) { + b16_m++; + + /* if overflow into exponent */ + if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1) + b16_e++; + } + } + b16_m = b16_m & BF16_MASK_M; + +bf16_pack: + u16 = BF16_PACK(b16_s, b16_e, b16_m); + + return u16; +} + +__rte_weak int +rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + uint16_t *output_buffer; + uint64_t i; + + if ((nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (float *)input; + output_buffer = (uint16_t *)output; + + for (i = 0; i < nb_elements; i++) { + *output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer); + + input_buffer = input_buffer + 1; + output_buffer = output_buffer + 1; + } + + return 0; +} + +/* Convert a brain float number (bfloat16) into a + * single precision floating point number (float32). 
+ */ +static float +__bfloat16_to_float32_scalar_rtx(uint16_t f16) +{ + union float32 f32; /* float32 output */ + uint16_t b16_s; /* bfloat16 sign */ + uint16_t b16_e; /* bfloat16 exponent */ + uint16_t b16_m; /* bfloat16 mantissa */ + uint32_t f32_s; /* float32 sign */ + uint32_t f32_e; /* float32 exponent */ + uint32_t f32_m; /* float32 mantissa */ + uint8_t shift; /* number of bits to be shifted */ + + b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S; + b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E; + b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M; + + f32_s = b16_s; + switch (b16_e) { + case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */ + f32_e = FP32_MASK_E >> FP32_LSB_E; + if (b16_m == 0x0) { /* infinity */ + f32_m = 0; + } else { /* nan, propagate mantissa, set MSB of mantissa to 1 */ + f32_m = b16_m; + shift = FP32_MSB_M - BF16_MSB_M; + f32_m = (f32_m << shift) & FP32_MASK_M; + f32_m |= BIT(FP32_MSB_M); + } + break; + case 0: /* bfloat16: zero or subnormal */ + f32_m = b16_m; + if (b16_m == 0) { /* zero signed */ + f32_e = 0; + } else { /* subnormal numbers */ + goto fp32_normal; + } + break; + default: /* bfloat16: normal number */ + goto fp32_normal; + } + + goto fp32_pack; + +fp32_normal: + f32_m = b16_m; + f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E; + + shift = (FP32_MSB_M - BF16_MSB_M); + f32_m = (f32_m << shift) & FP32_MASK_M; + +fp32_pack: + f32.u = FP32_PACK(f32_s, f32_e, f32_m); + + return f32.f; +} + +__rte_weak int +rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output) +{ + uint16_t *input_buffer; + float *output_buffer; + uint64_t i; + + if ((nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (uint16_t *)input; + output_buffer = (float *)output; + + for (i = 0; i < nb_elements; i++) { + *output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer); + + input_buffer = input_buffer + 1; + output_buffer = output_buffer + 1; + } + + return 0; +} diff --git a/lib/mldev/version.map
b/lib/mldev/version.map index 9d06659493..0706b565be 100644 --- a/lib/mldev/version.map +++ b/lib/mldev/version.map @@ -52,4 +52,16 @@ INTERNAL { rte_ml_io_type_size_get; rte_ml_io_type_to_str; rte_ml_io_format_to_str; + rte_ml_io_float32_to_int8; + rte_ml_io_int8_to_float32; + rte_ml_io_float32_to_uint8; + rte_ml_io_uint8_to_float32; + rte_ml_io_float32_to_int16; + rte_ml_io_int16_to_float32; + rte_ml_io_float32_to_uint16; + rte_ml_io_uint16_to_float32; + rte_ml_io_float32_to_float16; + rte_ml_io_float16_to_float32; + rte_ml_io_float32_to_bfloat16; + rte_ml_io_bfloat16_to_float32; }; From patchwork Tue Feb 7 16:00:08 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Srikanth Yalavarthi X-Patchwork-Id: 123316 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@inbox.dpdk.org Delivered-To: patchwork@inbox.dpdk.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 084FA41C30; Tue, 7 Feb 2023 17:00:54 +0100 (CET) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id B614642D1A; Tue, 7 Feb 2023 17:00:26 +0100 (CET) Received: from mx0b-0016f401.pphosted.com (mx0b-0016f401.pphosted.com [67.231.156.173]) by mails.dpdk.org (Postfix) with ESMTP id 7F4BD42D0E for ; Tue, 7 Feb 2023 17:00:25 +0100 (CET) Received: from pps.filterd (m0045851.ppops.net [127.0.0.1]) by mx0b-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 317BEAMO015844; Tue, 7 Feb 2023 08:00:21 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-type; s=pfpt0220; bh=KPwoQ0e/FPVHNZB8nsJz2b7Xj3MqEwHr28eVfdwFC10=; b=PJJkQLSZHvHuPa/JgafGAlM6ZqxcKUQMA1p3EYWBITpCe4DH/5HGlimayehSsKkGcz+x 6rRiiYiiiKYwHJcRkLNl+my83oLoHocOlpo8YTahYNh2ZYer92rUflqFDc7Vqf2hphhz 
G8PK8iXBus+Sfg77uhOrEuix6r0mg8k9KVil8u4jNbr36/oP9eVXYqu4Jz21KO/duHRq uM8R8WZJw7yDt9L/XYw+BIA4wrZsL5HGvthcXMzau+zKczXlcuQ0wQA9l+hij7mmTMib 2BlKguihQIvotQ8BISMmPYAYcQDxBxRk1MII9BM7Qee1uMIsCyPElweFInBMXpfPUlqQ oQ== Received: from dc5-exch02.marvell.com ([199.233.59.182]) by mx0b-0016f401.pphosted.com (PPS) with ESMTPS id 3nhqrtmr96-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT); Tue, 07 Feb 2023 08:00:20 -0800 Received: from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH02.marvell.com (10.69.176.39) with Microsoft SMTP Server (TLS) id 15.0.1497.42; Tue, 7 Feb 2023 08:00:18 -0800 Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend Transport; Tue, 7 Feb 2023 08:00:18 -0800 Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233]) by maili.marvell.com (Postfix) with ESMTP id E7E233F70EC; Tue, 7 Feb 2023 08:00:11 -0800 (PST) From: Srikanth Yalavarthi To: Srikanth Yalavarthi , Ruifeng Wang CC: , , , , , Subject: [PATCH v6 4/4] mldev: add Arm NEON type conversion routines Date: Tue, 7 Feb 2023 08:00:08 -0800 Message-ID: <20230207160008.30182-5-syalavarthi@marvell.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com> References: <20221208193532.16718-1-syalavarthi@marvell.com> <20230207160008.30182-1-syalavarthi@marvell.com> MIME-Version: 1.0 X-Proofpoint-GUID: j95RbRgL9hdLHi6ro8AOLWkUhiPP_O5d X-Proofpoint-ORIG-GUID: j95RbRgL9hdLHi6ro8AOLWkUhiPP_O5d X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1 definitions=2023-02-07_07,2023-02-06_03,2022-06-22_01 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Added ARM NEON intrinsic based implementations to support conversion of data 
types. Support is enabled to handle int8, uint8, int16, uint16, float16, float32 and bfloat16 types. Signed-off-by: Srikanth Yalavarthi --- v5: * Moved the code from drivers/common/ml to lib/mldev * Added rte_ml_io_ prefix to the functions v2: * Dropped use of driver routines to call neon functions * Optimization of neon functions. Reduce the number of intrinsic calls. lib/mldev/meson.build | 4 + lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++ 2 files changed, 877 insertions(+) create mode 100644 lib/mldev/mldev_utils_neon.c -- 2.17.1 diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build index fce9c0ebee..05694b0839 100644 --- a/lib/mldev/meson.build +++ b/lib/mldev/meson.build @@ -8,6 +8,10 @@ sources = files( 'mldev_utils_scalar.c', ) +if arch_subdir == 'arm' + sources += files('mldev_utils_neon.c') +endif + headers = files( 'rte_mldev.h', ) diff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c new file mode 100644 index 0000000000..32b620db20 --- /dev/null +++ b/lib/mldev/mldev_utils_neon.c @@ -0,0 +1,873 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright (c) 2022 Marvell. + */ + +#include +#include +#include + +#include "mldev_utils.h" + +#include + +/* Description: + * This file implements vector versions of Machine Learning utility functions used to convert data + * types from higher precision to lower precision and vice-versa. Implementation is based on Arm + * Neon intrinsics. + */ + +static inline void +__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output) +{ + int16x4_t s16x4_l; + int16x4_t s16x4_h; + float32x4_t f32x4; + int16x8_t s16x8; + int32x4_t s32x4; + int8x8_t s8x8; + + /* load 4 float32 elements, scale, convert, saturate narrow to int16. + * Use round to nearest with ties away rounding mode. 
+ */ + f32x4 = vld1q_f32(input); + f32x4 = vmulq_n_f32(f32x4, scale); + s32x4 = vcvtaq_s32_f32(f32x4); + s16x4_l = vqmovn_s32(s32x4); + + /* load next 4 float32 elements, scale, convert, saturate narrow to int16. + * Use round to nearest with ties away rounding mode. + */ + f32x4 = vld1q_f32(input + 4); + f32x4 = vmulq_n_f32(f32x4, scale); + s32x4 = vcvtaq_s32_f32(f32x4); + s16x4_h = vqmovn_s32(s32x4); + + /* combine lower and higher int16x4_t to int16x8_t */ + s16x8 = vcombine_s16(s16x4_l, s16x4_h); + + /* narrow to int8_t */ + s8x8 = vqmovn_s16(s16x8); + + /* store 8 elements */ + vst1_s8(output, s8x8); +} + +static inline void +__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output) +{ + int32_t s32; + int16_t s16; + + /* scale and convert, round to nearest with ties away rounding mode */ + s32 = vcvtas_s32_f32(scale * (*input)); + + /* saturate narrow */ + s16 = vqmovns_s32(s32); + + /* convert to int8_t */ + *output = vqmovnh_s16(s16); +} + +int +rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + int8_t *output_buffer; + uint64_t nb_iterations; + uint32_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (float *)input; + output_buffer = (int8_t *)output; + vlen = 2 * sizeof(float) / sizeof(int8_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output) +{ + float32x4_t f32x4; + int16x8_t s16x8; + 
int16x4_t s16x4; + int32x4_t s32x4; + int8x8_t s8x8; + + /* load 8 x int8_t elements */ + s8x8 = vld1_s8(input); + + /* widen int8_t to int16_t */ + s16x8 = vmovl_s8(s8x8); + + /* convert lower 4 elements: widen to int32_t, convert to float, scale and store */ + s16x4 = vget_low_s16(s16x8); + s32x4 = vmovl_s16(s16x4); + f32x4 = vcvtq_f32_s32(s32x4); + f32x4 = vmulq_n_f32(f32x4, scale); + vst1q_f32(output, f32x4); + + /* convert higher 4 elements: widen to int32_t, convert to float, scale and store */ + s16x4 = vget_high_s16(s16x8); + s32x4 = vmovl_s16(s16x4); + f32x4 = vcvtq_f32_s32(s32x4); + f32x4 = vmulq_n_f32(f32x4, scale); + vst1q_f32(output + 4, f32x4); +} + +static inline void +__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output) +{ + *output = scale * vcvts_f32_s32((int32_t)*input); +} + +int +rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output) +{ + int8_t *input_buffer; + float *output_buffer; + uint64_t nb_iterations; + uint32_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (int8_t *)input; + output_buffer = (float *)output; + vlen = 2 * sizeof(float) / sizeof(int8_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output) +{ + uint16x4_t u16x4_l; + uint16x4_t u16x4_h; + float32x4_t f32x4; + uint32x4_t u32x4; + uint16x8_t u16x8; + uint8x8_t u8x8; + + /* load 4 float elements, scale, convert, saturate narrow to 
uint16_t. + * use round to nearest with ties away rounding mode. + */ + f32x4 = vld1q_f32(input); + f32x4 = vmulq_n_f32(f32x4, scale); + u32x4 = vcvtaq_u32_f32(f32x4); + u16x4_l = vqmovn_u32(u32x4); + + /* load next 4 float elements, scale, convert, saturate narrow to uint16_t + * use round to nearest with ties away rounding mode. + */ + f32x4 = vld1q_f32(input + 4); + f32x4 = vmulq_n_f32(f32x4, scale); + u32x4 = vcvtaq_u32_f32(f32x4); + u16x4_h = vqmovn_u32(u32x4); + + /* combine lower and higher uint16x4_t */ + u16x8 = vcombine_u16(u16x4_l, u16x4_h); + + /* narrow to uint8x8_t */ + u8x8 = vqmovn_u16(u16x8); + + /* store 8 elements */ + vst1_u8(output, u8x8); +} + +static inline void +__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output) +{ + uint32_t u32; + uint16_t u16; + + /* scale and convert, round to nearest with ties away rounding mode */ + u32 = vcvtas_u32_f32(scale * (*input)); + + /* saturate narrow */ + u16 = vqmovns_u32(u32); + + /* convert to uint8_t */ + *output = vqmovnh_u16(u16); +} + +int +rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + uint8_t *output_buffer; + uint64_t nb_iterations; + uint32_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (float *)input; + output_buffer = (uint8_t *)output; + vlen = 2 * sizeof(float) / sizeof(uint8_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__uint8_to_float32_neon_f32x8(float scale, uint8_t 
*input, float *output) +{ + float32x4_t f32x4; + uint16x8_t u16x8; + uint16x4_t u16x4; + uint32x4_t u32x4; + uint8x8_t u8x8; + + /* load 8 x uint8_t elements */ + u8x8 = vld1_u8(input); + + /* widen uint8_t to uint16_t */ + u16x8 = vmovl_u8(u8x8); + + /* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */ + u16x4 = vget_low_u16(u16x8); + u32x4 = vmovl_u16(u16x4); + f32x4 = vcvtq_f32_u32(u32x4); + f32x4 = vmulq_n_f32(f32x4, scale); + vst1q_f32(output, f32x4); + + /* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */ + u16x4 = vget_high_u16(u16x8); + u32x4 = vmovl_u16(u16x4); + f32x4 = vcvtq_f32_u32(u32x4); + f32x4 = vmulq_n_f32(f32x4, scale); + vst1q_f32(output + 4, f32x4); +} + +static inline void +__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output) +{ + *output = scale * vcvts_f32_u32((uint32_t)*input); +} + +int +rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output) +{ + uint8_t *input_buffer; + float *output_buffer; + uint64_t nb_iterations; + uint64_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (uint8_t *)input; + output_buffer = (float *)output; + vlen = 2 * sizeof(float) / sizeof(uint8_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output) +{ + float32x4_t f32x4; + int16x4_t s16x4; + int32x4_t s32x4; + + /* load 4 x float elements */ + f32x4 = 
vld1q_f32(input); + + /* scale */ + f32x4 = vmulq_n_f32(f32x4, scale); + + /* convert to int32x4_t using round to nearest with ties away rounding mode */ + s32x4 = vcvtaq_s32_f32(f32x4); + + /* saturate narrow to int16x4_t */ + s16x4 = vqmovn_s32(s32x4); + + /* store 4 elements */ + vst1_s16(output, s16x4); +} + +static inline void +__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output) +{ + int32_t s32; + + /* scale and convert, round to nearest with ties away rounding mode */ + s32 = vcvtas_s32_f32(scale * (*input)); + + /* saturate narrow */ + *output = vqmovns_s32(s32); +} + +int +rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + int16_t *output_buffer; + uint64_t nb_iterations; + uint32_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (float *)input; + output_buffer = (int16_t *)output; + vlen = 2 * sizeof(float) / sizeof(int16_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output) +{ + float32x4_t f32x4; + int16x4_t s16x4; + int32x4_t s32x4; + + /* load 4 x int16_t elements */ + s16x4 = vld1_s16(input); + + /* widen int16_t to int32_t */ + s32x4 = vmovl_s16(s16x4); + + /* convert int32_t to float */ + f32x4 = vcvtq_f32_s32(s32x4); + + /* scale */ + f32x4 = vmulq_n_f32(f32x4, scale); + + /* store float32x4_t */ + vst1q_f32(output, f32x4); +} + +static inline void 
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output) +{ + *output = scale * vcvts_f32_s32((int32_t)*input); +} + +int +rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output) +{ + int16_t *input_buffer; + float *output_buffer; + uint64_t nb_iterations; + uint32_t vlen; + uint64_t i; + + if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL)) + return -EINVAL; + + input_buffer = (int16_t *)input; + output_buffer = (float *)output; + vlen = 2 * sizeof(float) / sizeof(int16_t); + nb_iterations = nb_elements / vlen; + + /* convert vlen elements in each iteration */ + for (i = 0; i < nb_iterations; i++) { + __int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer); + input_buffer += vlen; + output_buffer += vlen; + } + + /* convert leftover elements */ + i = i * vlen; + for (; i < nb_elements; i++) { + __int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer); + input_buffer++; + output_buffer++; + } + + return 0; +} + +static inline void +__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output) +{ + float32x4_t f32x4; + uint16x4_t u16x4; + uint32x4_t u32x4; + + /* load 4 float elements */ + f32x4 = vld1q_f32(input); + + /* scale */ + f32x4 = vmulq_n_f32(f32x4, scale); + + /* convert using round to nearest with ties to away rounding mode */ + u32x4 = vcvtaq_u32_f32(f32x4); + + /* saturate narrow */ + u16x4 = vqmovn_u32(u32x4); + + /* store 4 elements */ + vst1_u16(output, u16x4); +} + +static inline void +__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output) +{ + uint32_t u32; + + /* scale and convert, round to nearest with ties away rounding mode */ + u32 = vcvtas_u32_f32(scale * (*input)); + + /* saturate narrow */ + *output = vqmovns_u32(u32); +} + +int +rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output) +{ + float *input_buffer; + uint16_t *output_buffer; + uint64_t nb_iterations; + uint64_t 
vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 / 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */