From patchwork Tue Feb  7 16:00:05 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi <syalavarthi@marvell.com>
X-Patchwork-Id: 123312
X-Patchwork-Delegate: thomas@monjalon.net
Return-Path: <dev-bounces@dpdk.org>
X-Original-To: patchwork@inbox.dpdk.org
Delivered-To: patchwork@inbox.dpdk.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 048BF41C30;
	Tue,  7 Feb 2023 17:00:22 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id D5E9F40A84;
	Tue,  7 Feb 2023 17:00:21 +0100 (CET)
Received: from mx0b-0016f401.pphosted.com (mx0a-0016f401.pphosted.com
 [67.231.148.174])
 by mails.dpdk.org (Postfix) with ESMTP id 49F224021F
 for <dev@dpdk.org>; Tue,  7 Feb 2023 17:00:20 +0100 (CET)
Received: from pps.filterd (m0045849.ppops.net [127.0.0.1])
 by mx0a-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id
 317EdPa0011264 for <dev@dpdk.org>; Tue, 7 Feb 2023 08:00:17 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;
 h=from : to : cc :
 subject : date : message-id : in-reply-to : references : mime-version :
 content-type; s=pfpt0220; bh=OvCqXlbONnL/i+nr6fWOhpz6C1ak62fo2O8HNVr78Zg=;
 b=kMRXWsL0Ut/gxWc1XjUA5yAh6SdbfJNewmmp9jONLX3ztzBjJOJ3IZ3JG4idNhD6IKmT
 yLoFt8iDB3cQnWp/ln6FCiSOZgOICWVOoCQBezGwMjU9+mG1kyQFYFbo5lh/beXmlvHb
 w9Id6PH/YH1FsxIw6SbKlx0WkjCnFk3o9Bxkd8myZDlDLwNeyLA501qTLnZHGUqOIyrG
 edMUExQ+1ARPWlZL4A5N1D4fNp2056mnXtx5w33NEYzu2RqrDtjlvNL4fuIpNSFcU+UB
 C5HzQDrAZy0lOOiBF785w7P/MwIrQrvgfbqcqNIOrul98cFs4OWAEXxov8FIZR/n34HH 7g==
Received: from dc5-exch02.marvell.com ([199.233.59.182])
 by mx0a-0016f401.pphosted.com (PPS) with ESMTPS id 3nkdyrsrbj-2
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
 for <dev@dpdk.org>; Tue, 07 Feb 2023 08:00:17 -0800
Received: from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH02.marvell.com
 (10.69.176.39) with Microsoft SMTP Server (TLS) id 15.0.1497.42;
 Tue, 7 Feb 2023 08:00:15 -0800
Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend
 Transport; Tue, 7 Feb 2023 08:00:15 -0800
Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233])
 by maili.marvell.com (Postfix) with ESMTP id 9D8D33F70AE;
 Tue,  7 Feb 2023 08:00:10 -0800 (PST)
From: Srikanth Yalavarthi <syalavarthi@marvell.com>
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
CC: <dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,
 <aprabhu@marvell.com>, <ptakkar@marvell.com>, <pshukla@marvell.com>
Subject: [PATCH v6 1/4] mldev: add headers for internal ML functions
Date: Tue, 7 Feb 2023 08:00:05 -0800
Message-ID: <20230207160008.30182-2-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
MIME-Version: 1.0
X-Proofpoint-ORIG-GUID: 7sOkM9pHB9gKAGVVPYyuMK_lYSrB44hy
X-Proofpoint-GUID: 7sOkM9pHB9gKAGVVPYyuMK_lYSrB44hy
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1
 definitions=2023-02-07_07,2023-02-06_03,2022-06-22_01
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Added header files for internal ML utility routines to convert
IO type and format to string, IO type to size and routines to
convert data types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
Depends-on: series-26858 ("Implementation of mldev test application")

v6:
* Updated release notes and series dependencies

v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v3:
* Skip installation of internal common/ml headers

v2:
* Moved implementation out of patch. Only headers are included.

 doc/guides/rel_notes/release_23_03.rst |   5 +
 lib/mldev/meson.build                  |   2 +
 lib/mldev/mldev_utils.c                |   5 +
 lib/mldev/mldev_utils.h                | 345 +++++++++++++++++++++++++
 4 files changed, 357 insertions(+)
 create mode 100644 lib/mldev/mldev_utils.c
 create mode 100644 lib/mldev/mldev_utils.h

--
2.17.1

diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index cd1ac98abe..425323241e 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -95,6 +95,11 @@ New Features
   * Test case for inferences from multiple models in ordered mode.
   * Test case for inferences from multiple models.in interleaving mode.

+* **Added common driver functions for machine learning device library.**
+
+  * Added functions to translate IO type and format to string.
+  * Added functions to quantize and dequantize inference IO data.
+

 Removed Items
 -------------
diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 5c99532c1a..452b83a480 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -4,6 +4,7 @@
 sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
+        'mldev_utils.c',
 )

 headers = files(
@@ -16,6 +17,7 @@ indirect_headers += files(

 driver_sdk_headers += files(
         'rte_mldev_pmd.h',
+        'mldev_utils.h',
 )

 deps += ['mempool']
diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
new file mode 100644
index 0000000000..9dbbf013a0
--- /dev/null
+++ b/lib/mldev/mldev_utils.c
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include "mldev_utils.h"
diff --git a/lib/mldev/mldev_utils.h b/lib/mldev/mldev_utils.h
new file mode 100644
index 0000000000..04cdaab567
--- /dev/null
+++ b/lib/mldev/mldev_utils.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _RTE_MLDEV_UTILS_H_
+#define _RTE_MLDEV_UTILS_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file
+ *
+ * RTE ML Device PMD utility API
+ *
+ * These APIs for the use from ML drivers, user applications shouldn't use them.
+ *
+ */
+
+#include <rte_compat.h>
+#include <rte_mldev.h>
+
+/**
+ * @internal
+ *
+ * Get the size an ML IO type in bytes.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ *
+ * @return
+ *	- > 0, Size of the data type in bytes.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO type.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO data type.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Get the name of an ML IO format.
+ *
+ * @param[in] type
+ *	Enumeration of ML IO format.
+ * @param[in] str
+ *	Address of character array.
+ * @param[in] len
+ *	Length of character array.
+ */
+__rte_internal
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed 8-bit
+ * integer format (INT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 8-bit integer format (INT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 8-bit integer format (UINT8).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 8-bit integer format (UINT8) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT8 numbers. Size of buffer is equal to (nb_elements * 1) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to signed
+ * 16-bit integer format (INT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in signed 16-bit integer format (INT16) to single precision
+ * floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing INT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to unsigned
+ * 16-bit integer format (UINT16).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ * @param[out] output
+ *	Output buffer to store UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in unsigned 16-bit integer format (UINT16) to single
+ * precision floating format (float32).
+ *
+ * @param[in] scale
+ *      Scale factor for conversion.
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing UINT16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to half
+ * precision floating point format (FP16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements *4) bytes.
+ * @param[out] output
+ *	Output buffer to store float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in half precision floating format (FP16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in single precision floating format (float32) to brain
+ * floating point format (bfloat16).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing float32 numbers. Size of buffer is equal to (nb_elements *4) bytes.
+ * @param[out] output
+ *	Output buffer to store bfloat16 numbers. Size of buffer is equal to (nb_elements * 2) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output);
+
+/**
+ * @internal
+ *
+ * Convert a buffer containing numbers in brain floating point format (bfloat16) to single precision
+ * floating point format (float32).
+ *
+ * @param[in] nb_elements
+ *	Number of elements in the buffer.
+ * @param[in] input
+ *	Input buffer containing bfloat16 numbers. Size of buffer is equal to (nb_elements * 2)
+ * bytes.
+ * @param[out] output
+ *	Output buffer to store float32 numbers. Size of buffer is equal to (nb_elements * 4) bytes.
+ *
+ * @return
+ *	- 0, Success.
+ *	- < 0, Error code on failure.
+ */
+__rte_internal
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MLDEV_UTILS_H_ */

From patchwork Tue Feb  7 16:00:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi <syalavarthi@marvell.com>
X-Patchwork-Id: 123313
X-Patchwork-Delegate: thomas@monjalon.net
Return-Path: <dev-bounces@dpdk.org>
X-Original-To: patchwork@inbox.dpdk.org
Delivered-To: patchwork@inbox.dpdk.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id DDA6E41C30;
	Tue,  7 Feb 2023 17:00:27 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 0B751427F2;
	Tue,  7 Feb 2023 17:00:23 +0100 (CET)
Received: from mx0b-0016f401.pphosted.com (mx0a-0016f401.pphosted.com
 [67.231.148.174])
 by mails.dpdk.org (Postfix) with ESMTP id 4A17F40A84
 for <dev@dpdk.org>; Tue,  7 Feb 2023 17:00:20 +0100 (CET)
Received: from pps.filterd (m0045849.ppops.net [127.0.0.1])
 by mx0a-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id
 317EdPa2011264 for <dev@dpdk.org>; Tue, 7 Feb 2023 08:00:18 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;
 h=from : to : cc :
 subject : date : message-id : in-reply-to : references : mime-version :
 content-type; s=pfpt0220; bh=OcXiFop2NiZ7y5kWoqlZ1KEmeMNNkXYO+YOoxaMIGJ8=;
 b=CWzAvXotedREJG6igF7XLS+rrcHGzzSAWOdCxLmR7uwDp9DxTF29kkGNgQJW9+cfB/sV
 50XzRPC826ZLlHy+JVmhuLWNo3gkQTUDUEEaAcsgQwKJhypj2gZ6tElX/zV7hQ2C6Ebp
 nNynLctK+U8yCAen5yC1jdvNC4q9QqZ4/Mi1NHzASKK6HZhLQ5y/k5PrPkTLX6dzXzwT
 8YalK68G3/QHqwiUlGCotjP0gHWFvsYgXRPMs9/344Y9kd0LRnig/m2Cet9oeGXdw7Um
 57XlOoaoCsHlot4Zcx37DGkTbpBLok5Jn4Yb7C+4hoefPbOr/iCIs54qiWRPBNbkH4t8 fg==
Received: from dc5-exch02.marvell.com ([199.233.59.182])
 by mx0a-0016f401.pphosted.com (PPS) with ESMTPS id 3nkdyrsrbj-4
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
 for <dev@dpdk.org>; Tue, 07 Feb 2023 08:00:18 -0800
Received: from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH02.marvell.com
 (10.69.176.39) with Microsoft SMTP Server (TLS) id 15.0.1497.42;
 Tue, 7 Feb 2023 08:00:16 -0800
Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend
 Transport; Tue, 7 Feb 2023 08:00:16 -0800
Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233])
 by maili.marvell.com (Postfix) with ESMTP id 002AC3F70BE;
 Tue,  7 Feb 2023 08:00:10 -0800 (PST)
From: Srikanth Yalavarthi <syalavarthi@marvell.com>
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
CC: <dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,
 <aprabhu@marvell.com>, <ptakkar@marvell.com>, <pshukla@marvell.com>
Subject: [PATCH v6 2/4] mldev: implement ML IO type handling functions
Date: Tue, 7 Feb 2023 08:00:06 -0800
Message-ID: <20230207160008.30182-3-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
MIME-Version: 1.0
X-Proofpoint-ORIG-GUID: 7-LdFhEWh0BlYXAYMrLHiNScHpw8XQAB
X-Proofpoint-GUID: 7-LdFhEWh0BlYXAYMrLHiNScHpw8XQAB
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1
 definitions=2023-02-07_07,2023-02-06_03,2022-06-22_01
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Implemented ML utility functions to convert IO data type to name,
IO format to name and routine to get the size of an IO data type
in bytes.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Implemented common utility functions as part of the patch
* Dropped use of driver routines for data conversion functions

 lib/mldev/mldev_utils.c | 113 ++++++++++++++++++++++++++++++++++++++++
 lib/mldev/version.map   |   4 ++
 2 files changed, 117 insertions(+)

--
2.17.1

diff --git a/lib/mldev/mldev_utils.c b/lib/mldev/mldev_utils.c
index 9dbbf013a0..d2442b123b 100644
--- a/lib/mldev/mldev_utils.c
+++ b/lib/mldev/mldev_utils.c
@@ -2,4 +2,117 @@
  * Copyright (c) 2022 Marvell.
  */

+#include <errno.h>
+#include <stdint.h>
+
+#include <rte_mldev.h>
+#include <rte_string_fns.h>
+
 #include "mldev_utils.h"
+
+/* Description:
+ * This file implements Machine Learning utility routines, except type conversion routines.
+ */
+
+int
+rte_ml_io_type_size_get(enum rte_ml_io_type type)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		return -EINVAL;
+	case RTE_ML_IO_TYPE_INT8:
+		return sizeof(int8_t);
+	case RTE_ML_IO_TYPE_UINT8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_INT16:
+		return sizeof(int16_t);
+	case RTE_ML_IO_TYPE_UINT16:
+		return sizeof(uint16_t);
+	case RTE_ML_IO_TYPE_INT32:
+		return sizeof(int32_t);
+	case RTE_ML_IO_TYPE_UINT32:
+		return sizeof(uint32_t);
+	case RTE_ML_IO_TYPE_FP8:
+		return sizeof(uint8_t);
+	case RTE_ML_IO_TYPE_FP16:
+		return sizeof(uint8_t) * 2;
+	case RTE_ML_IO_TYPE_FP32:
+		return sizeof(uint8_t) * 4;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		return sizeof(uint8_t) * 2;
+	default:
+		return -EINVAL;
+	}
+}
+
+void
+rte_ml_io_type_to_str(enum rte_ml_io_type type, char *str, int len)
+{
+	switch (type) {
+	case RTE_ML_IO_TYPE_UNKNOWN:
+		rte_strlcpy(str, "unknown", len);
+		break;
+	case RTE_ML_IO_TYPE_INT8:
+		rte_strlcpy(str, "int8", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT8:
+		rte_strlcpy(str, "uint8", len);
+		break;
+	case RTE_ML_IO_TYPE_INT16:
+		rte_strlcpy(str, "int16", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT16:
+		rte_strlcpy(str, "uint16", len);
+		break;
+	case RTE_ML_IO_TYPE_INT32:
+		rte_strlcpy(str, "int32", len);
+		break;
+	case RTE_ML_IO_TYPE_UINT32:
+		rte_strlcpy(str, "uint32", len);
+		break;
+	case RTE_ML_IO_TYPE_FP8:
+		rte_strlcpy(str, "float8", len);
+		break;
+	case RTE_ML_IO_TYPE_FP16:
+		rte_strlcpy(str, "float16", len);
+		break;
+	case RTE_ML_IO_TYPE_FP32:
+		rte_strlcpy(str, "float32", len);
+		break;
+	case RTE_ML_IO_TYPE_BFLOAT16:
+		rte_strlcpy(str, "bfloat16", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
+
+void
+rte_ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
+{
+	switch (format) {
+	case RTE_ML_IO_FORMAT_NCHW:
+		rte_strlcpy(str, "NCHW", len);
+		break;
+	case RTE_ML_IO_FORMAT_NHWC:
+		rte_strlcpy(str, "NHWC", len);
+		break;
+	case RTE_ML_IO_FORMAT_CHWN:
+		rte_strlcpy(str, "CHWN", len);
+		break;
+	case RTE_ML_IO_FORMAT_3D:
+		rte_strlcpy(str, "3D", len);
+		break;
+	case RTE_ML_IO_FORMAT_2D:
+		rte_strlcpy(str, "Matrix", len);
+		break;
+	case RTE_ML_IO_FORMAT_1D:
+		rte_strlcpy(str, "Vector", len);
+		break;
+	case RTE_ML_IO_FORMAT_SCALAR:
+		rte_strlcpy(str, "Scalar", len);
+		break;
+	default:
+		rte_strlcpy(str, "invalid", len);
+	}
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index d2b30a991a..9d06659493 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -48,4 +48,8 @@ INTERNAL {
 	rte_ml_dev_pmd_get_dev;
 	rte_ml_dev_pmd_get_named_dev;
 	rte_ml_dev_pmd_release;
+
+	rte_ml_io_type_size_get;
+	rte_ml_io_type_to_str;
+	rte_ml_io_format_to_str;
 };

From patchwork Tue Feb  7 16:00:07 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi <syalavarthi@marvell.com>
X-Patchwork-Id: 123315
X-Patchwork-Delegate: thomas@monjalon.net
Return-Path: <dev-bounces@dpdk.org>
X-Original-To: patchwork@inbox.dpdk.org
Delivered-To: patchwork@inbox.dpdk.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 7716841C30;
	Tue,  7 Feb 2023 17:00:45 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id AC3E642D13;
	Tue,  7 Feb 2023 17:00:25 +0100 (CET)
Received: from mx0b-0016f401.pphosted.com (mx0a-0016f401.pphosted.com
 [67.231.148.174])
 by mails.dpdk.org (Postfix) with ESMTP id 07D004021F
 for <dev@dpdk.org>; Tue,  7 Feb 2023 17:00:20 +0100 (CET)
Received: from pps.filterd (m0045849.ppops.net [127.0.0.1])
 by mx0a-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id
 317EdWj9011298 for <dev@dpdk.org>; Tue, 7 Feb 2023 08:00:20 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;
 h=from : to : cc :
 subject : date : message-id : in-reply-to : references : mime-version :
 content-type; s=pfpt0220; bh=l+umPyvoTctOp+dXR5qc/h2JYv1B5GyN0DgpWQd4AhQ=;
 b=glqALXSJ6v0j/xqeAQj/MV1YNfbCljUj7kN1M0dWAT1mD/vKXijI1RnYXBjqBk/LUV7y
 b/zjWZ9cZ831hVSORUp6e964QLjN/SJgmabYJaFmg++4FjRJXwB3Bg4ghlYCb3FG/37p
 kjg6LxNqPDBSnZLa3cr8MSko3UpwJ4BJXq0CdU+e+tvUxEYycoVrklMonE/TY93FuBVX
 0evgX1H3MeCgpkk0d7+UCP6hHA1YN7KOgL5d4fTr41bzDngeUirRFwcIMWiOFeDAKYKJ
 MgSSus+gS+LDtM0PBqseK++77prrohTq59yoPHZZEbXs7dVw6uARRyjdmMhwMaS2ZOy1 lA==
Received: from dc5-exch01.marvell.com ([199.233.59.181])
 by mx0a-0016f401.pphosted.com (PPS) with ESMTPS id 3nkdyrsrbu-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT)
 for <dev@dpdk.org>; Tue, 07 Feb 2023 08:00:20 -0800
Received: from DC5-EXCH02.marvell.com (10.69.176.39) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server (TLS) id 15.0.1497.42;
 Tue, 7 Feb 2023 08:00:18 -0800
Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH02.marvell.com
 (10.69.176.39) with Microsoft SMTP Server id 15.0.1497.42 via Frontend
 Transport; Tue, 7 Feb 2023 08:00:18 -0800
Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233])
 by maili.marvell.com (Postfix) with ESMTP id 5889C3F7124;
 Tue,  7 Feb 2023 08:00:11 -0800 (PST)
From: Srikanth Yalavarthi <syalavarthi@marvell.com>
To: Srikanth Yalavarthi <syalavarthi@marvell.com>
CC: <dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,
 <aprabhu@marvell.com>, <ptakkar@marvell.com>, <pshukla@marvell.com>
Subject: [PATCH v6 3/4] mldev: add scalar type conversion functions
Date: Tue, 7 Feb 2023 08:00:07 -0800
Message-ID: <20230207160008.30182-4-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
MIME-Version: 1.0
X-Proofpoint-ORIG-GUID: dpeDCu3EBGw-2lE1aUy5iyOAO2Jq19jM
X-Proofpoint-GUID: dpeDCu3EBGw-2lE1aUy5iyOAO2Jq19jM
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1
 definitions=2023-02-07_07,2023-02-06_03,2022-06-22_01
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Added scalar implementations to support conversion of data types.
Support is enabled to handle int8, uint8, int16, uint16, float16,
float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Updated internal function names
* Updated function attributes to __rte_weak

 lib/mldev/meson.build          |   1 +
 lib/mldev/mldev_utils_scalar.c | 720 +++++++++++++++++++++++++++++++++
 lib/mldev/version.map          |  12 +
 3 files changed, 733 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_scalar.c

--
2.17.1

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index 452b83a480..fce9c0ebee 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -5,6 +5,7 @@ sources = files(
         'rte_mldev_pmd.c',
         'rte_mldev.c',
         'mldev_utils.c',
+        'mldev_utils_scalar.c',
 )

 headers = files(
diff --git a/lib/mldev/mldev_utils_scalar.c b/lib/mldev/mldev_utils_scalar.c
new file mode 100644
index 0000000000..40320ed3ef
--- /dev/null
+++ b/lib/mldev/mldev_utils_scalar.c
@@ -0,0 +1,720 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include "mldev_utils.h"
+
+/* Description:
+ * This file implements scalar versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa.
+ */
+
+#ifndef BIT
+#define BIT(nr) (1UL << (nr))
+#endif
+
+#ifndef BITS_PER_LONG
+#define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
+#endif
+
+#ifndef GENMASK_U32
+#define GENMASK_U32(h, l) (((~0UL) << (l)) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
+#endif
+
+/* float32: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP32_LSB_M 0
+#define FP32_MSB_M 22
+#define FP32_LSB_E 23
+#define FP32_MSB_E 30
+#define FP32_LSB_S 31
+#define FP32_MSB_S 31
+
+/* float32: bitmask for sign, exponent and mantissa */
+#define FP32_MASK_S GENMASK_U32(FP32_MSB_S, FP32_LSB_S)
+#define FP32_MASK_E GENMASK_U32(FP32_MSB_E, FP32_LSB_E)
+#define FP32_MASK_M GENMASK_U32(FP32_MSB_M, FP32_LSB_M)
+
+/* float16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define FP16_LSB_M 0
+#define FP16_MSB_M 9
+#define FP16_LSB_E 10
+#define FP16_MSB_E 14
+#define FP16_LSB_S 15
+#define FP16_MSB_S 15
+
+/* float16: bitmask for sign, exponent and mantissa */
+#define FP16_MASK_S GENMASK_U32(FP16_MSB_S, FP16_LSB_S)
+#define FP16_MASK_E GENMASK_U32(FP16_MSB_E, FP16_LSB_E)
+#define FP16_MASK_M GENMASK_U32(FP16_MSB_M, FP16_LSB_M)
+
+/* bfloat16: bit index of MSB & LSB of sign, exponent and mantissa */
+#define BF16_LSB_M 0
+#define BF16_MSB_M 6
+#define BF16_LSB_E 7
+#define BF16_MSB_E 14
+#define BF16_LSB_S 15
+#define BF16_MSB_S 15
+
+/* bfloat16: bitmask for sign, exponent and mantissa */
+#define BF16_MASK_S GENMASK_U32(BF16_MSB_S, BF16_LSB_S)
+#define BF16_MASK_E GENMASK_U32(BF16_MSB_E, BF16_LSB_E)
+#define BF16_MASK_M GENMASK_U32(BF16_MSB_M, BF16_LSB_M)
+
+/* Exponent bias */
+#define FP32_BIAS_E 127
+#define FP16_BIAS_E 15
+#define BF16_BIAS_E 127
+
+#define FP32_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP32_LSB_S) | ((exponent) << FP32_LSB_E) | (mantissa))
+
+#define FP16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << FP16_LSB_S) | ((exponent) << FP16_LSB_E) | (mantissa))
+
+#define BF16_PACK(sign, exponent, mantissa)                                                        \
+	(((sign) << BF16_LSB_S) | ((exponent) << BF16_LSB_E) | (mantissa))
+
+/* Represent float32 as float and uint32_t */
+union float32 {
+	float f;
+	uint32_t u;
+};
+
+__rte_weak int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t i;
+	int i32;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT8_MIN)
+			i32 = INT8_MIN;
+
+		if (i32 > INT8_MAX)
+			i32 = INT8_MAX;
+
+		*output_buffer = (int8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT8_MAX)
+			i32 = UINT8_MAX;
+
+		*output_buffer = (uint8_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < INT16_MIN)
+			i32 = INT16_MIN;
+
+		if (i32 > INT16_MAX)
+			i32 = INT16_MAX;
+
+		*output_buffer = (int16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	int32_t i32;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		i32 = (int32_t)round((*input_buffer) * scale);
+
+		if (i32 < 0)
+			i32 = 0;
+
+		if (i32 > UINT16_MAX)
+			i32 = UINT16_MAX;
+
+		*output_buffer = (uint16_t)i32;
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+__rte_weak int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = scale * (float)(*input_buffer);
+
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a half precision
+ * floating point number (float16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_float16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint32_t tmsb;	   /* MSB position of truncated bits */
+	uint32_t m_32;	   /* temporary float32 mantissa */
+	uint16_t m_16;	   /* temporary float16 mantissa */
+	uint16_t u16;	   /* float16 output */
+	int be_16;	   /* float16 biased exponent, signed */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	f16_s = f32_s;
+	f16_e = 0;
+	f16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		f16_e = 0;
+		if (f32_m == 0) /* zero */
+			f16_m = 0;
+		else /* subnormal number, convert to zero */
+			f16_m = 0;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		f16_e = FP16_MASK_E >> FP16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			f16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			f16_m = f32_m >> (FP32_MSB_M - FP16_MSB_M);
+			f16_m |= BIT(FP16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number */
+		/* compute biased exponent for float16 */
+		be_16 = (int)f32_e - FP32_BIAS_E + FP16_BIAS_E;
+
+		/* overflow, be_16 = [31-INF], set to infinity */
+		if (be_16 >= (int)(FP16_MASK_E >> FP16_LSB_E)) {
+			f16_e = FP16_MASK_E >> FP16_LSB_E;
+			f16_m = 0;
+		} else if ((be_16 >= 1) && (be_16 < (int)(FP16_MASK_E >> FP16_LSB_E))) {
+			/* normal float16, be_16 = [1:30]*/
+			f16_e = be_16;
+			m_16 = f32_m >> (FP32_LSB_E - FP16_LSB_E);
+			tmsb = FP32_MSB_M - FP16_MSB_M - 1;
+			if ((f32_m & GENMASK_U32(tmsb, 0)) > BIT(tmsb)) {
+				/* round: non-zero truncated bits except MSB */
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tmsb, 0)) == BIT(tmsb)) {
+				/* round: MSB of truncated bits and LSB of m_16 is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if ((be_16 >= -(int)(FP16_MSB_M)) && (be_16 < 1)) {
+			/* underflow: zero / subnormal, be_16 = [-9:0] */
+			f16_e = 0;
+
+			/* add implicit leading zero */
+			m_32 = f32_m | BIT(FP32_LSB_E);
+			tbits = FP32_LSB_E - FP16_LSB_E - be_16 + 1;
+			m_16 = m_32 >> tbits;
+
+			/* if non-leading truncated bits are set */
+			if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+				m_16++;
+
+				/* overflow into exponent */
+				if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+					f16_e++;
+			} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+				/* if leading truncated bit is set */
+				if ((m_16 & 0x1) == 0x1) {
+					m_16++;
+
+					/* overflow into exponent */
+					if (((m_16 & FP16_MASK_E) >> FP16_LSB_E) == 0x1)
+						f16_e++;
+				}
+			}
+			f16_m = m_16 & FP16_MASK_M;
+		} else if (be_16 == -(int)(FP16_MSB_M + 1)) {
+			/* underflow: zero, be_16 = [-10] */
+			f16_e = 0;
+			if (f32_m != 0)
+				f16_m = 1;
+			else
+				f16_m = 0;
+		} else {
+			/* underflow: zero, be_16 = [-INF:-11] */
+			f16_e = 0;
+			f16_m = 0;
+		}
+
+		break;
+	}
+
+	u16 = FP16_PACK(f16_s, f16_e, f16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_float16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a half precision floating point number (float16) into a single precision
+ * floating point number (float32).
+ */
+static float
+__float16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t f16_s;	   /* float16 sign */
+	uint16_t f16_e;	   /* float16 exponent */
+	uint16_t f16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa*/
+	uint8_t shift;	   /* number of bits to be shifted */
+	uint32_t clz;	   /* count of leading zeroes */
+	int e_16;	   /* float16 exponent unbiased */
+
+	f16_s = (f16 & FP16_MASK_S) >> FP16_LSB_S;
+	f16_e = (f16 & FP16_MASK_E) >> FP16_LSB_E;
+	f16_m = (f16 & FP16_MASK_M) >> FP16_LSB_M;
+
+	f32_s = f16_s;
+	switch (f16_e) {
+	case (FP16_MASK_E >> FP16_LSB_E): /* float16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (f16_m == 0x0) { /* infinity */
+			f32_m = f16_m;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = f16_m;
+			shift = FP32_MSB_M - FP16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* float16: zero or sub-normal */
+		f32_m = f16_m;
+		if (f16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			clz = __builtin_clz((uint32_t)f16_m) - sizeof(uint32_t) * 8 + FP16_LSB_E;
+			e_16 = (int)f16_e - clz;
+			f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+			shift = clz + (FP32_MSB_M - FP16_MSB_M) + 1;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+		}
+		break;
+	default: /* normal numbers */
+		f32_m = f16_m;
+		e_16 = (int)f16_e;
+		f32_e = FP32_BIAS_E + e_16 - FP16_BIAS_E;
+
+		shift = (FP32_MSB_M - FP16_MSB_M);
+		f32_m = (f32_m << shift) & FP32_MASK_M;
+	}
+
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a single precision floating point number (float32) into a
+ * brain float number (bfloat16) using round to nearest rounding mode.
+ */
+static uint16_t
+__float32_to_bfloat16_scalar_rtn(float x)
+{
+	union float32 f32; /* float32 input */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa */
+	uint16_t b16_s;	   /* float16 sign */
+	uint16_t b16_e;	   /* float16 exponent */
+	uint16_t b16_m;	   /* float16 mantissa */
+	uint32_t tbits;	   /* number of truncated bits */
+	uint16_t u16;	   /* float16 output */
+
+	f32.f = x;
+	f32_s = (f32.u & FP32_MASK_S) >> FP32_LSB_S;
+	f32_e = (f32.u & FP32_MASK_E) >> FP32_LSB_E;
+	f32_m = (f32.u & FP32_MASK_M) >> FP32_LSB_M;
+
+	b16_s = f32_s;
+	b16_e = 0;
+	b16_m = 0;
+
+	switch (f32_e) {
+	case (0): /* float32: zero or subnormal number */
+		b16_e = 0;
+		if (f32_m == 0) /* zero */
+			b16_m = 0;
+		else /* subnormal float32 number, normal bfloat16 */
+			goto bf16_normal;
+		break;
+	case (FP32_MASK_E >> FP32_LSB_E): /* float32: infinity or nan */
+		b16_e = BF16_MASK_E >> BF16_LSB_E;
+		if (f32_m == 0) { /* infinity */
+			b16_m = 0;
+		} else { /* nan, propagate mantissa and set MSB of mantissa to 1 */
+			b16_m = f32_m >> (FP32_MSB_M - BF16_MSB_M);
+			b16_m |= BIT(BF16_MSB_M);
+		}
+		break;
+	default: /* float32: normal number, normal bfloat16 */
+		goto bf16_normal;
+	}
+
+	goto bf16_pack;
+
+bf16_normal:
+	b16_e = f32_e;
+	tbits = FP32_MSB_M - BF16_MSB_M;
+	b16_m = f32_m >> tbits;
+
+	/* if non-leading truncated bits are set */
+	if ((f32_m & GENMASK_U32(tbits - 1, 0)) > BIT(tbits - 1)) {
+		b16_m++;
+
+		/* if overflow into exponent */
+		if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+			b16_e++;
+	} else if ((f32_m & GENMASK_U32(tbits - 1, 0)) == BIT(tbits - 1)) {
+		/* if only leading truncated bit is set */
+		if ((b16_m & 0x1) == 0x1) {
+			b16_m++;
+
+			/* if overflow into exponent */
+			if (((b16_m & BF16_MASK_E) >> BF16_LSB_E) == 0x1)
+				b16_e++;
+		}
+	}
+	b16_m = b16_m & BF16_MASK_M;
+
+bf16_pack:
+	u16 = BF16_PACK(b16_s, b16_e, b16_m);
+
+	return u16;
+}
+
+__rte_weak int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __float32_to_bfloat16_scalar_rtn(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
+
+/* Convert a brain float number (bfloat16) into a
+ * single precision floating point number (float32).
+ */
+static float
+__bfloat16_to_float32_scalar_rtx(uint16_t f16)
+{
+	union float32 f32; /* float32 output */
+	uint16_t b16_s;	   /* float16 sign */
+	uint16_t b16_e;	   /* float16 exponent */
+	uint16_t b16_m;	   /* float16 mantissa */
+	uint32_t f32_s;	   /* float32 sign */
+	uint32_t f32_e;	   /* float32 exponent */
+	uint32_t f32_m;	   /* float32 mantissa*/
+	uint8_t shift;	   /* number of bits to be shifted */
+
+	b16_s = (f16 & BF16_MASK_S) >> BF16_LSB_S;
+	b16_e = (f16 & BF16_MASK_E) >> BF16_LSB_E;
+	b16_m = (f16 & BF16_MASK_M) >> BF16_LSB_M;
+
+	f32_s = b16_s;
+	switch (b16_e) {
+	case (BF16_MASK_E >> BF16_LSB_E): /* bfloat16: infinity or nan */
+		f32_e = FP32_MASK_E >> FP32_LSB_E;
+		if (b16_m == 0x0) { /* infinity */
+			f32_m = 0;
+		} else { /* nan, propagate mantissa, set MSB of mantissa to 1 */
+			f32_m = b16_m;
+			shift = FP32_MSB_M - BF16_MSB_M;
+			f32_m = (f32_m << shift) & FP32_MASK_M;
+			f32_m |= BIT(FP32_MSB_M);
+		}
+		break;
+	case 0: /* bfloat16: zero or subnormal */
+		f32_m = b16_m;
+		if (b16_m == 0) { /* zero signed */
+			f32_e = 0;
+		} else { /* subnormal numbers */
+			goto fp32_normal;
+		}
+		break;
+	default: /* bfloat16: normal number */
+		goto fp32_normal;
+	}
+
+	goto fp32_pack;
+
+fp32_normal:
+	f32_m = b16_m;
+	f32_e = FP32_BIAS_E + b16_e - BF16_BIAS_E;
+
+	shift = (FP32_MSB_M - BF16_MSB_M);
+	f32_m = (f32_m << shift) & FP32_MASK_M;
+
+fp32_pack:
+	f32.u = FP32_PACK(f32_s, f32_e, f32_m);
+
+	return f32.f;
+}
+
+__rte_weak int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+
+	for (i = 0; i < nb_elements; i++) {
+		*output_buffer = __bfloat16_to_float32_scalar_rtx(*input_buffer);
+
+		input_buffer = input_buffer + 1;
+		output_buffer = output_buffer + 1;
+	}
+
+	return 0;
+}
diff --git a/lib/mldev/version.map b/lib/mldev/version.map
index 9d06659493..0706b565be 100644
--- a/lib/mldev/version.map
+++ b/lib/mldev/version.map
@@ -52,4 +52,16 @@ INTERNAL {
 	rte_ml_io_type_size_get;
 	rte_ml_io_type_to_str;
 	rte_ml_io_format_to_str;
+	rte_ml_io_float32_to_int8;
+	rte_ml_io_int8_to_float32;
+	rte_ml_io_float32_to_uint8;
+	rte_ml_io_uint8_to_float32;
+	rte_ml_io_float32_to_int16;
+	rte_ml_io_int16_to_float32;
+	rte_ml_io_float32_to_uint16;
+	rte_ml_io_uint16_to_float32;
+	rte_ml_io_float32_to_float16;
+	rte_ml_io_float16_to_float32;
+	rte_ml_io_float32_to_bfloat16;
+	rte_ml_io_bfloat16_to_float32;
 };

From patchwork Tue Feb  7 16:00:08 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Srikanth Yalavarthi <syalavarthi@marvell.com>
X-Patchwork-Id: 123316
X-Patchwork-Delegate: thomas@monjalon.net
Return-Path: <dev-bounces@dpdk.org>
X-Original-To: patchwork@inbox.dpdk.org
Delivered-To: patchwork@inbox.dpdk.org
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 084FA41C30;
	Tue,  7 Feb 2023 17:00:54 +0100 (CET)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id B614642D1A;
	Tue,  7 Feb 2023 17:00:26 +0100 (CET)
Received: from mx0b-0016f401.pphosted.com (mx0b-0016f401.pphosted.com
 [67.231.156.173])
 by mails.dpdk.org (Postfix) with ESMTP id 7F4BD42D0E
 for <dev@dpdk.org>; Tue,  7 Feb 2023 17:00:25 +0100 (CET)
Received: from pps.filterd (m0045851.ppops.net [127.0.0.1])
 by mx0b-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id
 317BEAMO015844; Tue, 7 Feb 2023 08:00:21 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;
 h=from : to : cc :
 subject : date : message-id : in-reply-to : references : mime-version :
 content-type; s=pfpt0220; bh=KPwoQ0e/FPVHNZB8nsJz2b7Xj3MqEwHr28eVfdwFC10=;
 b=PJJkQLSZHvHuPa/JgafGAlM6ZqxcKUQMA1p3EYWBITpCe4DH/5HGlimayehSsKkGcz+x
 6rRiiYiiiKYwHJcRkLNl+my83oLoHocOlpo8YTahYNh2ZYer92rUflqFDc7Vqf2hphhz
 G8PK8iXBus+Sfg77uhOrEuix6r0mg8k9KVil8u4jNbr36/oP9eVXYqu4Jz21KO/duHRq
 uM8R8WZJw7yDt9L/XYw+BIA4wrZsL5HGvthcXMzau+zKczXlcuQ0wQA9l+hij7mmTMib
 2BlKguihQIvotQ8BISMmPYAYcQDxBxRk1MII9BM7Qee1uMIsCyPElweFInBMXpfPUlqQ oQ==
Received: from dc5-exch02.marvell.com ([199.233.59.182])
 by mx0b-0016f401.pphosted.com (PPS) with ESMTPS id 3nhqrtmr96-2
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT);
 Tue, 07 Feb 2023 08:00:20 -0800
Received: from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH02.marvell.com
 (10.69.176.39) with Microsoft SMTP Server (TLS) id 15.0.1497.42;
 Tue, 7 Feb 2023 08:00:18 -0800
Received: from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com
 (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend
 Transport; Tue, 7 Feb 2023 08:00:18 -0800
Received: from ml-host-33.caveonetworks.com (unknown [10.110.143.233])
 by maili.marvell.com (Postfix) with ESMTP id E7E233F70EC;
 Tue,  7 Feb 2023 08:00:11 -0800 (PST)
From: Srikanth Yalavarthi <syalavarthi@marvell.com>
To: Srikanth Yalavarthi <syalavarthi@marvell.com>, Ruifeng Wang
 <ruifeng.wang@arm.com>
CC: <dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,
 <aprabhu@marvell.com>, <ptakkar@marvell.com>, <pshukla@marvell.com>
Subject: [PATCH v6 4/4] mldev: add Arm NEON type conversion routines
Date: Tue, 7 Feb 2023 08:00:08 -0800
Message-ID: <20230207160008.30182-5-syalavarthi@marvell.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20230207160008.30182-1-syalavarthi@marvell.com>
References: <20221208193532.16718-1-syalavarthi@marvell.com>
 <20230207160008.30182-1-syalavarthi@marvell.com>
MIME-Version: 1.0
X-Proofpoint-GUID: j95RbRgL9hdLHi6ro8AOLWkUhiPP_O5d
X-Proofpoint-ORIG-GUID: j95RbRgL9hdLHi6ro8AOLWkUhiPP_O5d
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1
 definitions=2023-02-07_07,2023-02-06_03,2022-06-22_01
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

Added ARM NEON intrinsic based implementations to support conversion
of data types. Support is enabled to handle int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
v5:
* Moved the code from drivers/common/ml to lib/mldev
* Added rte_ml_io_ prefix to the functions

v2:
* Dropped use of driver routines to call neon functions
* Optimization of neon functions. Reduce the number of intrinsic calls.

 lib/mldev/meson.build        |   4 +
 lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++
 2 files changed, 877 insertions(+)
 create mode 100644 lib/mldev/mldev_utils_neon.c

--
2.17.1

diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build
index fce9c0ebee..05694b0839 100644
--- a/lib/mldev/meson.build
+++ b/lib/mldev/meson.build
@@ -8,6 +8,10 @@ sources = files(
         'mldev_utils_scalar.c',
 )

+if arch_subdir == 'arm'
+    sources += files('mldev_utils_neon.c')
+endif
+
 headers = files(
         'rte_mldev.h',
 )
diff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c
new file mode 100644
index 0000000000..32b620db20
--- /dev/null
+++ b/lib/mldev/mldev_utils_neon.c
@@ -0,0 +1,873 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "mldev_utils.h"
+
+#include <arm_neon.h>
+
+/* Description:
+ * This file implements vector versions of Machine Learning utility functions used to convert data
+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm
+ * Neon intrinsics.
+ */
+
+static inline void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_l = vqmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, saturate narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s16x4_h = vqmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vqmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static inline void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	int32_t s32;
+	int16_t s16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	s16 = vqmovns_s32(s32);
+
+	/* convert to int8_t */
+	*output = vqmovnh_s16(s16);
+}
+
+int
+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint8x8_t u8x8;
+
+	/* load 4 float elements, scale, convert, saturate narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_l = vqmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, saturate narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u16x4_h = vqmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vqmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static inline void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	uint32_t u32;
+	uint16_t u16;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	u16 = vqmovns_u32(u32);
+
+	/* convert to uint8_t */
+	*output = vqmovnh_u16(u16);
+}
+
+int
+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static inline void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* saturate narrow to int16x4_t */
+	s16x4 = vqmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static inline void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	int32_t s32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	s32 = vcvtas_s32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_s32(s32);
+}
+
+int
+rte_ml_io_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* saturate narrow */
+	u16x4 = vqmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static inline void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	uint32_t u32;
+
+	/* scale and convert, round to nearest with ties away rounding mode */
+	u32 = vcvtas_u32_f32(scale * (*input));
+
+	/* saturate narrow */
+	*output = vqmovns_u32(u32);
+}
+
+int
+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static inline void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static inline void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static inline void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static inline void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 / 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint64_t nb_iterations;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+	nb_iterations = nb_elements / vlen;
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < nb_iterations; i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */