[v1,4/4] common/ml: add Arm NEON type conversion routines

Message ID 20221208193532.16718-5-syalavarthi@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: implementation of ML common code

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/intel-Testing success Testing PASS
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS

Commit Message

Srikanth Yalavarthi Dec. 8, 2022, 7:35 p.m. UTC
  Added Arm NEON intrinsic-based implementations to support conversion
of data types. Support is added for the int8, uint8, int16, uint16,
float16, float32 and bfloat16 types.

Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
---
 drivers/common/ml/meson.build     |   5 +
 drivers/common/ml/ml_utils.c      |  48 ++
 drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
 drivers/common/ml/ml_utils_neon.h |  23 +
 4 files changed, 1026 insertions(+)
 create mode 100644 drivers/common/ml/ml_utils_neon.c
 create mode 100644 drivers/common/ml/ml_utils_neon.h
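
For context, a rough usage sketch of the conversion routines touched by this patch
(prototypes as defined in ml_utils.c below; the buffer sizes, scale value and the
wrapper function are illustrative assumptions, not part of the series):

/* Sketch: quantize float32 samples to int8 and dequantize them back.
 * Assumes the driver-internal ml_utils.h from this series is visible.
 */
#include <stdint.h>
#include "ml_utils.h"

static int
quantize_roundtrip(void)
{
	float in[16], out[16];
	int8_t q[16];
	float scale = 64.0f;	/* assumed quantization scale */
	uint64_t i;

	for (i = 0; i < 16; i++)
		in[i] = (float)i / 16.0f;

	/* both routines return 0 on success and -EINVAL on invalid arguments */
	if (ml_float32_to_int8(scale, 16, in, q) != 0)
		return -1;
	return ml_int8_to_float32(1.0f / scale, 16, q, out);
}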
  

Comments

Ruifeng Wang Dec. 12, 2022, 7:16 a.m. UTC | #1
> -----Original Message-----
> From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Sent: Friday, December 9, 2022 3:36 AM
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com; aprabhu@marvell.com
> Subject: [PATCH v1 4/4] common/ml: add Arm NEON type conversion routines
> 
> Added ARM NEON intrinsic based implementations to support conversion of data types.
> Support is enabled to handle int8, uint8, int16, uint16, float16, float32 and bfloat16
> types.
> 
> Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> ---
>  drivers/common/ml/meson.build     |   5 +
>  drivers/common/ml/ml_utils.c      |  48 ++
>  drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
> drivers/common/ml/ml_utils_neon.h |  23 +
>  4 files changed, 1026 insertions(+)
>  create mode 100644 drivers/common/ml/ml_utils_neon.c
>  create mode 100644 drivers/common/ml/ml_utils_neon.h
> 
> diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
> index 84ae84ee4e..f7ce19b4b4 100644
> --- a/drivers/common/ml/meson.build
> +++ b/drivers/common/ml/meson.build
> @@ -17,6 +17,11 @@ sources = files(
>          'ml_utils_generic.c',
>  )
> 
> +if arch_subdir == 'arm'
> +    headers += files('ml_utils_neon.h')
> +    sources += files('ml_utils_neon.c')
> +endif
> +
>  deps += ['mldev']
> 
>  pmd_supports_disable_iova_as_pa = true
> diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
> index e2edef0904..3edcf09fde 100644
> --- a/drivers/common/ml/ml_utils.c
> +++ b/drivers/common/ml/ml_utils.c
> @@ -120,71 +120,119 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
>  int
>  ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
>  {
> +#if defined(__ARM_NEON__)
> +	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> +#else
>  	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
> +#endif
>  }
> 
Maybe __rte_weak can be used to remove the ifdef clutter.

Something like:
ml_utils.c
__rte_weak int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
}
ml_utils_neon.c
int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
}

<snip>
> diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
> new file mode 100644
> index 0000000000..b660de07ec
> --- /dev/null
> +++ b/drivers/common/ml/ml_utils_neon.c
> @@ -0,0 +1,950 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2022 Marvell.
> + */
> +
> +#include <errno.h>
> +#include <math.h>
> +#include <stdint.h>
> +
> +#include <rte_common.h>
> +#include <rte_vect.h>
> +
> +#include "ml_utils.h"
> +#include "ml_utils_neon.h"
> +
> +#include <arm_neon.h>
This line can be removed. It is included in rte_vect.h.

Thanks.
<snip>
  
Srikanth Yalavarthi Dec. 12, 2022, 5:25 p.m. UTC | #2
> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: 12 December 2022 12:46
> To: Srikanth Yalavarthi <syalavarthi@marvell.com>
> Cc: dev@dpdk.org; Shivah Shankar Shankar Narayan Rao
> <sshankarnara@marvell.com>; Jerin Jacob Kollanukkaran
> <jerinj@marvell.com>; Anup Prabhu <aprabhu@marvell.com>; nd
> <nd@arm.com>; Srikanth Yalavarthi <syalavarthi@marvell.com>
> Subject: [EXT] RE: [PATCH v1 4/4] common/ml: add Arm NEON type
> conversion routines
> 
> External Email
> 
> ----------------------------------------------------------------------
> > -----Original Message-----
> > From: Srikanth Yalavarthi <syalavarthi@marvell.com>
> > Sent: Friday, December 9, 2022 3:36 AM
> > To: Srikanth Yalavarthi <syalavarthi@marvell.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>
> > Cc: dev@dpdk.org; sshankarnara@marvell.com; jerinj@marvell.com;
> > aprabhu@marvell.com
> > Subject: [PATCH v1 4/4] common/ml: add Arm NEON type conversion
> > routines
> >
> > Added ARM NEON intrinsic based implementations to support conversion
> of data types.
> > Support is enabled to handle int8, uint8, int16, uint16, float16,
> > float32 and bfloat16 types.
> >
> > Signed-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>
> > ---
> >  drivers/common/ml/meson.build     |   5 +
> >  drivers/common/ml/ml_utils.c      |  48 ++
> >  drivers/common/ml/ml_utils_neon.c | 950 ++++++++++++++++++++++++++++++
> >  drivers/common/ml/ml_utils_neon.h |  23 +
> >  4 files changed, 1026 insertions(+)
> >  create mode 100644 drivers/common/ml/ml_utils_neon.c
> >  create mode 100644 drivers/common/ml/ml_utils_neon.h
> >
> > diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
> > index 84ae84ee4e..f7ce19b4b4 100644
> > --- a/drivers/common/ml/meson.build
> > +++ b/drivers/common/ml/meson.build
> > @@ -17,6 +17,11 @@ sources = files(
> >          'ml_utils_generic.c',
> >  )
> >
> > +if arch_subdir == 'arm'
> > +    headers += files('ml_utils_neon.h')
> > +    sources += files('ml_utils_neon.c')
> > +endif
> > +
> >  deps += ['mldev']
> >
> >  pmd_supports_disable_iova_as_pa = true
> > diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
> > index e2edef0904..3edcf09fde 100644
> > --- a/drivers/common/ml/ml_utils.c
> > +++ b/drivers/common/ml/ml_utils.c
> > @@ -120,71 +120,119 @@ ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
> >  int
> >  ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> >  {
> > +#if defined(__ARM_NEON__)
> > +	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> > +#else
> >  	return ml_float32_to_int8_generic(scale, nb_elements, input,
> > output);
> > +#endif
> >  }
> >
> Maybe __rte_weak can be used to remove the ifdef clutter.
> 
> Something like:
> ml_utils.c
> __rte_weak int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> {
> 	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
> }
> ml_utils_neon.c
> int ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
> {
> 	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
> }
> 
Updated the common/ml series implementation. The scalar/generic routines are now defined as weak symbols.
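
For reference, a minimal sketch of that weak/strong split, assuming the generic
routine stays in ml_utils.c and the NEON object provides the strong override, as
suggested above:

/* ml_utils.c */
__rte_weak int
ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	/* weak fallback, used when no arch-specific object defines this symbol */
	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
}

/* ml_utils_neon.c, compiled only when arch_subdir == 'arm' (see meson.build) */
int
ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
{
	/* strong definition overrides the weak one at link time */
	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
}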

> <snip>
> > diff --git a/drivers/common/ml/ml_utils_neon.c
> > b/drivers/common/ml/ml_utils_neon.c
> > new file mode 100644
> > index 0000000000..b660de07ec
> > --- /dev/null
> > +++ b/drivers/common/ml/ml_utils_neon.c
> > @@ -0,0 +1,950 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2022 Marvell.
> > + */
> > +
> > +#include <errno.h>
> > +#include <math.h>
> > +#include <stdint.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_vect.h>
> > +
> > +#include "ml_utils.h"
> > +#include "ml_utils_neon.h"
> > +
> > +#include <arm_neon.h>
> This line can be removed. It is included in rte_vect.h.
Done
> 
> Thanks.
> <snip>
  

Patch

diff --git a/drivers/common/ml/meson.build b/drivers/common/ml/meson.build
index 84ae84ee4e..f7ce19b4b4 100644
--- a/drivers/common/ml/meson.build
+++ b/drivers/common/ml/meson.build
@@ -17,6 +17,11 @@  sources = files(
         'ml_utils_generic.c',
 )
 
+if arch_subdir == 'arm'
+    headers += files('ml_utils_neon.h')
+    sources += files('ml_utils_neon.c')
+endif
+
 deps += ['mldev']
 
 pmd_supports_disable_iova_as_pa = true
diff --git a/drivers/common/ml/ml_utils.c b/drivers/common/ml/ml_utils.c
index e2edef0904..3edcf09fde 100644
--- a/drivers/common/ml/ml_utils.c
+++ b/drivers/common/ml/ml_utils.c
@@ -120,71 +120,119 @@  ml_io_format_to_str(enum rte_ml_io_format format, char *str, int len)
 int
 ml_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_int8_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_int8_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_int8_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_int8_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_uint8_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_uint8_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_uint8_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_uint8_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_int16(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_int16_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_int16_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_int16_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_int16_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_uint16_neon(scale, nb_elements, input, output);
+#else
 	return ml_float32_to_uint16_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_uint16_to_float32_neon(scale, nb_elements, input, output);
+#else
 	return ml_uint16_to_float32_generic(scale, nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_float16(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float32_to_float16_neon(nb_elements, input, output);
+#else
 	return ml_float32_to_float16_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_float16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_NEON__)
+	return ml_float16_to_float32_neon(nb_elements, input, output);
+#else
 	return ml_float16_to_float32_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_FEATURE_BF16)
+	return ml_float32_to_bfloat16_neon(nb_elements, input, output);
+#else
 	return ml_float32_to_bfloat16_generic(nb_elements, input, output);
+#endif
 }
 
 int
 ml_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)
 {
+#if defined(__ARM_FEATURE_BF16)
+	return ml_bfloat16_to_float32_neon(nb_elements, input, output);
+#else
 	return ml_bfloat16_to_float32_generic(nb_elements, input, output);
+#endif
 }
diff --git a/drivers/common/ml/ml_utils_neon.c b/drivers/common/ml/ml_utils_neon.c
new file mode 100644
index 0000000000..b660de07ec
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.c
@@ -0,0 +1,950 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#include <errno.h>
+#include <math.h>
+#include <stdint.h>
+
+#include <rte_common.h>
+#include <rte_vect.h>
+
+#include "ml_utils.h"
+#include "ml_utils_neon.h"
+
+#include <arm_neon.h>
+
+static void
+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)
+{
+	int16x4_t s16x4_l;
+	int16x4_t s16x4_h;
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int32x4_t s32x4;
+	int32x4_t vmin;
+	int32x4_t vmax;
+	int8x8_t s8x8;
+
+	/* set constants */
+	vmin = vdupq_n_s32(INT8_MIN);
+	vmax = vdupq_n_s32(INT8_MAX);
+
+	/* load 4 float32 elements, scale, convert, update ranges and narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+	s16x4_l = vmovn_s32(s32x4);
+
+	/* load next 4 float32 elements, scale, convert, update ranges and narrow to int16.
+	 * Use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	s32x4 = vcvtaq_s32_f32(f32x4);
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+	s16x4_h = vmovn_s32(s32x4);
+
+	/* combine lower and higher int16x4_t to int16x8_t */
+	s16x8 = vcombine_s16(s16x4_l, s16x4_h);
+
+	/* narrow to int8_t */
+	s8x8 = vmovn_s16(s16x8);
+
+	/* store 8 elements */
+	vst1_s8(output, s8x8);
+}
+
+static void
+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)
+{
+	float32x2_t f32x2;
+	int32x2_t s32x2;
+	int32x2_t vmin;
+	int32x2_t vmax;
+	int8x8_t s8x8;
+
+	/* set constants */
+	vmin = vdup_n_s32(INT8_MIN);
+	vmax = vdup_n_s32(INT8_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert with use round to nearest with ties away rounding mode */
+	s32x2 = vcvta_s32_f32(f32x2);
+
+	/* update range [INT8_MIN:INT8_MAX] */
+	s32x2 = vmin_s32(s32x2, vmax);
+	s32x2 = vmax_s32(s32x2, vmin);
+
+	/* convert to int8_t */
+	s8x8 = vreinterpret_s8_s32(s32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_s8(output, s8x8, 0);
+}
+
+int
+ml_float32_to_int8_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int8_t *output_buffer;
+	uint32_t batch_size;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int8_t *)output;
+	batch_size = 2 * sizeof(float) / sizeof(int8_t);
+
+	/* convert batch_size elements in each iteration */
+	for (i = 0; i < (nb_elements / batch_size); i++) {
+		__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);
+		input_buffer += batch_size;
+		output_buffer += batch_size;
+	}
+
+	/* convert leftover elements */
+	i = i * batch_size;
+	for (; i < nb_elements; i++) {
+		__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x8_t s16x8;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int8x8_t s8x8;
+
+	/* load 8 x int8_t elements */
+	s8x8 = vld1_s8(input);
+
+	/* widen int8_t to int16_t */
+	s16x8 = vmovl_s8(s8x8);
+
+	/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_low_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */
+	s16x4 = vget_high_s16(s16x8);
+	s32x4 = vmovl_s16(s16x4);
+	f32x4 = vcvtq_f32_s32(s32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static void
+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int8_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)
+{
+	uint16x4_t u16x4_l;
+	uint16x4_t u16x4_h;
+	float32x4_t f32x4;
+	uint32x4_t u32x4;
+	uint16x8_t u16x8;
+	uint32x4_t vmax;
+	uint8x8_t u8x8;
+
+	/* set constants */
+	vmax = vdupq_n_u32(UINT8_MAX);
+
+	/* load 4 float elements, scale, convert, update range and narrow to uint16_t.
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u32x4 = vminq_u32(u32x4, vmax);
+	u16x4_l = vmovn_u32(u32x4);
+
+	/* load next 4 float elements, scale, convert, update range and narrow to uint16_t
+	 * use round to nearest with ties away rounding mode.
+	 */
+	f32x4 = vld1q_f32(input + 4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	u32x4 = vcvtaq_u32_f32(f32x4);
+	u32x4 = vminq_u32(u32x4, vmax);
+	u16x4_h = vmovn_u32(u32x4);
+
+	/* combine lower and higher uint16x4_t */
+	u16x8 = vcombine_u16(u16x4_l, u16x4_h);
+
+	/* narrow to uint8x8_t */
+	u8x8 = vmovn_u16(u16x8);
+
+	/* store 8 elements */
+	vst1_u8(output, u8x8);
+}
+
+static void
+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)
+{
+	float32x2_t f32x2;
+	uint32x2_t u32x2;
+	uint32x2_t vmax;
+	uint8x8_t u8x8;
+
+	/* set constants */
+	vmax = vdup_n_u32(UINT8_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert to uint32_t using round to nearest with ties away rounding mode */
+	u32x2 = vcvta_u32_f32(f32x2);
+
+	/* update range [0:UINT8_MAX] */
+	u32x2 = vmin_u32(u32x2, vmax);
+
+	/* convert to uint8x8_t */
+	u8x8 = vreinterpret_u8_u32(u32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_u8(output, u8x8, 0);
+}
+
+int
+ml_float32_to_uint8_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint8_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint8_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x8_t u16x8;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint8x8_t u8x8;
+
+	/* load 8 x uint8_t elements */
+	u8x8 = vld1_u8(input);
+
+	/* widen uint8_t to uint16_t */
+	u16x8 = vmovl_u8(u8x8);
+
+	/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_low_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output, f32x4);
+
+	/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */
+	u16x4 = vget_high_u16(u16x8);
+	u32x4 = vmovl_u16(u16x4);
+	f32x4 = vcvtq_f32_u32(u32x4);
+	f32x4 = vmulq_n_f32(f32x4, scale);
+	vst1q_f32(output + 4, f32x4);
+}
+
+static void
+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint8_t *input_buffer;
+	float *output_buffer;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint8_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint8_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+	int32x4_t vmin;
+	int32x4_t vmax;
+
+	/* set constants */
+	vmin = vdupq_n_s32(INT16_MIN);
+	vmax = vdupq_n_s32(INT16_MAX);
+
+	/* load 4 x float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert to int32x4_t using round to nearest with ties away rounding mode */
+	s32x4 = vcvtaq_s32_f32(f32x4);
+
+	/* update range [INT16_MIN:INT16_MAX] */
+	s32x4 = vminq_s32(s32x4, vmax);
+	s32x4 = vmaxq_s32(s32x4, vmin);
+
+	/* narrow to int16x4_t */
+	s16x4 = vmovn_s32(s32x4);
+
+	/* store 4 elements */
+	vst1_s16(output, s16x4);
+}
+
+static void
+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)
+{
+	float32x2_t f32x2;
+	int32x2_t s32x2;
+	int16x4_t s16x4;
+	int32x2_t vmin;
+	int32x2_t vmax;
+
+	/* set constants */
+	vmin = vdup_n_s32(INT16_MIN);
+	vmax = vdup_n_s32(INT16_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	s32x2 = vcvta_s32_f32(f32x2);
+
+	/* update range [INT16_MIN:INT16_MAX] */
+	s32x2 = vmin_s32(s32x2, vmax);
+	s32x2 = vmax_s32(s32x2, vmin);
+
+	/* convert to int16x4_t */
+	s16x4 = vreinterpret_s16_s32(s32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_s16(output, s16x4, 0);
+}
+
+int
+ml_float32_to_int16_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	int16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (int16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	int16x4_t s16x4;
+	int32x4_t s32x4;
+
+	/* load 4 x int16_t elements */
+	s16x4 = vld1_s16(input);
+
+	/* widen int16_t to int32_t */
+	s32x4 = vmovl_s16(s16x4);
+
+	/* convert int32_t to float */
+	f32x4 = vcvtq_f32_s32(s32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_s32((int32_t)*input);
+}
+
+int
+ml_int16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	int16_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (int16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(int16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+	uint32x4_t vmax;
+
+	/* set constants */
+	vmax = vdupq_n_u32(UINT16_MAX);
+
+	/* load 4 float elements */
+	f32x4 = vld1q_f32(input);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	u32x4 = vcvtaq_u32_f32(f32x4);
+
+	/* update range [0:UINT16_MAX] */
+	u32x4 = vminq_u32(u32x4, vmax);
+
+	/* narrow */
+	u16x4 = vmovn_u32(u32x4);
+
+	/* store 4 elements */
+	vst1_u16(output, u16x4);
+}
+
+static void
+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)
+{
+	float32x2_t f32x2;
+	uint16x4_t u16x4;
+	int32x2_t s32x2;
+	int32x2_t vmax;
+
+	/* set constants */
+	vmax = vdup_n_s32(UINT16_MAX);
+
+	/* load element to 2 lanes */
+	f32x2 = vld1_dup_f32(input);
+
+	/* scale */
+	f32x2 = vmul_n_f32(f32x2, scale);
+
+	/* convert using round to nearest with ties to away rounding mode */
+	s32x2 = vcvta_s32_f32(f32x2);
+
+	/* update range [0:UINT16_MAX] */
+	s32x2 = vmin_s32(s32x2, vmax);
+
+	/* convert to uint16x4_t */
+	u16x4 = vreinterpret_u16_s32(s32x2);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_u16(output, u16x4, 0);
+}
+
+int
+ml_float32_to_uint16_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	float *input_buffer;
+	uint16_t *output_buffer;
+	uint64_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float *)input;
+	output_buffer = (uint16_t *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)
+{
+	float32x4_t f32x4;
+	uint16x4_t u16x4;
+	uint32x4_t u32x4;
+
+	/* load 4 x uint16_t elements */
+	u16x4 = vld1_u16(input);
+
+	/* widen uint16_t to uint32_t */
+	u32x4 = vmovl_u16(u16x4);
+
+	/* convert uint32_t to float */
+	f32x4 = vcvtq_f32_u32(u32x4);
+
+	/* scale */
+	f32x4 = vmulq_n_f32(f32x4, scale);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)
+{
+	*output = scale * vcvts_f32_u32((uint32_t)*input);
+}
+
+int
+ml_uint16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output)
+{
+	uint16_t *input_buffer;
+	float *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (uint16_t *)input;
+	output_buffer = (float *)output;
+	vlen = 2 * sizeof(float) / sizeof(uint16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert to float16x4_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store float16x4_t */
+	vst1_f16(output, f16x4);
+}
+
+static void
+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)
+{
+	float32x4_t f32x4;
+	float16x4_t f16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to float16_t */
+	f16x4 = vcvt_f16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_f16(output, f16x4, 0);
+}
+
+int
+ml_float32_to_float16_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	float16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (float16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_float16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_float16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x float16_t elements */
+	f16x4 = vld1_f16(input);
+
+	/* convert float16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)
+{
+	float16x4_t f16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	f16x4 = vld1_dup_f16(input);
+
+	/* convert float16_t to float32_t */
+	f32x4 = vcvt_f32_f16(f16x4);
+
+	/* store 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_float16_to_float32_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float16_t *input_buffer;
+	float32_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(float16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#ifdef __ARM_FEATURE_BF16
+
+static void
+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load 4 x float32_t elements */
+	f32x4 = vld1q_f32(input);
+
+	/* convert float32x4_t to bfloat16x4_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store bfloat16x4_t */
+	vst1_bf16(output, bf16x4);
+}
+
+static void
+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)
+{
+	float32x4_t f32x4;
+	bfloat16x4_t bf16x4;
+
+	/* load element to 4 lanes */
+	f32x4 = vld1q_dup_f32(input);
+
+	/* convert float32_t to bfloat16_t */
+	bf16x4 = vcvt_bf16_f32(f32x4);
+
+	/* store lane 0 / 1 element */
+	vst1_lane_bf16(output, bf16x4, 0);
+}
+
+int
+ml_float32_to_bfloat16_neon(uint64_t nb_elements, void *input, void *output)
+{
+	float32_t *input_buffer;
+	bfloat16_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (float32_t *)input;
+	output_buffer = (bfloat16_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+static void
+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load 4 x bfloat16_t elements */
+	bf16x4 = vld1_bf16(input);
+
+	/* convert bfloat16x4_t to float32x4_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store float32x4_t */
+	vst1q_f32(output, f32x4);
+}
+
+static void
+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)
+{
+	bfloat16x4_t bf16x4;
+	float32x4_t f32x4;
+
+	/* load element to 4 lanes */
+	bf16x4 = vld1_dup_bf16(input);
+
+	/* convert bfloat16_t to float32_t */
+	f32x4 = vcvt_f32_bf16(bf16x4);
+
+	/* store lane 0 / 1 element */
+	vst1q_lane_f32(output, f32x4, 0);
+}
+
+int
+ml_bfloat16_to_float32_neon(uint64_t nb_elements, void *input, void *output)
+{
+	bfloat16_t *input_buffer;
+	float32_t *output_buffer;
+	uint32_t vlen;
+	uint64_t i;
+
+	if ((nb_elements == 0) || (input == NULL) || (output == NULL))
+		return -EINVAL;
+
+	input_buffer = (bfloat16_t *)input;
+	output_buffer = (float32_t *)output;
+	vlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);
+
+	/* convert vlen elements in each iteration */
+	for (i = 0; i < (nb_elements / vlen); i++) {
+		__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);
+		input_buffer += vlen;
+		output_buffer += vlen;
+	}
+
+	/* convert leftover elements */
+	i = i * vlen;
+	for (; i < nb_elements; i++) {
+		__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);
+		input_buffer++;
+		output_buffer++;
+	}
+
+	return 0;
+}
+
+#endif /* __ARM_FEATURE_BF16 */
diff --git a/drivers/common/ml/ml_utils_neon.h b/drivers/common/ml/ml_utils_neon.h
new file mode 100644
index 0000000000..d912049779
--- /dev/null
+++ b/drivers/common/ml/ml_utils_neon.h
@@ -0,0 +1,23 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2022 Marvell.
+ */
+
+#ifndef _ML_UTILS_NEON_H_
+#define _ML_UTILS_NEON_H_
+
+#include <stdint.h>
+
+int ml_float32_to_int8_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint8_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint8_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_int16_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_int16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_uint16_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_uint16_to_float32_neon(float scale, uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_float16_neon(uint64_t nb_elements, void *input, void *output);
+int ml_float16_to_float32_neon(uint64_t nb_elements, void *input, void *output);
+int ml_float32_to_bfloat16_neon(uint64_t nb_elements, void *input, void *output);
+int ml_bfloat16_to_float32_neon(uint64_t nb_elements, void *input, void *output);
+
+#endif /*_ML_UTILS_NEON_H_ */