get:
Show a patch.

patch:
Partially update a patch; only the fields supplied in the request are changed.

put:
Update a patch; all writable fields are expected in the request body.
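
For reference, the read side of this endpoint can be exercised with a few lines of Python. Below is a minimal sketch using the requests library; it assumes patches.dpdk.org is reachable and relies only on the fields visible in the response shown further down (GET needs no authentication):

import requests

BASE_URL = "http://patches.dpdk.org/api"

# Fetch a single patch as JSON; no API token is needed for read access.
response = requests.get(f"{BASE_URL}/patches/122809/")
response.raise_for_status()
patch = response.json()

# A few of the fields shown in the response below.
print(patch["name"])               # "[v5,4/4] mldev: add Arm NEON type conversion routines"
print(patch["state"])              # "superseded"
print(patch["series"][0]["name"])  # "Implementation of ML common code"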

GET /api/patches/122809/?format=api
HTTP 200 OK
Allow: GET, PUT, PATCH, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept

{
    "id": 122809,
    "url": "http://patches.dpdk.org/api/patches/122809/?format=api",
    "web_url": "http://patches.dpdk.org/project/dpdk/patch/20230201091256.12792-5-syalavarthi@marvell.com/",
    "project": {
        "id": 1,
        "url": "http://patches.dpdk.org/api/projects/1/?format=api",
        "name": "DPDK",
        "link_name": "dpdk",
        "list_id": "dev.dpdk.org",
        "list_email": "dev@dpdk.org",
        "web_url": "http://core.dpdk.org",
        "scm_url": "git://dpdk.org/dpdk",
        "webscm_url": "http://git.dpdk.org/dpdk",
        "list_archive_url": "https://inbox.dpdk.org/dev",
        "list_archive_url_format": "https://inbox.dpdk.org/dev/{}",
        "commit_url_format": ""
    },
    "msgid": "<20230201091256.12792-5-syalavarthi@marvell.com>",
    "list_archive_url": "https://inbox.dpdk.org/dev/20230201091256.12792-5-syalavarthi@marvell.com",
    "date": "2023-02-01T09:12:56",
    "name": "[v5,4/4] mldev: add Arm NEON type conversion routines",
    "commit_ref": null,
    "pull_url": null,
    "state": "superseded",
    "archived": true,
    "hash": "f012d77a7a25ca29f9dbb56d65d35814b7e855c2",
    "submitter": {
        "id": 2480,
        "url": "http://patches.dpdk.org/api/people/2480/?format=api",
        "name": "Srikanth Yalavarthi",
        "email": "syalavarthi@marvell.com"
    },
    "delegate": {
        "id": 1,
        "url": "http://patches.dpdk.org/api/users/1/?format=api",
        "username": "tmonjalo",
        "first_name": "Thomas",
        "last_name": "Monjalon",
        "email": "thomas@monjalon.net"
    },
    "mbox": "http://patches.dpdk.org/project/dpdk/patch/20230201091256.12792-5-syalavarthi@marvell.com/mbox/",
    "series": [
        {
            "id": 26731,
            "url": "http://patches.dpdk.org/api/series/26731/?format=api",
            "web_url": "http://patches.dpdk.org/project/dpdk/list/?series=26731",
            "date": "2023-02-01T09:12:52",
            "name": "Implementation of ML common code",
            "version": 5,
            "mbox": "http://patches.dpdk.org/series/26731/mbox/"
        }
    ],
    "comments": "http://patches.dpdk.org/api/patches/122809/comments/",
    "check": "warning",
    "checks": "http://patches.dpdk.org/api/patches/122809/checks/",
    "tags": {},
    "related": [],
    "headers": {
        "Return-Path": "<dev-bounces@dpdk.org>",
        "X-Original-To": "patchwork@inbox.dpdk.org",
        "Delivered-To": "patchwork@inbox.dpdk.org",
        "Received": [
            "from mails.dpdk.org (mails.dpdk.org [217.70.189.124])\n\tby inbox.dpdk.org (Postfix) with ESMTP id 8E5F341B9D;\n\tWed,  1 Feb 2023 10:13:32 +0100 (CET)",
            "from mails.dpdk.org (localhost [127.0.0.1])\n\tby mails.dpdk.org (Postfix) with ESMTP id 0072942D49;\n\tWed,  1 Feb 2023 10:13:08 +0100 (CET)",
            "from mx0b-0016f401.pphosted.com (mx0b-0016f401.pphosted.com\n [67.231.156.173])\n by mails.dpdk.org (Postfix) with ESMTP id B44B942D12\n for <dev@dpdk.org>; Wed,  1 Feb 2023 10:13:05 +0100 (CET)",
            "from pps.filterd (m0045851.ppops.net [127.0.0.1])\n by mx0b-0016f401.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id\n 3116MFAr010334; Wed, 1 Feb 2023 01:13:02 -0800",
            "from dc5-exch02.marvell.com ([199.233.59.182])\n by mx0b-0016f401.pphosted.com (PPS) with ESMTPS id 3nfjrj0qhb-3\n (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT);\n Wed, 01 Feb 2023 01:13:02 -0800",
            "from DC5-EXCH01.marvell.com (10.69.176.38) by DC5-EXCH02.marvell.com\n (10.69.176.39) with Microsoft SMTP Server (TLS) id 15.0.1497.42;\n Wed, 1 Feb 2023 01:13:00 -0800",
            "from maili.marvell.com (10.69.176.80) by DC5-EXCH01.marvell.com\n (10.69.176.38) with Microsoft SMTP Server id 15.0.1497.42 via Frontend\n Transport; Wed, 1 Feb 2023 01:13:00 -0800",
            "from ml-host-33.caveonetworks.com (unknown [10.110.143.233])\n by maili.marvell.com (Postfix) with ESMTP id 2AACB3F704C;\n Wed,  1 Feb 2023 01:13:00 -0800 (PST)"
        ],
        "DKIM-Signature": "v=1; a=rsa-sha256; c=relaxed/relaxed; d=marvell.com;\n h=from : to : cc :\n subject : date : message-id : in-reply-to : references : mime-version :\n content-type; s=pfpt0220; bh=KPwoQ0e/FPVHNZB8nsJz2b7Xj3MqEwHr28eVfdwFC10=;\n b=TfkVpoW299tsMOD6vvQVJ2TSzTUJQ53mlkWASJVjwwftFj3qooFLD9gUDqI3CeSN6Tl0\n e9F2Wmg+LscSa+RZ6Kwnycxe38uN/b0xtdC/MzVVTljaghzaFXIhnipEg9y9UAON+k7I\n gnwksx4CvkdfYJMiuIXgY+X8r0DijUO+1D0wt2zWnmMDxC2q737F8jFrtxvddWZTNFZ6\n GGo5I3q2EpZ7jlSzgVVFSDAVjVZ2C7SPdN7JWcwyxRvnGeWJvCXgzz2oTdppmSgUH42W\n zqKzomYPV+wi1NF5oKPNfezohTlOjLkwh36cBO44xKUvxuO4kt9Oj5IWa68dhQBJPBRP UQ==",
        "From": "Srikanth Yalavarthi <syalavarthi@marvell.com>",
        "To": "Srikanth Yalavarthi <syalavarthi@marvell.com>, Ruifeng Wang\n <ruifeng.wang@arm.com>",
        "CC": "<dev@dpdk.org>, <sshankarnara@marvell.com>, <jerinj@marvell.com>,\n <aprabhu@marvell.com>",
        "Subject": "[PATCH v5 4/4] mldev: add Arm NEON type conversion routines",
        "Date": "Wed, 1 Feb 2023 01:12:56 -0800",
        "Message-ID": "<20230201091256.12792-5-syalavarthi@marvell.com>",
        "X-Mailer": "git-send-email 2.17.1",
        "In-Reply-To": "<20230201091256.12792-1-syalavarthi@marvell.com>",
        "References": "<20221208193532.16718-1-syalavarthi@marvell.com>\n <20230201091256.12792-1-syalavarthi@marvell.com>",
        "MIME-Version": "1.0",
        "Content-Type": "text/plain",
        "X-Proofpoint-ORIG-GUID": "0PmFE741JFUsTbQeDFwLJZWnNFijkFDS",
        "X-Proofpoint-GUID": "0PmFE741JFUsTbQeDFwLJZWnNFijkFDS",
        "X-Proofpoint-Virus-Version": "vendor=baseguard\n engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.122.1\n definitions=2023-02-01_03,2023-01-31_01,2022-06-22_01",
        "X-BeenThere": "dev@dpdk.org",
        "X-Mailman-Version": "2.1.29",
        "Precedence": "list",
        "List-Id": "DPDK patches and discussions <dev.dpdk.org>",
        "List-Unsubscribe": "<https://mails.dpdk.org/options/dev>,\n <mailto:dev-request@dpdk.org?subject=unsubscribe>",
        "List-Archive": "<http://mails.dpdk.org/archives/dev/>",
        "List-Post": "<mailto:dev@dpdk.org>",
        "List-Help": "<mailto:dev-request@dpdk.org?subject=help>",
        "List-Subscribe": "<https://mails.dpdk.org/listinfo/dev>,\n <mailto:dev-request@dpdk.org?subject=subscribe>",
        "Errors-To": "dev-bounces@dpdk.org"
    },
    "content": "Added ARM NEON intrinsic based implementations to support conversion\nof data types. Support is enabled to handle int8, uint8, int16, uint16,\nfloat16, float32 and bfloat16 types.\n\nSigned-off-by: Srikanth Yalavarthi <syalavarthi@marvell.com>\n---\nv5:\n* Moved the code from drivers/common/ml to lib/mldev\n* Added rte_ml_io_ prefix to the functions\n\nv2:\n* Dropped use of driver routines to call neon functions\n* Optimization of neon functions. Reduce the number of intrinsic calls.\n\n lib/mldev/meson.build        |   4 +\n lib/mldev/mldev_utils_neon.c | 873 +++++++++++++++++++++++++++++++++++\n 2 files changed, 877 insertions(+)\n create mode 100644 lib/mldev/mldev_utils_neon.c\n\n--\n2.17.1",
    "diff": "diff --git a/lib/mldev/meson.build b/lib/mldev/meson.build\nindex fce9c0ebee..05694b0839 100644\n--- a/lib/mldev/meson.build\n+++ b/lib/mldev/meson.build\n@@ -8,6 +8,10 @@ sources = files(\n         'mldev_utils_scalar.c',\n )\n\n+if arch_subdir == 'arm'\n+    sources += files('mldev_utils_neon.c')\n+endif\n+\n headers = files(\n         'rte_mldev.h',\n )\ndiff --git a/lib/mldev/mldev_utils_neon.c b/lib/mldev/mldev_utils_neon.c\nnew file mode 100644\nindex 0000000000..32b620db20\n--- /dev/null\n+++ b/lib/mldev/mldev_utils_neon.c\n@@ -0,0 +1,873 @@\n+/* SPDX-License-Identifier: BSD-3-Clause\n+ * Copyright (c) 2022 Marvell.\n+ */\n+\n+#include <errno.h>\n+#include <stdint.h>\n+#include <stdlib.h>\n+\n+#include \"mldev_utils.h\"\n+\n+#include <arm_neon.h>\n+\n+/* Description:\n+ * This file implements vector versions of Machine Learning utility functions used to convert data\n+ * types from higher precision to lower precision and vice-versa. Implementation is based on Arm\n+ * Neon intrinsics.\n+ */\n+\n+static inline void\n+__float32_to_int8_neon_s8x8(float scale, float *input, int8_t *output)\n+{\n+\tint16x4_t s16x4_l;\n+\tint16x4_t s16x4_h;\n+\tfloat32x4_t f32x4;\n+\tint16x8_t s16x8;\n+\tint32x4_t s32x4;\n+\tint8x8_t s8x8;\n+\n+\t/* load 4 float32 elements, scale, convert, saturate narrow to int16.\n+\t * Use round to nearest with ties away rounding mode.\n+\t */\n+\tf32x4 = vld1q_f32(input);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\ts32x4 = vcvtaq_s32_f32(f32x4);\n+\ts16x4_l = vqmovn_s32(s32x4);\n+\n+\t/* load next 4 float32 elements, scale, convert, saturate narrow to int16.\n+\t * Use round to nearest with ties away rounding mode.\n+\t */\n+\tf32x4 = vld1q_f32(input + 4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\ts32x4 = vcvtaq_s32_f32(f32x4);\n+\ts16x4_h = vqmovn_s32(s32x4);\n+\n+\t/* combine lower and higher int16x4_t to int16x8_t */\n+\ts16x8 = vcombine_s16(s16x4_l, s16x4_h);\n+\n+\t/* narrow to int8_t */\n+\ts8x8 = vqmovn_s16(s16x8);\n+\n+\t/* store 8 elements */\n+\tvst1_s8(output, s8x8);\n+}\n+\n+static inline void\n+__float32_to_int8_neon_s8x1(float scale, float *input, int8_t *output)\n+{\n+\tint32_t s32;\n+\tint16_t s16;\n+\n+\t/* scale and convert, round to nearest with ties away rounding mode */\n+\ts32 = vcvtas_s32_f32(scale * (*input));\n+\n+\t/* saturate narrow */\n+\ts16 = vqmovns_s32(s32);\n+\n+\t/* convert to int8_t */\n+\t*output = vqmovnh_s16(s16);\n+}\n+\n+int\n+rte_ml_io_float32_to_int8(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat *input_buffer;\n+\tint8_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float *)input;\n+\toutput_buffer = (int8_t *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(int8_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float32_to_int8_neon_s8x8(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_int8_neon_s8x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__int8_to_float32_neon_f32x8(float scale, int8_t *input, float *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tint16x8_t s16x8;\n+\tint16x4_t 
s16x4;\n+\tint32x4_t s32x4;\n+\tint8x8_t s8x8;\n+\n+\t/* load 8 x int8_t elements */\n+\ts8x8 = vld1_s8(input);\n+\n+\t/* widen int8_t to int16_t */\n+\ts16x8 = vmovl_s8(s8x8);\n+\n+\t/* convert lower 4 elements: widen to int32_t, convert to float, scale and store */\n+\ts16x4 = vget_low_s16(s16x8);\n+\ts32x4 = vmovl_s16(s16x4);\n+\tf32x4 = vcvtq_f32_s32(s32x4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tvst1q_f32(output, f32x4);\n+\n+\t/* convert higher 4 elements: widen to int32_t, convert to float, scale and store */\n+\ts16x4 = vget_high_s16(s16x8);\n+\ts32x4 = vmovl_s16(s16x4);\n+\tf32x4 = vcvtq_f32_s32(s32x4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tvst1q_f32(output + 4, f32x4);\n+}\n+\n+static inline void\n+__int8_to_float32_neon_f32x1(float scale, int8_t *input, float *output)\n+{\n+\t*output = scale * vcvts_f32_s32((int32_t)*input);\n+}\n+\n+int\n+rte_ml_io_int8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tint8_t *input_buffer;\n+\tfloat *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (int8_t *)input;\n+\toutput_buffer = (float *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(int8_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__int8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__int8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__float32_to_uint8_neon_u8x8(float scale, float *input, uint8_t *output)\n+{\n+\tuint16x4_t u16x4_l;\n+\tuint16x4_t u16x4_h;\n+\tfloat32x4_t f32x4;\n+\tuint32x4_t u32x4;\n+\tuint16x8_t u16x8;\n+\tuint8x8_t u8x8;\n+\n+\t/* load 4 float elements, scale, convert, saturate narrow to uint16_t.\n+\t * use round to nearest with ties away rounding mode.\n+\t */\n+\tf32x4 = vld1q_f32(input);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tu32x4 = vcvtaq_u32_f32(f32x4);\n+\tu16x4_l = vqmovn_u32(u32x4);\n+\n+\t/* load next 4 float elements, scale, convert, saturate narrow to uint16_t\n+\t * use round to nearest with ties away rounding mode.\n+\t */\n+\tf32x4 = vld1q_f32(input + 4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tu32x4 = vcvtaq_u32_f32(f32x4);\n+\tu16x4_h = vqmovn_u32(u32x4);\n+\n+\t/* combine lower and higher uint16x4_t */\n+\tu16x8 = vcombine_u16(u16x4_l, u16x4_h);\n+\n+\t/* narrow to uint8x8_t */\n+\tu8x8 = vqmovn_u16(u16x8);\n+\n+\t/* store 8 elements */\n+\tvst1_u8(output, u8x8);\n+}\n+\n+static inline void\n+__float32_to_uint8_neon_u8x1(float scale, float *input, uint8_t *output)\n+{\n+\tuint32_t u32;\n+\tuint16_t u16;\n+\n+\t/* scale and convert, round to nearest with ties away rounding mode */\n+\tu32 = vcvtas_u32_f32(scale * (*input));\n+\n+\t/* saturate narrow */\n+\tu16 = vqmovns_u32(u32);\n+\n+\t/* convert to uint8_t */\n+\t*output = vqmovnh_u16(u16);\n+}\n+\n+int\n+rte_ml_io_float32_to_uint8(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat *input_buffer;\n+\tuint8_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn 
-EINVAL;\n+\n+\tinput_buffer = (float *)input;\n+\toutput_buffer = (uint8_t *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(uint8_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float32_to_uint8_neon_u8x8(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_uint8_neon_u8x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__uint8_to_float32_neon_f32x8(float scale, uint8_t *input, float *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tuint16x8_t u16x8;\n+\tuint16x4_t u16x4;\n+\tuint32x4_t u32x4;\n+\tuint8x8_t u8x8;\n+\n+\t/* load 8 x uint8_t elements */\n+\tu8x8 = vld1_u8(input);\n+\n+\t/* widen uint8_t to uint16_t */\n+\tu16x8 = vmovl_u8(u8x8);\n+\n+\t/* convert lower 4 elements: widen to uint32_t, convert to float, scale and store */\n+\tu16x4 = vget_low_u16(u16x8);\n+\tu32x4 = vmovl_u16(u16x4);\n+\tf32x4 = vcvtq_f32_u32(u32x4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tvst1q_f32(output, f32x4);\n+\n+\t/* convert higher 4 elements: widen to uint32_t, convert to float, scale and store */\n+\tu16x4 = vget_high_u16(u16x8);\n+\tu32x4 = vmovl_u16(u16x4);\n+\tf32x4 = vcvtq_f32_u32(u32x4);\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\tvst1q_f32(output + 4, f32x4);\n+}\n+\n+static inline void\n+__uint8_to_float32_neon_f32x1(float scale, uint8_t *input, float *output)\n+{\n+\t*output = scale * vcvts_f32_u32((uint32_t)*input);\n+}\n+\n+int\n+rte_ml_io_uint8_to_float32(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tuint8_t *input_buffer;\n+\tfloat *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint64_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (uint8_t *)input;\n+\toutput_buffer = (float *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(uint8_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__uint8_to_float32_neon_f32x8(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__uint8_to_float32_neon_f32x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__float32_to_int16_neon_s16x4(float scale, float *input, int16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tint16x4_t s16x4;\n+\tint32x4_t s32x4;\n+\n+\t/* load 4 x float elements */\n+\tf32x4 = vld1q_f32(input);\n+\n+\t/* scale */\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\n+\t/* convert to int32x4_t using round to nearest with ties away rounding mode */\n+\ts32x4 = vcvtaq_s32_f32(f32x4);\n+\n+\t/* saturate narrow to int16x4_t */\n+\ts16x4 = vqmovn_s32(s32x4);\n+\n+\t/* store 4 elements */\n+\tvst1_s16(output, s16x4);\n+}\n+\n+static inline void\n+__float32_to_int16_neon_s16x1(float scale, float *input, int16_t *output)\n+{\n+\tint32_t s32;\n+\n+\t/* scale and convert, round to nearest with ties away rounding mode */\n+\ts32 = vcvtas_s32_f32(scale * (*input));\n+\n+\t/* saturate narrow */\n+\t*output = vqmovns_s32(s32);\n+}\n+\n+int\n+rte_ml_io_float32_to_int16(float scale, uint64_t 
nb_elements, void *input, void *output)\n+{\n+\tfloat *input_buffer;\n+\tint16_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float *)input;\n+\toutput_buffer = (int16_t *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(int16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float32_to_int16_neon_s16x4(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_int16_neon_s16x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__int16_to_float32_neon_f32x4(float scale, int16_t *input, float *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tint16x4_t s16x4;\n+\tint32x4_t s32x4;\n+\n+\t/* load 4 x int16_t elements */\n+\ts16x4 = vld1_s16(input);\n+\n+\t/* widen int16_t to int32_t */\n+\ts32x4 = vmovl_s16(s16x4);\n+\n+\t/* convert int32_t to float */\n+\tf32x4 = vcvtq_f32_s32(s32x4);\n+\n+\t/* scale */\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\n+\t/* store float32x4_t */\n+\tvst1q_f32(output, f32x4);\n+}\n+\n+static inline void\n+__int16_to_float32_neon_f32x1(float scale, int16_t *input, float *output)\n+{\n+\t*output = scale * vcvts_f32_s32((int32_t)*input);\n+}\n+\n+int\n+rte_ml_io_int16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tint16_t *input_buffer;\n+\tfloat *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (int16_t *)input;\n+\toutput_buffer = (float *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(int16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__int16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__int16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__float32_to_uint16_neon_u16x4(float scale, float *input, uint16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tuint16x4_t u16x4;\n+\tuint32x4_t u32x4;\n+\n+\t/* load 4 float elements */\n+\tf32x4 = vld1q_f32(input);\n+\n+\t/* scale */\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\n+\t/* convert using round to nearest with ties to away rounding mode */\n+\tu32x4 = vcvtaq_u32_f32(f32x4);\n+\n+\t/* saturate narrow */\n+\tu16x4 = vqmovn_u32(u32x4);\n+\n+\t/* store 4 elements */\n+\tvst1_u16(output, u16x4);\n+}\n+\n+static inline void\n+__float32_to_uint16_neon_u16x1(float scale, float *input, uint16_t *output)\n+{\n+\tuint32_t u32;\n+\n+\t/* scale and convert, round to nearest with ties away rounding mode */\n+\tu32 = vcvtas_u32_f32(scale * (*input));\n+\n+\t/* saturate narrow */\n+\t*output = vqmovns_u32(u32);\n+}\n+\n+int\n+rte_ml_io_float32_to_uint16(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat *input_buffer;\n+\tuint16_t *output_buffer;\n+\tuint64_t 
nb_iterations;\n+\tuint64_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float *)input;\n+\toutput_buffer = (uint16_t *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(uint16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float32_to_uint16_neon_u16x4(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_uint16_neon_u16x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__uint16_to_float32_neon_f32x4(float scale, uint16_t *input, float *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tuint16x4_t u16x4;\n+\tuint32x4_t u32x4;\n+\n+\t/* load 4 x uint16_t elements */\n+\tu16x4 = vld1_u16(input);\n+\n+\t/* widen uint16_t to uint32_t */\n+\tu32x4 = vmovl_u16(u16x4);\n+\n+\t/* convert uint32_t to float */\n+\tf32x4 = vcvtq_f32_u32(u32x4);\n+\n+\t/* scale */\n+\tf32x4 = vmulq_n_f32(f32x4, scale);\n+\n+\t/* store float32x4_t */\n+\tvst1q_f32(output, f32x4);\n+}\n+\n+static inline void\n+__uint16_to_float32_neon_f32x1(float scale, uint16_t *input, float *output)\n+{\n+\t*output = scale * vcvts_f32_u32((uint32_t)*input);\n+}\n+\n+int\n+rte_ml_io_uint16_to_float32(float scale, uint64_t nb_elements, void *input, void *output)\n+{\n+\tuint16_t *input_buffer;\n+\tfloat *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((scale == 0) || (nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (uint16_t *)input;\n+\toutput_buffer = (float *)output;\n+\tvlen = 2 * sizeof(float) / sizeof(uint16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__uint16_to_float32_neon_f32x4(scale, input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__uint16_to_float32_neon_f32x1(scale, input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__float32_to_float16_neon_f16x4(float32_t *input, float16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tfloat16x4_t f16x4;\n+\n+\t/* load 4 x float32_t elements */\n+\tf32x4 = vld1q_f32(input);\n+\n+\t/* convert to float16x4_t */\n+\tf16x4 = vcvt_f16_f32(f32x4);\n+\n+\t/* store float16x4_t */\n+\tvst1_f16(output, f16x4);\n+}\n+\n+static inline void\n+__float32_to_float16_neon_f16x1(float32_t *input, float16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tfloat16x4_t f16x4;\n+\n+\t/* load element to 4 lanes */\n+\tf32x4 = vld1q_dup_f32(input);\n+\n+\t/* convert float32_t to float16_t */\n+\tf16x4 = vcvt_f16_f32(f32x4);\n+\n+\t/* store lane 0 / 1 element */\n+\tvst1_lane_f16(output, f16x4, 0);\n+}\n+\n+int\n+rte_ml_io_float32_to_float16(uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat32_t *input_buffer;\n+\tfloat16_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float32_t *)input;\n+\toutput_buffer = (float16_t *)output;\n+\tvlen 
= 2 * sizeof(float32_t) / sizeof(float16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float32_to_float16_neon_f16x4(input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_float16_neon_f16x1(input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__float16_to_float32_neon_f32x4(float16_t *input, float32_t *output)\n+{\n+\tfloat16x4_t f16x4;\n+\tfloat32x4_t f32x4;\n+\n+\t/* load 4 x float16_t elements */\n+\tf16x4 = vld1_f16(input);\n+\n+\t/* convert float16x4_t to float32x4_t */\n+\tf32x4 = vcvt_f32_f16(f16x4);\n+\n+\t/* store float32x4_t */\n+\tvst1q_f32(output, f32x4);\n+}\n+\n+static inline void\n+__float16_to_float32_neon_f32x1(float16_t *input, float32_t *output)\n+{\n+\tfloat16x4_t f16x4;\n+\tfloat32x4_t f32x4;\n+\n+\t/* load element to 4 lanes */\n+\tf16x4 = vld1_dup_f16(input);\n+\n+\t/* convert float16_t to float32_t */\n+\tf32x4 = vcvt_f32_f16(f16x4);\n+\n+\t/* store 1 element */\n+\tvst1q_lane_f32(output, f32x4, 0);\n+}\n+\n+int\n+rte_ml_io_float16_to_float32(uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat16_t *input_buffer;\n+\tfloat32_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float16_t *)input;\n+\toutput_buffer = (float32_t *)output;\n+\tvlen = 2 * sizeof(float32_t) / sizeof(float16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__float16_to_float32_neon_f32x4(input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float16_to_float32_neon_f32x1(input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+#ifdef __ARM_FEATURE_BF16\n+\n+static inline void\n+__float32_to_bfloat16_neon_f16x4(float32_t *input, bfloat16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tbfloat16x4_t bf16x4;\n+\n+\t/* load 4 x float32_t elements */\n+\tf32x4 = vld1q_f32(input);\n+\n+\t/* convert float32x4_t to bfloat16x4_t */\n+\tbf16x4 = vcvt_bf16_f32(f32x4);\n+\n+\t/* store bfloat16x4_t */\n+\tvst1_bf16(output, bf16x4);\n+}\n+\n+static inline void\n+__float32_to_bfloat16_neon_f16x1(float32_t *input, bfloat16_t *output)\n+{\n+\tfloat32x4_t f32x4;\n+\tbfloat16x4_t bf16x4;\n+\n+\t/* load element to 4 lanes */\n+\tf32x4 = vld1q_dup_f32(input);\n+\n+\t/* convert float32_t to bfloat16_t */\n+\tbf16x4 = vcvt_bf16_f32(f32x4);\n+\n+\t/* store lane 0 / 1 element */\n+\tvst1_lane_bf16(output, bf16x4, 0);\n+}\n+\n+int\n+rte_ml_io_float32_to_bfloat16(uint64_t nb_elements, void *input, void *output)\n+{\n+\tfloat32_t *input_buffer;\n+\tbfloat16_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (float32_t *)input;\n+\toutput_buffer = (bfloat16_t *)output;\n+\tvlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < 
nb_iterations; i++) {\n+\t\t__float32_to_bfloat16_neon_f16x4(input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__float32_to_bfloat16_neon_f16x1(input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+static inline void\n+__bfloat16_to_float32_neon_f32x4(bfloat16_t *input, float32_t *output)\n+{\n+\tbfloat16x4_t bf16x4;\n+\tfloat32x4_t f32x4;\n+\n+\t/* load 4 x bfloat16_t elements */\n+\tbf16x4 = vld1_bf16(input);\n+\n+\t/* convert bfloat16x4_t to float32x4_t */\n+\tf32x4 = vcvt_f32_bf16(bf16x4);\n+\n+\t/* store float32x4_t */\n+\tvst1q_f32(output, f32x4);\n+}\n+\n+static inline void\n+__bfloat16_to_float32_neon_f32x1(bfloat16_t *input, float32_t *output)\n+{\n+\tbfloat16x4_t bf16x4;\n+\tfloat32x4_t f32x4;\n+\n+\t/* load element to 4 lanes */\n+\tbf16x4 = vld1_dup_bf16(input);\n+\n+\t/* convert bfloat16_t to float32_t */\n+\tf32x4 = vcvt_f32_bf16(bf16x4);\n+\n+\t/* store lane 0 / 1 element */\n+\tvst1q_lane_f32(output, f32x4, 0);\n+}\n+\n+int\n+rte_ml_io_bfloat16_to_float32(uint64_t nb_elements, void *input, void *output)\n+{\n+\tbfloat16_t *input_buffer;\n+\tfloat32_t *output_buffer;\n+\tuint64_t nb_iterations;\n+\tuint32_t vlen;\n+\tuint64_t i;\n+\n+\tif ((nb_elements == 0) || (input == NULL) || (output == NULL))\n+\t\treturn -EINVAL;\n+\n+\tinput_buffer = (bfloat16_t *)input;\n+\toutput_buffer = (float32_t *)output;\n+\tvlen = 2 * sizeof(float32_t) / sizeof(bfloat16_t);\n+\tnb_iterations = nb_elements / vlen;\n+\n+\t/* convert vlen elements in each iteration */\n+\tfor (i = 0; i < nb_iterations; i++) {\n+\t\t__bfloat16_to_float32_neon_f32x4(input_buffer, output_buffer);\n+\t\tinput_buffer += vlen;\n+\t\toutput_buffer += vlen;\n+\t}\n+\n+\t/* convert leftover elements */\n+\ti = i * vlen;\n+\tfor (; i < nb_elements; i++) {\n+\t\t__bfloat16_to_float32_neon_f32x1(input_buffer, output_buffer);\n+\t\tinput_buffer++;\n+\t\toutput_buffer++;\n+\t}\n+\n+\treturn 0;\n+}\n+\n+#endif /* __ARM_FEATURE_BF16 */\n",
    "prefixes": [
        "v5",
        "4/4"
    ]
}
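
Per the Allow header above, the same URL also accepts PUT and PATCH. A hedged sketch of a partial update with PATCH follows; the token-based Authorization header and the assumption that fields such as "state" and "archived" are writable for the authenticated user reflect typical Patchwork deployments, and the token value is a placeholder, not a real credential:

import requests

BASE_URL = "http://patches.dpdk.org/api"
API_TOKEN = "0123456789abcdef"  # placeholder; a real maintainer API token is required

# Partial update via PATCH: only the supplied fields are modified.
response = requests.patch(
    f"{BASE_URL}/patches/122809/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={"state": "accepted", "archived": False},
)
response.raise_for_status()
print(response.json()["state"])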
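
Separately, the diff embedded above adds NEON paths that scale, round to nearest with ties away from zero, and saturate when converting between float32 and the narrower integer types. As a rough, non-normative illustration of those semantics (not the patch's code, which uses Arm intrinsics such as vcvtaq_s32_f32 and vqmovn_s32), here is a NumPy sketch of the float32 to int8 direction:

import numpy as np

def float32_to_int8(scale, values):
    """Scale, round half away from zero, and saturate to int8.

    Approximates what the NEON helpers in mldev_utils_neon.c do with
    vmulq_n_f32 + vcvtaq_s32_f32 + vqmovn_s32/vqmovn_s16.
    """
    scaled = np.asarray(values, dtype=np.float32) * scale
    # Round to nearest, ties away from zero (np.rint alone would round ties to even).
    rounded = np.where(scaled >= 0.0, np.floor(scaled + 0.5), np.ceil(scaled - 0.5))
    # Saturate to the int8 range instead of wrapping.
    return np.clip(rounded, -128, 127).astype(np.int8)

print(float32_to_int8(127.0, [-1.5, -0.004, 0.0, 0.004, 1.5]))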