get:
Show a patch.

patch:
Update a patch (partial update of only the supplied fields).

put:
Update a patch (full update).
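
The same detail endpoint can be queried programmatically. The following is a minimal sketch, assuming Python with the requests package and anonymous read access (writes via PUT/PATCH require authentication); it fetches this patch and prints a few of the fields that appear in the raw HTTP exchange shown below.

    # Minimal sketch: fetch patch 2341 from the DPDK Patchwork REST API.
    # Assumes the third-party "requests" package; GET needs no credentials here.
    import requests

    url = "https://patches.dpdk.org/api/patches/2341/"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    patch = resp.json()
    print(patch["name"])                 # patch subject
    print(patch["state"])                # e.g. "superseded"
    print(patch["submitter"]["name"])    # e.g. "Zhihong Wang"
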

GET /api/patches/2341/?format=api
HTTP 200 OK
Allow: GET, PUT, PATCH, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept

{
    "id": 2341,
    "url": "https://patches.dpdk.org/api/patches/2341/?format=api",
    "web_url": "https://patches.dpdk.org/project/dpdk/patch/1421632414-10027-5-git-send-email-zhihong.wang@intel.com/",
    "project": {
        "id": 1,
        "url": "https://patches.dpdk.org/api/projects/1/?format=api",
        "name": "DPDK",
        "link_name": "dpdk",
        "list_id": "dev.dpdk.org",
        "list_email": "dev@dpdk.org",
        "web_url": "http://core.dpdk.org",
        "scm_url": "git://dpdk.org/dpdk",
        "webscm_url": "http://git.dpdk.org/dpdk",
        "list_archive_url": "https://inbox.dpdk.org/dev",
        "list_archive_url_format": "https://inbox.dpdk.org/dev/{}",
        "commit_url_format": ""
    },
    "msgid": "<1421632414-10027-5-git-send-email-zhihong.wang@intel.com>",
    "list_archive_url": "https://inbox.dpdk.org/dev/1421632414-10027-5-git-send-email-zhihong.wang@intel.com",
    "date": "2015-01-19T01:53:34",
    "name": "[dpdk-dev,4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms",
    "commit_ref": null,
    "pull_url": null,
    "state": "superseded",
    "archived": true,
    "hash": "a6170afed0e07025e23aaa05ae508de973d3efe4",
    "submitter": {
        "id": 156,
        "url": "https://patches.dpdk.org/api/people/156/?format=api",
        "name": "Zhihong Wang",
        "email": "zhihong.wang@intel.com"
    },
    "delegate": null,
    "mbox": "https://patches.dpdk.org/project/dpdk/patch/1421632414-10027-5-git-send-email-zhihong.wang@intel.com/mbox/",
    "series": [],
    "comments": "https://patches.dpdk.org/api/patches/2341/comments/",
    "check": "pending",
    "checks": "https://patches.dpdk.org/api/patches/2341/checks/",
    "tags": {},
    "related": [],
    "headers": {
        "Return-Path": "<dev-bounces@dpdk.org>",
        "X-Original-To": "patchwork@dpdk.org",
        "Delivered-To": "patchwork@dpdk.org",
        "Received": [
            "from [92.243.14.124] (localhost [IPv6:::1])\n\tby dpdk.org (Postfix) with ESMTP id A99595A8E;\n\tMon, 19 Jan 2015 02:54:08 +0100 (CET)",
            "from mga11.intel.com (mga11.intel.com [192.55.52.93])\n\tby dpdk.org (Postfix) with ESMTP id 8AC0B1288\n\tfor <dev@dpdk.org>; Mon, 19 Jan 2015 02:54:05 +0100 (CET)",
            "from orsmga001.jf.intel.com ([10.7.209.18])\n\tby fmsmga102.fm.intel.com with ESMTP; 18 Jan 2015 17:53:52 -0800",
            "from shvmail01.sh.intel.com ([10.239.29.42])\n\tby orsmga001.jf.intel.com with ESMTP; 18 Jan 2015 17:53:51 -0800",
            "from shecgisg004.sh.intel.com (shecgisg004.sh.intel.com\n\t[10.239.29.89])\n\tby shvmail01.sh.intel.com with ESMTP id t0J1rmos001536;\n\tMon, 19 Jan 2015 09:53:48 +0800",
            "from shecgisg004.sh.intel.com (localhost [127.0.0.1])\n\tby shecgisg004.sh.intel.com (8.13.6/8.13.6/SuSE Linux 0.8) with ESMTP\n\tid t0J1rk8R010090; Mon, 19 Jan 2015 09:53:48 +0800",
            "(from zwang84@localhost)\n\tby shecgisg004.sh.intel.com (8.13.6/8.13.6/Submit) id t0J1rkKV010086; \n\tMon, 19 Jan 2015 09:53:46 +0800"
        ],
        "X-ExtLoop1": "1",
        "X-IronPort-AV": "E=Sophos;i=\"5.09,422,1418112000\"; d=\"scan'208\";a=\"639125809\"",
        "From": "zhihong.wang@intel.com",
        "To": "dev@dpdk.org",
        "Date": "Mon, 19 Jan 2015 09:53:34 +0800",
        "Message-Id": "<1421632414-10027-5-git-send-email-zhihong.wang@intel.com>",
        "X-Mailer": "git-send-email 1.7.4.1",
        "In-Reply-To": "<1421632414-10027-1-git-send-email-zhihong.wang@intel.com>",
        "References": "<1421632414-10027-1-git-send-email-zhihong.wang@intel.com>",
        "Subject": "[dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in\n\tarch/x86/rte_memcpy.h for both SSE and AVX platforms",
        "X-BeenThere": "dev@dpdk.org",
        "X-Mailman-Version": "2.1.15",
        "Precedence": "list",
        "List-Id": "patches and discussions about DPDK <dev.dpdk.org>",
        "List-Unsubscribe": "<http://dpdk.org/ml/options/dev>,\n\t<mailto:dev-request@dpdk.org?subject=unsubscribe>",
        "List-Archive": "<http://dpdk.org/ml/archives/dev/>",
        "List-Post": "<mailto:dev@dpdk.org>",
        "List-Help": "<mailto:dev-request@dpdk.org?subject=help>",
        "List-Subscribe": "<http://dpdk.org/ml/listinfo/dev>,\n\t<mailto:dev-request@dpdk.org?subject=subscribe>",
        "Errors-To": "dev-bounces@dpdk.org",
        "Sender": "\"dev\" <dev-bounces@dpdk.org>"
    },
    "content": "Main code changes:\n\n1. Differentiate architectural features based on CPU flags\n\n    a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth\n\n    b. Implement separated copy flow specifically optimized for target architecture\n\n2. Rewrite the memcpy function \"rte_memcpy\"\n\n    a. Add store aligning\n\n    b. Add load aligning based on architectural features\n\n    c. Put block copy loop into inline move functions for better control of instruction order\n\n    d. Eliminate unnecessary MOVs\n\n3. Rewrite the inline move functions\n\n    a. Add move functions for unaligned load cases\n\n    b. Change instruction order in copy loops for better pipeline utilization\n\n    c. Use intrinsics instead of assembly code\n\n4. Remove slow glibc call for constant copies\n\nSigned-off-by: Zhihong Wang <zhihong.wang@intel.com>\n---\n .../common/include/arch/x86/rte_memcpy.h           | 664 +++++++++++++++------\n 1 file changed, 493 insertions(+), 171 deletions(-)",
    "diff": "diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h\nindex fb9eba8..69a5c6f 100644\n--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h\n+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h\n@@ -34,166 +34,189 @@\n #ifndef _RTE_MEMCPY_X86_64_H_\n #define _RTE_MEMCPY_X86_64_H_\n \n+/**\n+ * @file\n+ *\n+ * Functions for SSE/AVX/AVX2 implementation of memcpy().\n+ */\n+\n+#include <stdio.h>\n #include <stdint.h>\n #include <string.h>\n-#include <emmintrin.h>\n+#include <x86intrin.h>\n \n #ifdef __cplusplus\n extern \"C\" {\n #endif\n \n-#include \"generic/rte_memcpy.h\"\n+/**\n+ * Copy bytes from one location to another. The locations must not overlap.\n+ *\n+ * @note This is implemented as a macro, so it's address should not be taken\n+ * and care is needed as parameter expressions may be evaluated multiple times.\n+ *\n+ * @param dst\n+ *   Pointer to the destination of the data.\n+ * @param src\n+ *   Pointer to the source data.\n+ * @param n\n+ *   Number of bytes to copy.\n+ * @return\n+ *   Pointer to the destination data.\n+ */\n+static inline void *\n+rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));\n \n-#ifdef __INTEL_COMPILER\n-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */\n-#endif\n+#ifdef RTE_MACHINE_CPUFLAG_AVX2\n \n+/**\n+ * AVX2 implementation below\n+ */\n+\n+/**\n+ * Copy 16 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n static inline void\n rte_mov16(uint8_t *dst, const uint8_t *src)\n {\n-\t__m128i reg_a;\n-\tasm volatile (\n-\t\t\"movdqu (%[src]), %[reg_a]\\n\\t\"\n-\t\t\"movdqu %[reg_a], (%[dst])\\n\\t\"\n-\t\t: [reg_a] \"=x\" (reg_a)\n-\t\t: [src] \"r\" (src),\n-\t\t  [dst] \"r\"(dst)\n-\t\t: \"memory\"\n-\t);\n+\t__m128i xmm0;\n+\n+\txmm0 = _mm_loadu_si128((const __m128i *)src);\n+\t_mm_storeu_si128((__m128i *)dst, xmm0);\n }\n \n+/**\n+ * Copy 32 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n static inline void\n rte_mov32(uint8_t *dst, const uint8_t *src)\n {\n-\t__m128i reg_a, reg_b;\n-\tasm volatile (\n-\t\t\"movdqu (%[src]), %[reg_a]\\n\\t\"\n-\t\t\"movdqu 16(%[src]), %[reg_b]\\n\\t\"\n-\t\t\"movdqu %[reg_a], (%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_b], 16(%[dst])\\n\\t\"\n-\t\t: [reg_a] \"=x\" (reg_a),\n-\t\t  [reg_b] \"=x\" (reg_b)\n-\t\t: [src] \"r\" (src),\n-\t\t  [dst] \"r\"(dst)\n-\t\t: \"memory\"\n-\t);\n-}\n+\t__m256i ymm0;\n \n-static inline void\n-rte_mov48(uint8_t *dst, const uint8_t *src)\n-{\n-\t__m128i reg_a, reg_b, reg_c;\n-\tasm volatile (\n-\t\t\"movdqu (%[src]), %[reg_a]\\n\\t\"\n-\t\t\"movdqu 16(%[src]), %[reg_b]\\n\\t\"\n-\t\t\"movdqu 32(%[src]), %[reg_c]\\n\\t\"\n-\t\t\"movdqu %[reg_a], (%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_b], 16(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_c], 32(%[dst])\\n\\t\"\n-\t\t: [reg_a] \"=x\" (reg_a),\n-\t\t  [reg_b] \"=x\" (reg_b),\n-\t\t  [reg_c] \"=x\" (reg_c)\n-\t\t: [src] \"r\" (src),\n-\t\t  [dst] \"r\"(dst)\n-\t\t: \"memory\"\n-\t);\n+\tymm0 = _mm256_loadu_si256((const __m256i *)src);\n+\t_mm256_storeu_si256((__m256i *)dst, ymm0);\n }\n \n+/**\n+ * Copy 64 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n static inline void\n rte_mov64(uint8_t *dst, const uint8_t *src)\n {\n-\t__m128i reg_a, reg_b, reg_c, reg_d;\n-\tasm volatile (\n-\t\t\"movdqu (%[src]), %[reg_a]\\n\\t\"\n-\t\t\"movdqu 16(%[src]), %[reg_b]\\n\\t\"\n-\t\t\"movdqu 32(%[src]), %[reg_c]\\n\\t\"\n-\t\t\"movdqu 
48(%[src]), %[reg_d]\\n\\t\"\n-\t\t\"movdqu %[reg_a], (%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_b], 16(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_c], 32(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_d], 48(%[dst])\\n\\t\"\n-\t\t: [reg_a] \"=x\" (reg_a),\n-\t\t  [reg_b] \"=x\" (reg_b),\n-\t\t  [reg_c] \"=x\" (reg_c),\n-\t\t  [reg_d] \"=x\" (reg_d)\n-\t\t: [src] \"r\" (src),\n-\t\t  [dst] \"r\"(dst)\n-\t\t: \"memory\"\n-\t);\n+\trte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);\n+\trte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);\n }\n \n+/**\n+ * Copy 128 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n static inline void\n rte_mov128(uint8_t *dst, const uint8_t *src)\n {\n-\t__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;\n-\tasm volatile (\n-\t\t\"movdqu (%[src]), %[reg_a]\\n\\t\"\n-\t\t\"movdqu 16(%[src]), %[reg_b]\\n\\t\"\n-\t\t\"movdqu 32(%[src]), %[reg_c]\\n\\t\"\n-\t\t\"movdqu 48(%[src]), %[reg_d]\\n\\t\"\n-\t\t\"movdqu 64(%[src]), %[reg_e]\\n\\t\"\n-\t\t\"movdqu 80(%[src]), %[reg_f]\\n\\t\"\n-\t\t\"movdqu 96(%[src]), %[reg_g]\\n\\t\"\n-\t\t\"movdqu 112(%[src]), %[reg_h]\\n\\t\"\n-\t\t\"movdqu %[reg_a], (%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_b], 16(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_c], 32(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_d], 48(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_e], 64(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_f], 80(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_g], 96(%[dst])\\n\\t\"\n-\t\t\"movdqu %[reg_h], 112(%[dst])\\n\\t\"\n-\t\t: [reg_a] \"=x\" (reg_a),\n-\t\t  [reg_b] \"=x\" (reg_b),\n-\t\t  [reg_c] \"=x\" (reg_c),\n-\t\t  [reg_d] \"=x\" (reg_d),\n-\t\t  [reg_e] \"=x\" (reg_e),\n-\t\t  [reg_f] \"=x\" (reg_f),\n-\t\t  [reg_g] \"=x\" (reg_g),\n-\t\t  [reg_h] \"=x\" (reg_h)\n-\t\t: [src] \"r\" (src),\n-\t\t  [dst] \"r\"(dst)\n-\t\t: \"memory\"\n-\t);\n+\trte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);\n+\trte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);\n+\trte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);\n+\trte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);\n }\n \n-#ifdef __INTEL_COMPILER\n-#pragma warning(enable:593)\n-#endif\n-\n+/**\n+ * Copy 256 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n static inline void\n rte_mov256(uint8_t *dst, const uint8_t *src)\n {\n-\trte_mov128(dst, src);\n-\trte_mov128(dst + 128, src + 128);\n+\trte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);\n+\trte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);\n+\trte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);\n+\trte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);\n+\trte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);\n+\trte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);\n+\trte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);\n+\trte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);\n }\n \n-#define rte_memcpy(dst, src, n)              \\\n-\t({ (__builtin_constant_p(n)) ?       
\\\n-\tmemcpy((dst), (src), (n)) :          \\\n-\trte_memcpy_func((dst), (src), (n)); })\n+/**\n+ * Copy 64-byte blocks from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)\n+{\n+\t__m256i ymm0, ymm1;\n+\n+\twhile (n >= 64) {\n+\t\tymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));\n+\t\tn -= 64;\n+\t\tymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));\n+\t\tsrc = (const uint8_t *)src + 64;\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);\n+\t\tdst = (uint8_t *)dst + 64;\n+\t}\n+}\n+\n+/**\n+ * Copy 256-byte blocks from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)\n+{\n+\t__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;\n+\n+\twhile (n >= 256) {\n+\t\tymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));\n+\t\tn -= 256;\n+\t\tymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));\n+\t\tymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));\n+\t\tymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));\n+\t\tymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));\n+\t\tymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));\n+\t\tymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));\n+\t\tymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));\n+\t\tsrc = (const uint8_t *)src + 256;\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);\n+\t\t_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);\n+\t\tdst = (uint8_t *)dst + 256;\n+\t}\n+}\n \n static inline void *\n-rte_memcpy_func(void *dst, const void *src, size_t n)\n+rte_memcpy(void *dst, const void *src, size_t n)\n {\n \tvoid *ret = dst;\n+\tint dstofss;\n+\tint bits;\n \n-\t/* We can't copy < 16 bytes using XMM registers so do it manually. 
*/\n+\t/**\n+\t * Copy less than 16 bytes\n+\t */\n \tif (n < 16) {\n \t\tif (n & 0x01) {\n \t\t\t*(uint8_t *)dst = *(const uint8_t *)src;\n-\t\t\tdst = (uint8_t *)dst + 1;\n \t\t\tsrc = (const uint8_t *)src + 1;\n+\t\t\tdst = (uint8_t *)dst + 1;\n \t\t}\n \t\tif (n & 0x02) {\n \t\t\t*(uint16_t *)dst = *(const uint16_t *)src;\n-\t\t\tdst = (uint16_t *)dst + 1;\n \t\t\tsrc = (const uint16_t *)src + 1;\n+\t\t\tdst = (uint16_t *)dst + 1;\n \t\t}\n \t\tif (n & 0x04) {\n \t\t\t*(uint32_t *)dst = *(const uint32_t *)src;\n-\t\t\tdst = (uint32_t *)dst + 1;\n \t\t\tsrc = (const uint32_t *)src + 1;\n+\t\t\tdst = (uint32_t *)dst + 1;\n \t\t}\n \t\tif (n & 0x08) {\n \t\t\t*(uint64_t *)dst = *(const uint64_t *)src;\n@@ -201,95 +224,394 @@ rte_memcpy_func(void *dst, const void *src, size_t n)\n \t\treturn ret;\n \t}\n \n-\t/* Special fast cases for <= 128 bytes */\n+\t/**\n+\t * Fast way when copy size doesn't exceed 512 bytes\n+\t */\n \tif (n <= 32) {\n \t\trte_mov16((uint8_t *)dst, (const uint8_t *)src);\n \t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n \t\treturn ret;\n \t}\n-\n \tif (n <= 64) {\n \t\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n \t\trte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);\n \t\treturn ret;\n \t}\n-\n-\tif (n <= 128) {\n-\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n-\t\trte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);\n+\tif (n <= 512) {\n+\t\tif (n >= 256) {\n+\t\t\tn -= 256;\n+\t\t\trte_mov256((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 256;\n+\t\t\tdst = (uint8_t *)dst + 256;\n+\t\t}\n+\t\tif (n >= 128) {\n+\t\t\tn -= 128;\n+\t\t\trte_mov128((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 128;\n+\t\t\tdst = (uint8_t *)dst + 128;\n+\t\t}\n+\t\tif (n >= 64) {\n+\t\t\tn -= 64;\n+\t\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 64;\n+\t\t\tdst = (uint8_t *)dst + 64;\n+\t\t}\n+COPY_BLOCK_64_BACK31:\n+\t\tif (n > 32) {\n+\t\t\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\trte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);\n+\t\t\treturn ret;\n+\t\t}\n+\t\tif (n > 0) {\n+\t\t\trte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);\n+\t\t}\n \t\treturn ret;\n \t}\n \n-\t/*\n-\t * For large copies > 128 bytes. 
This combination of 256, 64 and 16 byte\n-\t * copies was found to be faster than doing 128 and 32 byte copies as\n-\t * well.\n+\t/**\n+\t * Make store aligned when copy size exceeds 512 bytes\n \t */\n-\tfor ( ; n >= 256; n -= 256) {\n-\t\trte_mov256((uint8_t *)dst, (const uint8_t *)src);\n-\t\tdst = (uint8_t *)dst + 256;\n-\t\tsrc = (const uint8_t *)src + 256;\n+\tdstofss = 32 - (int)((long long)(void *)dst & 0x1F);\n+\tn -= dstofss;\n+\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\tsrc = (const uint8_t *)src + dstofss;\n+\tdst = (uint8_t *)dst + dstofss;\n+\n+\t/**\n+\t * Copy 256-byte blocks.\n+\t * Use copy block function for better instruction order control,\n+\t * which is important when load is unaligned.\n+\t */\n+\trte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);\n+\tbits = n;\n+\tn = n & 255;\n+\tbits -= n;\n+\tsrc = (const uint8_t *)src + bits;\n+\tdst = (uint8_t *)dst + bits;\n+\n+\t/**\n+\t * Copy 64-byte blocks.\n+\t * Use copy block function for better instruction order control,\n+\t * which is important when load is unaligned.\n+\t */\n+\tif (n >= 64) {\n+\t\trte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);\n+\t\tbits = n;\n+\t\tn = n & 63;\n+\t\tbits -= n;\n+\t\tsrc = (const uint8_t *)src + bits;\n+\t\tdst = (uint8_t *)dst + bits;\n \t}\n \n-\t/*\n-\t * We split the remaining bytes (which will be less than 256) into\n-\t * 64byte (2^6) chunks.\n-\t * Using incrementing integers in the case labels of a switch statement\n-\t * enourages the compiler to use a jump table. To get incrementing\n-\t * integers, we shift the 2 relevant bits to the LSB position to first\n-\t * get decrementing integers, and then subtract.\n+\t/**\n+\t * Copy whatever left\n \t */\n-\tswitch (3 - (n >> 6)) {\n-\tcase 0x00:\n-\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 64;\n-\t\tdst = (uint8_t *)dst + 64;\n-\t\tsrc = (const uint8_t *)src + 64;      /* fallthrough */\n-\tcase 0x01:\n-\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 64;\n-\t\tdst = (uint8_t *)dst + 64;\n-\t\tsrc = (const uint8_t *)src + 64;      /* fallthrough */\n-\tcase 0x02:\n-\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 64;\n-\t\tdst = (uint8_t *)dst + 64;\n-\t\tsrc = (const uint8_t *)src + 64;      /* fallthrough */\n-\tdefault:\n-\t\t;\n+\tgoto COPY_BLOCK_64_BACK31;\n+}\n+\n+#else /* RTE_MACHINE_CPUFLAG_AVX2 */\n+\n+/**\n+ * SSE & AVX implementation below\n+ */\n+\n+/**\n+ * Copy 16 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov16(uint8_t *dst, const uint8_t *src)\n+{\n+\t__m128i xmm0;\n+\n+\txmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);\n+\t_mm_storeu_si128((__m128i *)dst, xmm0);\n+}\n+\n+/**\n+ * Copy 32 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov32(uint8_t *dst, const uint8_t *src)\n+{\n+\trte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);\n+\trte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);\n+}\n+\n+/**\n+ * Copy 64 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov64(uint8_t *dst, const uint8_t *src)\n+{\n+\trte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);\n+\trte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);\n+\trte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);\n+\trte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);\n+}\n+\n+/**\n+ * Copy 128 bytes 
from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov128(uint8_t *dst, const uint8_t *src)\n+{\n+\trte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);\n+\trte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);\n+\trte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);\n+\trte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);\n+\trte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);\n+\trte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);\n+\trte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);\n+\trte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);\n+}\n+\n+/**\n+ * Copy 256 bytes from one location to another,\n+ * locations should not overlap.\n+ */\n+static inline void\n+rte_mov256(uint8_t *dst, const uint8_t *src)\n+{\n+\trte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);\n+\trte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);\n+\trte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);\n+\trte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);\n+\trte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);\n+\trte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);\n+\trte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);\n+\trte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);\n+\trte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);\n+\trte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);\n+\trte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);\n+\trte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);\n+\trte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);\n+\trte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);\n+\trte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);\n+\trte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);\n+}\n+\n+/**\n+ * Macro for copying unaligned block from one location to another,\n+ * 47 bytes leftover maximum,\n+ * locations should not overlap.\n+ * Requirements:\n+ * - Store is aligned\n+ * - Load offset is <offset>, which must be immediate value within [1, 15]\n+ * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading\n+ * - <dst>, <src>, <len> must be variables\n+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined\n+ */\n+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                                                         \\\n+{                                                                                                           \\\n+\tint tmp;                                                                                                \\\n+\twhile (len >= 128 + 16 - offset) {                                                                      \\\n+\t\txmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \\\n+\t\tlen -= 128;                                                                                         \\\n+\t\txmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \\\n+\t\txmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \\\n+\t\txmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \\\n+\t\txmm4 = _mm_loadu_si128((const __m128i 
*)((const uint8_t *)src - offset + 4 * 16));                  \\\n+\t\txmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \\\n+\t\txmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \\\n+\t\txmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \\\n+\t\txmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \\\n+\t\tsrc = (const uint8_t *)src + 128;                                                                   \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \\\n+\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \\\n+\t\tdst = (uint8_t *)dst + 128;                                                                         \\\n+\t}                                                                                                       \\\n+\ttmp = len;                                                                                              \\\n+\tlen = ((len - 16 + offset) & 127) + 16 - offset;                                                        \\\n+\ttmp -= len;                                                                                             \\\n+\tsrc = (const uint8_t *)src + tmp;                                                                       \\\n+\tdst = (uint8_t *)dst + tmp;                                                                             \\\n+\tif (len >= 32 + 16 - offset) {                                                                          \\\n+\t\twhile (len >= 32 + 16 - offset) {                                                                   \\\n+\t\t\txmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \\\n+\t\t\tlen -= 32;                                                                                      \\\n+\t\t\txmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \\\n+\t\t\txmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \\\n+\t\t\tsrc = (const uint8_t *)src + 32;                                                                \\\n+\t\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \\\n+\t\t\t_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \\\n+\t\t\tdst = (uint8_t *)dst + 32;                                                                      \\\n+\t\t}                                                                                                   \\\n+\t\ttmp = len;                                                                                          
\\\n+\t\tlen = ((len - 16 + offset) & 31) + 16 - offset;                                                     \\\n+\t\ttmp -= len;                                                                                         \\\n+\t\tsrc = (const uint8_t *)src + tmp;                                                                   \\\n+\t\tdst = (uint8_t *)dst + tmp;                                                                         \\\n+\t}                                                                                                       \\\n+}\n+\n+static inline void *\n+rte_memcpy(void *dst, const void *src, size_t n)\n+{\n+\t__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;\n+\tvoid *ret = dst;\n+\tint dstofss;\n+\tint srcofs;\n+\n+\t/**\n+\t * Copy less than 16 bytes\n+\t */\n+\tif (n < 16) {\n+\t\tif (n & 0x01) {\n+\t\t\t*(uint8_t *)dst = *(const uint8_t *)src;\n+\t\t\tsrc = (const uint8_t *)src + 1;\n+\t\t\tdst = (uint8_t *)dst + 1;\n+\t\t}\n+\t\tif (n & 0x02) {\n+\t\t\t*(uint16_t *)dst = *(const uint16_t *)src;\n+\t\t\tsrc = (const uint16_t *)src + 1;\n+\t\t\tdst = (uint16_t *)dst + 1;\n+\t\t}\n+\t\tif (n & 0x04) {\n+\t\t\t*(uint32_t *)dst = *(const uint32_t *)src;\n+\t\t\tsrc = (const uint32_t *)src + 1;\n+\t\t\tdst = (uint32_t *)dst + 1;\n+\t\t}\n+\t\tif (n & 0x08) {\n+\t\t\t*(uint64_t *)dst = *(const uint64_t *)src;\n+\t\t}\n+\t\treturn ret;\n \t}\n \n-\t/*\n-\t * We split the remaining bytes (which will be less than 64) into\n-\t * 16byte (2^4) chunks, using the same switch structure as above.\n+\t/**\n+\t * Fast way when copy size doesn't exceed 512 bytes\n \t */\n-\tswitch (3 - (n >> 4)) {\n-\tcase 0x00:\n-\t\trte_mov16((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 16;\n-\t\tdst = (uint8_t *)dst + 16;\n-\t\tsrc = (const uint8_t *)src + 16;      /* fallthrough */\n-\tcase 0x01:\n-\t\trte_mov16((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 16;\n-\t\tdst = (uint8_t *)dst + 16;\n-\t\tsrc = (const uint8_t *)src + 16;      /* fallthrough */\n-\tcase 0x02:\n+\tif (n <= 32) {\n \t\trte_mov16((uint8_t *)dst, (const uint8_t *)src);\n-\t\tn -= 16;\n-\t\tdst = (uint8_t *)dst + 16;\n-\t\tsrc = (const uint8_t *)src + 16;      /* fallthrough */\n-\tdefault:\n-\t\t;\n+\t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n+\t\treturn ret;\n \t}\n-\n-\t/* Copy any remaining bytes, without going beyond end of buffers */\n-\tif (n != 0) {\n+\tif (n <= 48) {\n+\t\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n+\t\treturn ret;\n+\t}\n+\tif (n <= 64) {\n+\t\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\t\trte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);\n \t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n+\t\treturn ret;\n \t}\n-\treturn ret;\n+\tif (n <= 128) {\n+\t\tgoto COPY_BLOCK_128_BACK15;\n+\t}\n+\tif (n <= 512) {\n+\t\tif (n >= 256) {\n+\t\t\tn -= 256;\n+\t\t\trte_mov128((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\trte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);\n+\t\t\tsrc = (const uint8_t *)src + 256;\n+\t\t\tdst = (uint8_t *)dst + 256;\n+\t\t}\n+COPY_BLOCK_255_BACK15:\n+\t\tif (n >= 128) {\n+\t\t\tn -= 128;\n+\t\t\trte_mov128((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 128;\n+\t\t\tdst = (uint8_t *)dst + 128;\n+\t\t}\n+COPY_BLOCK_128_BACK15:\n+\t\tif (n >= 64) {\n+\t\t\tn -= 64;\n+\t\t\trte_mov64((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 64;\n+\t\t\tdst 
= (uint8_t *)dst + 64;\n+\t\t}\n+COPY_BLOCK_64_BACK15:\n+\t\tif (n >= 32) {\n+\t\t\tn -= 32;\n+\t\t\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tsrc = (const uint8_t *)src + 32;\n+\t\t\tdst = (uint8_t *)dst + 32;\n+\t\t}\n+\t\tif (n > 16) {\n+\t\t\trte_mov16((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n+\t\t\treturn ret;\n+\t\t}\n+\t\tif (n > 0) {\n+\t\t\trte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);\n+\t\t}\n+\t\treturn ret;\n+\t}\n+\n+\t/**\n+\t * Make store aligned when copy size exceeds 512 bytes,\n+\t * and make sure the first 15 bytes are copied, because\n+\t * unaligned copy functions require up to 15 bytes\n+\t * backwards access.\n+\t */\n+\tdstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;\n+\tn -= dstofss;\n+\trte_mov32((uint8_t *)dst, (const uint8_t *)src);\n+\tsrc = (const uint8_t *)src + dstofss;\n+\tdst = (uint8_t *)dst + dstofss;\n+\tsrcofs = (int)((long long)(const void *)src & 0x0F);\n+\n+\t/**\n+\t * For aligned copy\n+\t */\n+\tif (srcofs == 0) {\n+\t\t/**\n+\t\t * Copy 256-byte blocks\n+\t\t */\n+\t\tfor (; n >= 256; n -= 256) {\n+\t\t\trte_mov256((uint8_t *)dst, (const uint8_t *)src);\n+\t\t\tdst = (uint8_t *)dst + 256;\n+\t\t\tsrc = (const uint8_t *)src + 256;\n+\t\t}\n+\n+\t\t/**\n+\t\t * Copy whatever left\n+\t\t */\n+\t\tgoto COPY_BLOCK_255_BACK15;\n+\t}\n+\n+\t/**\n+\t * For copy with unaligned load, use PALIGNR to force load alignment.\n+\t * Use switch here because PALIGNR requires immediate value for shift count.\n+\t */\n+\tswitch (srcofs) {\n+\tcase 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;\n+\tcase 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;\n+\tcase 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;\n+\tcase 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;\n+\tcase 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;\n+\tcase 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;\n+\tcase 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;\n+\tcase 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;\n+\tcase 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;\n+\tcase 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;\n+\tcase 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;\n+\tcase 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;\n+\tcase 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;\n+\tcase 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;\n+\tcase 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;\n+\tdefault:;\n+\t}\n+\n+\t/**\n+\t * Copy whatever left\n+\t */\n+\tgoto COPY_BLOCK_64_BACK15;\n }\n \n+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */\n+\n #ifdef __cplusplus\n }\n #endif\n",
    "prefixes": [
        "dpdk-dev",
        "4/4"
    ]
}
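
The Allow header above also lists PUT and PATCH, so the same endpoint accepts updates. The sketch below is illustrative only: it assumes token authentication (an Authorization: Token header, as used by Patchwork REST API deployments) and that the authenticated user has maintainer rights for the project; the token value is a placeholder.

    # Hedged sketch: partially update this patch's state via PATCH.
    # Assumes a valid Patchwork API token (placeholder below) and
    # sufficient permissions; "superseded" matches the state shown above.
    import requests

    url = "https://patches.dpdk.org/api/patches/2341/"
    headers = {"Authorization": "Token REPLACE_WITH_YOUR_TOKEN"}

    resp = requests.patch(url, headers=headers, json={"state": "superseded"})
    resp.raise_for_status()
    print(resp.json()["state"])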