From patchwork Tue Dec 19 11:14:28 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32452 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 6B23A1B019; Tue, 19 Dec 2017 12:15:01 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 8ECD21B019 for ; Tue, 19 Dec 2017 12:14:54 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3023078" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga002.fm.intel.com with ESMTP; 19 Dec 2017 03:14:51 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEpOm003085; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpIl010177; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpkX010173; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:28 +0000 Message-Id: <3aa7c38c79a039a6131b89e790418319c86a6ce2.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 01/23] eal: move get_virtual_area out of linuxapp eal_memory.c X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Move get_virtual_area out of linuxapp EAL memory and make it common to EAL, so that other code could reserve virtual areas as well. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memory.c | 70 ++++++++++++++++++++++++++++++ lib/librte_eal/common/eal_private.h | 29 +++++++++++++ lib/librte_eal/linuxapp/eal/eal_memory.c | 71 ++----------------------------- 3 files changed, 102 insertions(+), 68 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index fc6c44d..96570a7 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -31,6 +31,8 @@ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ +#include +#include #include #include #include @@ -49,6 +51,74 @@ #include "eal_internal_cfg.h" /* + * Try to mmap *size bytes in /dev/zero. If it is successful, return the + * pointer to the mmap'd area and keep *size unmodified. Else, retry + * with a smaller zone: decrease *size by hugepage_sz until it reaches + * 0. In this case, return NULL. Note: this function returns an address + * which is a multiple of hugepage size. 
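+ * Note: unlike the Linux-specific original, this common version reserves the
+ * area with an anonymous mapping (not /dev/zero), and decreases/aligns by the
+ * page_sz argument rather than a fixed hugepage size.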
+ */ + +static uint64_t baseaddr_offset; + +void * +eal_get_virtual_area(void *requested_addr, uint64_t *size, + uint64_t page_sz, int flags) +{ + bool addr_is_hint, allow_shrink; + void *addr; + + RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size); + + addr_is_hint = (flags & EAL_VIRTUAL_AREA_ADDR_IS_HINT) > 0; + allow_shrink = (flags & EAL_VIRTUAL_AREA_ALLOW_SHRINK) > 0; + + if (requested_addr == NULL && internal_config.base_virtaddr != 0) { + requested_addr = (void*) (internal_config.base_virtaddr + + baseaddr_offset); + addr_is_hint = true; + } + + do { + // TODO: we may not necessarily be using memory mapped by this + // function for hugepage mapping, so... HUGETLB flag? + + addr = mmap(requested_addr, + (*size) + page_sz, PROT_READ, +#ifdef RTE_ARCH_PPC_64 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, +#else + MAP_PRIVATE | MAP_ANONYMOUS, +#endif + -1, 0); + if (addr == MAP_FAILED && allow_shrink) + *size -= page_sz; + } while (allow_shrink && addr == MAP_FAILED && *size > 0); + + if (addr == MAP_FAILED) { + RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n", + strerror(errno)); + return NULL; + } else if (requested_addr != NULL && !addr_is_hint && + addr != requested_addr) { + RTE_LOG(ERR, EAL, "Cannot get a virtual area at requested address: %p\n", + requested_addr); + munmap(addr, (*size) + page_sz); + return NULL; + } + + /* align addr to page size boundary */ + addr = RTE_PTR_ALIGN(addr, page_sz); + + RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n", + addr, *size); + + baseaddr_offset += *size; + + return addr; +} + + +/* * Return a pointer to a read-only table of struct rte_physmem_desc * elements, containing the layout of all addressable physical * memory. The last element of the table contains a NULL address. diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h index 462226f..5d57fc1 100644 --- a/lib/librte_eal/common/eal_private.h +++ b/lib/librte_eal/common/eal_private.h @@ -34,6 +34,7 @@ #ifndef _EAL_PRIVATE_H_ #define _EAL_PRIVATE_H_ +#include #include #include #include @@ -224,4 +225,32 @@ int rte_eal_hugepage_attach(void); */ struct rte_bus *rte_bus_find_by_device_name(const char *str); +/** + * Get virtual area of specified size from the OS. + * + * This function is private to the EAL. + * + * @param requested_addr + * Address where to request address space. + * @param size + * Size of requested area. + * @param page_sz + * Page size on which to align requested virtual area. + * @param flags + * EAL_VIRTUAL_AREA_* flags. + * + * @return + * Virtual area address if successful. + * NULL if unsuccessful. + */ + +#define EAL_VIRTUAL_AREA_ADDR_IS_HINT 0x1 +/**< don't fail if cannot get exact requested address */ +#define EAL_VIRTUAL_AREA_ALLOW_SHRINK 0x2 +/**< try getting smaller sized (decrement by page size) virtual areas if cannot + * get area of requested size. */ +void * +eal_get_virtual_area(void *requested_addr, uint64_t *size, + uint64_t page_sz, int flags); + #endif /* _EAL_PRIVATE_H_ */ diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 16a181c..dd18d98 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -86,8 +86,6 @@ * zone as well as a physical contiguous zone. 
*/ -static uint64_t baseaddr_offset; - static bool phys_addrs_available = true; #define RANDOMIZE_VA_SPACE_FILE "/proc/sys/kernel/randomize_va_space" @@ -250,71 +248,6 @@ aslr_enabled(void) } } -/* - * Try to mmap *size bytes in /dev/zero. If it is successful, return the - * pointer to the mmap'd area and keep *size unmodified. Else, retry - * with a smaller zone: decrease *size by hugepage_sz until it reaches - * 0. In this case, return NULL. Note: this function returns an address - * which is a multiple of hugepage size. - */ -static void * -get_virtual_area(size_t *size, size_t hugepage_sz) -{ - void *addr; - int fd; - long aligned_addr; - - if (internal_config.base_virtaddr != 0) { - addr = (void*) (uintptr_t) (internal_config.base_virtaddr + - baseaddr_offset); - } - else addr = NULL; - - RTE_LOG(DEBUG, EAL, "Ask a virtual area of 0x%zx bytes\n", *size); - - fd = open("/dev/zero", O_RDONLY); - if (fd < 0){ - RTE_LOG(ERR, EAL, "Cannot open /dev/zero\n"); - return NULL; - } - do { - addr = mmap(addr, - (*size) + hugepage_sz, PROT_READ, -#ifdef RTE_ARCH_PPC_64 - MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -#else - MAP_PRIVATE, -#endif - fd, 0); - if (addr == MAP_FAILED) - *size -= hugepage_sz; - } while (addr == MAP_FAILED && *size > 0); - - if (addr == MAP_FAILED) { - close(fd); - RTE_LOG(ERR, EAL, "Cannot get a virtual area: %s\n", - strerror(errno)); - return NULL; - } - - munmap(addr, (*size) + hugepage_sz); - close(fd); - - /* align addr to a huge page size boundary */ - aligned_addr = (long)addr; - aligned_addr += (hugepage_sz - 1); - aligned_addr &= (~(hugepage_sz - 1)); - addr = (void *)(aligned_addr); - - RTE_LOG(DEBUG, EAL, "Virtual area found at %p (size = 0x%zx)\n", - addr, *size); - - /* increment offset */ - baseaddr_offset += *size; - - return addr; -} - static sigjmp_buf huge_jmpenv; static void huge_sigbus_handler(int signo __rte_unused) @@ -463,7 +396,9 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi, /* get the biggest virtual memory area up to * vma_len. If it fails, vma_addr is NULL, so * let the kernel provide the address. 
*/ - vma_addr = get_virtual_area(&vma_len, hpi->hugepage_sz); + vma_addr = eal_get_virtual_area(NULL, &vma_len, + hpi->hugepage_sz, + EAL_VIRTUAL_AREA_ALLOW_SHRINK); if (vma_addr == NULL) vma_len = hugepage_sz; } From patchwork Tue Dec 19 11:14:29 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32456 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 80DD91B170; Tue, 19 Dec 2017 12:15:08 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 53A8A1B01E for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="13553687" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga003.jf.intel.com with ESMTP; 19 Dec 2017 03:14:51 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEpiR003089; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpBx010185; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpjK010181; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:29 +0000 Message-Id: <9c79ab3e6218435b6d1af51dfe5d99aaad044d20.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 02/23] eal: add function to report number of detected sockets X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" At the moment, we always rely on scanning everything for every socket up until RTE_MAX_NUMA_NODES and checking if there's a memseg associated with each socket if we want to find out how many sockets we actually have. This becomes a problem when we may have memory on socket but it's not allocated yet, so we do the detection on lcore scan instead, and store the value for later use. 
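A minimal sketch (not part of this patch) of how a caller might use the new rte_num_sockets() to size per-socket state instead of looping up to RTE_MAX_NUMA_NODES; the names used here are hypothetical:

    #include <stdint.h>
    #include <stdlib.h>
    #include <rte_lcore.h>

    /* hypothetical per-NUMA-node counters, sized from the detected node count */
    static uint64_t *per_socket_bytes;

    static int
    init_per_socket_stats(void)
    {
    	unsigned int n_sockets = rte_num_sockets();

    	/* one counter per detected NUMA node, not RTE_MAX_NUMA_NODES */
    	per_socket_bytes = calloc(n_sockets, sizeof(*per_socket_bytes));
    	return per_socket_bytes == NULL ? -1 : 0;
    }
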
Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_lcore.c | 11 +++++++++++ lib/librte_eal/common/include/rte_eal.h | 1 + lib/librte_eal/common/include/rte_lcore.h | 8 ++++++++ 3 files changed, 20 insertions(+) diff --git a/lib/librte_eal/common/eal_common_lcore.c b/lib/librte_eal/common/eal_common_lcore.c index 0db1555..7566a6b 100644 --- a/lib/librte_eal/common/eal_common_lcore.c +++ b/lib/librte_eal/common/eal_common_lcore.c @@ -57,6 +57,7 @@ rte_eal_cpu_init(void) struct rte_config *config = rte_eal_get_configuration(); unsigned lcore_id; unsigned count = 0; + unsigned max_socket_id = 0; /* * Parse the maximum set of logical cores, detect the subset of running @@ -100,6 +101,8 @@ rte_eal_cpu_init(void) lcore_id, lcore_config[lcore_id].core_id, lcore_config[lcore_id].socket_id); count++; + max_socket_id = RTE_MAX(max_socket_id, + lcore_config[lcore_id].socket_id); } /* Set the count of enabled logical cores of the EAL configuration */ config->lcore_count = count; @@ -108,5 +111,13 @@ rte_eal_cpu_init(void) RTE_MAX_LCORE); RTE_LOG(INFO, EAL, "Detected %u lcore(s)\n", config->lcore_count); + config->numa_node_count = max_socket_id + 1; + RTE_LOG(INFO, EAL, "Detected %u NUMA nodes\n", config->numa_node_count); + return 0; } + +unsigned rte_num_sockets(void) { + const struct rte_config *config = rte_eal_get_configuration(); + return config->numa_node_count; +} diff --git a/lib/librte_eal/common/include/rte_eal.h b/lib/librte_eal/common/include/rte_eal.h index 8e4e71c..5b12914 100644 --- a/lib/librte_eal/common/include/rte_eal.h +++ b/lib/librte_eal/common/include/rte_eal.h @@ -83,6 +83,7 @@ enum rte_proc_type_t { struct rte_config { uint32_t master_lcore; /**< Id of the master lcore */ uint32_t lcore_count; /**< Number of available logical cores. */ + uint32_t numa_node_count; /**< Number of detected NUMA nodes. */ uint32_t service_lcore_count;/**< Number of available service cores. */ enum rte_lcore_role_t lcore_role[RTE_MAX_LCORE]; /**< State of cores. */ diff --git a/lib/librte_eal/common/include/rte_lcore.h b/lib/librte_eal/common/include/rte_lcore.h index c89e6ba..6a75c9b 100644 --- a/lib/librte_eal/common/include/rte_lcore.h +++ b/lib/librte_eal/common/include/rte_lcore.h @@ -148,6 +148,14 @@ rte_lcore_index(int lcore_id) unsigned rte_socket_id(void); /** + * Return number of physical sockets on the system. 
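+ * The count is detected once during rte_eal_cpu_init(), from the socket IDs
+ * of the detected lcores, and cached in the EAL configuration.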
+ * @return + * the number of physical sockets as recognized by EAL + * + */ +unsigned rte_num_sockets(void); + +/** * Get the ID of the physical socket of the specified lcore * * @param lcore_id From patchwork Tue Dec 19 11:14:30 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32462 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 245121B1C3; Tue, 19 Dec 2017 12:15:15 +0100 (CET) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by dpdk.org (Postfix) with ESMTP id 638551B017 for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="188280521" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga006.fm.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEp2K003092; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpeo010192; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpRi010188; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:30 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 03/23] eal: add rte_fbarray X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" rte_fbarray is a simple resizable array, not unlike vectors in higher-level languages. Rationale for its existence is the following: since we are going to map memory page-by-page, there could be quite a lot of memory segments to keep track of (for smaller page sizes, page count can easily reach thousands). We can't really make page lists truly dynamic and infinitely expandable, because that involves reallocating memory (which is a big no-no in multiprocess). What we can do instead is have a maximum capacity as something really, really large, and preallocate address space for that, but only use a small portion of that memory as needed, via mmap()'ing portions of the address space to an actual file. This also doubles as a mechanism to share fbarrays between processes (although multiprocess is neither implemented nor tested at the moment). Hence the name: file-backed array. In addition, in understanding that we will frequently need to scan this array for free space and iterating over array linearly can become slow, rte_fbarray provides facilities to index array's usage. 
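A brief editorial sketch (not part of this patch) of how the fbarray API declared further down might be used, with a hypothetical name and sizes; the concrete use cases it is designed for are listed below:

    struct rte_fbarray arr;
    int idx;

    /* hypothetical: up to 1024 entries of 8 bytes each, starting with 64 */
    if (rte_fbarray_alloc(&arr, "example", 64, 1024, 8) < 0)
    	return -1;

    /* find a free slot, fill it in and mark it as used */
    idx = rte_fbarray_find_next_free(&arr, 0);
    if (idx >= 0) {
    	uint64_t *elt = rte_fbarray_get(&arr, idx);
    	*elt = 42;
    	rte_fbarray_set_used(&arr, idx, true);
    }
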
The following use cases are covered: - find next free/used slot (useful either for adding new elements to fbarray, or walking the list) - find starting index for next N free/used slots (useful for when we want to allocate chunk of VA-contiguous memory composed of several pages) - find how many contiguous free/used slots there are, starting from specified index (useful for when we want to figure out how many pages we have until next hole in allocated memory, to speed up some bulk operations where we would otherwise have to walk the array and add pages one by one) This is accomplished by storing a usage mask in-memory, right after the data section of the array, and using some bit-level magic to figure out the info we need. rte_fbarray is a bit clunky to use and its primary purpose is to be used within EAL for certain things, but hopefully it is (or can be made so) generic enough to be useful in other contexts. Note that current implementation is leaking fd's whenever new allocations happen. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/Makefile | 2 +- lib/librte_eal/common/eal_common_fbarray.c | 585 ++++++++++++++++++++++++++++ lib/librte_eal/common/eal_filesystem.h | 13 + lib/librte_eal/common/include/rte_fbarray.h | 98 +++++ lib/librte_eal/linuxapp/eal/Makefile | 1 + 5 files changed, 698 insertions(+), 1 deletion(-) create mode 100755 lib/librte_eal/common/eal_common_fbarray.c create mode 100755 lib/librte_eal/common/include/rte_fbarray.h diff --git a/lib/librte_eal/common/Makefile b/lib/librte_eal/common/Makefile index 9effd0d..7868698 100644 --- a/lib/librte_eal/common/Makefile +++ b/lib/librte_eal/common/Makefile @@ -43,7 +43,7 @@ INC += rte_hexdump.h rte_devargs.h rte_bus.h rte_dev.h INC += rte_pci_dev_feature_defs.h rte_pci_dev_features.h INC += rte_malloc.h rte_keepalive.h rte_time.h INC += rte_service.h rte_service_component.h -INC += rte_bitmap.h rte_vfio.h +INC += rte_bitmap.h rte_vfio.h rte_fbarray.h GENERIC_INC := rte_atomic.h rte_byteorder.h rte_cycles.h rte_prefetch.h GENERIC_INC += rte_spinlock.h rte_memcpy.h rte_cpuflags.h rte_rwlock.h diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c new file mode 100755 index 0000000..6e71909 --- /dev/null +++ b/lib/librte_eal/common/eal_common_fbarray.c @@ -0,0 +1,585 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "eal_filesystem.h" +#include "eal_private.h" + +#include "rte_fbarray.h" + +#define MASK_SHIFT 6ULL +#define MASK_ALIGN (1 << MASK_SHIFT) +#define MASK_LEN_TO_IDX(x) ((x) >> MASK_SHIFT) +#define MASK_LEN_TO_MOD(x) ((x) - RTE_ALIGN_FLOOR(x, MASK_ALIGN)) +#define MASK_GET_IDX(idx, mod) ((idx << MASK_SHIFT) + mod) + +/* + * This is a mask that is always stored at the end of array, to provide fast + * way of finding free/used spots without looping through each element. + */ + +struct used_mask { + int n_masks; + uint64_t data[]; +}; + +static size_t +calc_mask_size(int len) { + return sizeof(struct used_mask) + sizeof(uint64_t) * MASK_LEN_TO_IDX(len); +} + +static size_t +calc_data_size(size_t page_sz, int elt_sz, int len) { + size_t data_sz = elt_sz * len; + size_t msk_sz = calc_mask_size(len); + return RTE_ALIGN_CEIL(data_sz + msk_sz, page_sz); +} + +static struct used_mask * +get_used_mask(void *data, int elt_sz, int len) { + return (struct used_mask *) RTE_PTR_ADD(data, elt_sz * len); +} + +static void +move_mask(void *data, int elt_sz, int old_len, int new_len) { + struct used_mask *old_msk, *new_msk; + + old_msk = get_used_mask(data, elt_sz, old_len); + new_msk = get_used_mask(data, elt_sz, new_len); + + memset(new_msk, 0, calc_mask_size(new_len)); + memcpy(new_msk, old_msk, calc_mask_size(old_len)); + memset(old_msk, 0, calc_mask_size(old_len)); + new_msk->n_masks = MASK_LEN_TO_IDX(new_len); +} + +static int +expand_and_map(void *addr, const char *name, size_t old_len, size_t new_len) { + char path[PATH_MAX]; + void *map_addr, *adj_addr; + size_t map_len; + int fd, ret = 0; + + map_len = new_len - old_len; + adj_addr = RTE_PTR_ADD(addr, old_len); + + eal_get_fbarray_path(path, sizeof(path), name); + + /* open our file */ + fd = open(path, O_CREAT | O_RDWR, 0600); + if (fd < 0) { + RTE_LOG(ERR, EAL, "Cannot open %s\n", path); + return -1; + } + if (ftruncate(fd, new_len)) { + RTE_LOG(ERR, EAL, "Cannot truncate %s\n", path); + ret = -1; + goto out; + } + + map_addr = mmap(adj_addr, map_len, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd, old_len); + if (map_addr != adj_addr) { + RTE_LOG(ERR, EAL, "mmap() failed: %s\n", strerror(errno)); + ret = -1; + goto out; + } +out: + close(fd); + return ret; +} + +static int +find_next_n(const struct used_mask *msk, int start, int n, bool used) { + int msk_idx, lookahead_idx, first, first_mod; + uint64_t first_msk; + + /* + * mask only has granularity of MASK_ALIGN, but start may not be aligned + * on that boundary, so construct a special mask to exclude anything we + * don't want to see to avoid confusing ctz. 
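+ * (first_msk below clears the low first_mod bits, i.e. everything before
+ * start within the first 64-bit mask word.)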
+ */ + first = MASK_LEN_TO_IDX(start); + first_mod = MASK_LEN_TO_MOD(start); + first_msk = ~((1ULL << first_mod) - 1); + + for (msk_idx = first; msk_idx < msk->n_masks; msk_idx++) { + uint64_t cur_msk, lookahead_msk; + int run_start, clz, left; + bool found = false; + /* + * The process of getting n consecutive bits for arbitrary n is + * a bit involved, but here it is in a nutshell: + * + * 1. let n be the number of consecutive bits we're looking for + * 2. check if n can fit in one mask, and if so, do n-1 + * rshift-ands to see if there is an appropriate run inside + * our current mask + * 2a. if we found a run, bail out early + * 2b. if we didn't find a run, proceed + * 3. invert the mask and count leading zeroes (that is, count + * how many consecutive set bits we had starting from the + * end of current mask) as k + * 3a. if k is 0, continue to next mask + * 3b. if k is not 0, we have a potential run + * 4. to satisfy our requirements, next mask must have n-k + * consecutive set bits right at the start, so we will do + * (n-k-1) rshift-ands and check if first bit is set. + * + * Step 4 will need to be repeated if (n-k) > MASK_ALIGN until + * we either run out of masks, lose the run, or find what we + * were looking for. + */ + cur_msk = msk->data[msk_idx]; + left = n; + + /* if we're looking for free spaces, invert the mask */ + if (!used) + cur_msk = ~cur_msk; + + /* ignore everything before start on first iteration */ + if (msk_idx == first) + cur_msk &= first_msk; + + /* if n can fit in within a single mask, do a search */ + if (n <= MASK_ALIGN) { + uint64_t tmp_msk = cur_msk; + int s_idx; + for (s_idx = 0; s_idx < n - 1; s_idx++) { + tmp_msk &= tmp_msk >> 1ULL; + } + /* we found what we were looking for */ + if (tmp_msk != 0) { + run_start = __builtin_ctzll(tmp_msk); + return MASK_GET_IDX(msk_idx, run_start); + } + } + + /* + * we didn't find our run within the mask, or n > MASK_ALIGN, + * so we're going for plan B. + */ + + /* count leading zeroes on inverted mask */ + clz = __builtin_clzll(~cur_msk); + + /* if there aren't any runs at the end either, just continue */ + if (clz == 0) + continue; + + /* we have a partial run at the end, so try looking ahead */ + run_start = MASK_ALIGN - clz; + left -= clz; + + for (lookahead_idx = msk_idx + 1; lookahead_idx < msk->n_masks; + lookahead_idx++) { + int s_idx, need; + lookahead_msk = msk->data[lookahead_idx]; + + /* if we're looking for free space, invert the mask */ + if (!used) + lookahead_msk = ~lookahead_msk; + + /* figure out how many consecutive bits we need here */ + need = RTE_MIN(left, MASK_ALIGN); + + for (s_idx = 0; s_idx < need - 1; s_idx++) + lookahead_msk &= lookahead_msk >> 1ULL; + + /* if first bit is not set, we've lost the run */ + if ((lookahead_msk & 1) == 0) + break; + + left -= need; + + /* check if we've found what we were looking for */ + if (left == 0) { + found = true; + break; + } + } + + /* we didn't find anything, so continue */ + if (!found) { + continue; + } + + return MASK_GET_IDX(msk_idx, run_start); + } + return used ? -ENOENT : -ENOSPC; +} + +static int +find_next(const struct used_mask *msk, int start, bool used) { + int idx, first, first_mod; + uint64_t first_msk; + + /* + * mask only has granularity of MASK_ALIGN, but start may not be aligned + * on that boundary, so construct a special mask to exclude anything we + * don't want to see to avoid confusing ctz. 
+ */ + first = MASK_LEN_TO_IDX(start); + first_mod = MASK_LEN_TO_MOD(start); + first_msk = ~((1ULL << first_mod) - 1ULL); + + for (idx = first; idx < msk->n_masks; idx++) { + uint64_t cur = msk->data[idx]; + int found; + + /* if we're looking for free entries, invert mask */ + if (!used) { + cur = ~cur; + } + + /* ignore everything before start on first iteration */ + if (idx == first) + cur &= first_msk; + + /* check if we have any entries */ + if (cur == 0) + continue; + + /* + * find first set bit - that will correspond to whatever it is + * that we're looking for. + */ + found = __builtin_ctzll(cur); + return MASK_GET_IDX(idx, found); + } + return used ? -ENOENT : -ENOSPC; +} + +static int +find_contig(const struct used_mask *msk, int start, bool used) { + int idx, first; + int need_len, result = 0; + + first = MASK_LEN_TO_IDX(start); + for (idx = first; idx < msk->n_masks; idx++, result += need_len) { + uint64_t cur = msk->data[idx]; + int run_len; + + need_len = MASK_ALIGN; + + /* if we're looking for free entries, invert mask */ + if (!used) { + cur = ~cur; + } + + /* ignore everything before start on first iteration */ + if (idx == first) { + cur >>= start; + /* at the start, we don't need the full mask len */ + need_len -= start; + } + + /* we will be looking for zeroes, so invert the mask */ + cur = ~cur; + + /* if mask is zero, we have a complete run */ + if (cur == 0) + continue; + + /* + * see if current run ends before mask end. + */ + run_len = __builtin_ctzll(cur); + + /* add however many zeroes we've had in the last run and quit */ + if (run_len < need_len) { + result += run_len; + break; + } + } + return result; +} + +int +rte_fbarray_alloc(struct rte_fbarray *arr, const char *name, int cur_len, + int max_len, int elt_sz) { + size_t max_mmap_len, cur_mmap_len, page_sz; + char path[PATH_MAX]; + struct used_mask *msk; + void *data; + + // TODO: validation + + /* lengths must be aligned */ + cur_len = RTE_ALIGN_CEIL(cur_len, MASK_ALIGN); + max_len = RTE_ALIGN_CEIL(max_len, MASK_ALIGN); + + page_sz = sysconf(_SC_PAGESIZE); + + cur_mmap_len = calc_data_size(page_sz, elt_sz, cur_len); + max_mmap_len = calc_data_size(page_sz, elt_sz, max_len); + + data = eal_get_virtual_area(NULL, &max_mmap_len, page_sz, 0); + if (data == NULL) + return -1; + + eal_get_fbarray_path(path, sizeof(path), name); + unlink(path); + + if (expand_and_map(data, name, 0, cur_mmap_len)) { + return -1; + } + + /* populate data structure */ + snprintf(arr->name, sizeof(arr->name), "%s", name); + arr->data = data; + arr->capacity = max_len; + arr->len = cur_len; + arr->elt_sz = elt_sz; + arr->count = 0; + + msk = get_used_mask(data, elt_sz, cur_len); + msk->n_masks = MASK_LEN_TO_IDX(cur_len); + + return 0; +} + +int +rte_fbarray_attach(const struct rte_fbarray *arr) { + uint64_t max_mmap_len, cur_mmap_len, page_sz; + void *data; + + page_sz = sysconf(_SC_PAGESIZE); + + cur_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->len); + max_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->capacity); + + data = eal_get_virtual_area(arr->data, &max_mmap_len, page_sz, 0); + if (data == NULL) + return -1; + + if (expand_and_map(data, arr->name, 0, cur_mmap_len)) { + return -1; + } + + return 0; +} + +void +rte_fbarray_free(struct rte_fbarray *arr) { + size_t page_sz = sysconf(_SC_PAGESIZE); + munmap(arr->data, calc_data_size(page_sz, arr->elt_sz, arr->capacity)); + memset(arr, 0, sizeof(*arr)); +} + +int +rte_fbarray_resize(struct rte_fbarray *arr, int new_len) { + size_t cur_mmap_len, new_mmap_len, page_sz; + + 
// TODO: validation + if (arr->len >= new_len) { + RTE_LOG(ERR, EAL, "Invalid length: %i >= %i\n", arr->len, new_len); + return -1; + } + + page_sz = sysconf(_SC_PAGESIZE); + + new_len = RTE_ALIGN_CEIL(new_len, MASK_ALIGN); + + cur_mmap_len = calc_data_size(page_sz, arr->elt_sz, arr->len); + new_mmap_len = calc_data_size(page_sz, arr->elt_sz, new_len); + + if (cur_mmap_len != new_mmap_len && + expand_and_map(arr->data, arr->name, cur_mmap_len, + new_mmap_len)) { + return -1; + } + + move_mask(arr->data, arr->elt_sz, arr->len, new_len); + + arr->len = new_len; + + return 0; +} + +void * +rte_fbarray_get(const struct rte_fbarray *arr, int idx) { + if (idx >= arr->len || idx < 0) + return NULL; + return RTE_PTR_ADD(arr->data, idx * arr->elt_sz); +} + +// TODO: replace -1 with debug sanity checks +int +rte_fbarray_set_used(struct rte_fbarray *arr, int idx, bool used) { + struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len); + bool already_used; + int msk_idx = MASK_LEN_TO_IDX(idx); + uint64_t msk_bit = 1ULL << MASK_LEN_TO_MOD(idx); + + if (idx >= arr->len || idx < 0) + return -1; + + already_used = (msk->data[msk_idx] & msk_bit) != 0; + + /* nothing to be done */ + if (used == already_used) + return 0; + + if (used) { + msk->data[msk_idx] |= msk_bit; + arr->count++; + } else { + msk->data[msk_idx] &= ~msk_bit; + arr->count--; + } + + return 0; +} +int +rte_fbarray_is_used(const struct rte_fbarray *arr, int idx) { + struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len); + int msk_idx = MASK_LEN_TO_IDX(idx); + uint64_t msk_bit = 1ULL << MASK_LEN_TO_MOD(idx); + + if (idx >= arr->len || idx < 0) + return -1; + + return (msk->data[msk_idx] & msk_bit) != 0; +} + +int +rte_fbarray_find_next_free(const struct rte_fbarray *arr, int start) { + if (start >= arr->len || start < 0) + return -EINVAL; + + if (arr->len == arr->count) + return -ENOSPC; + + return find_next(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, false); +} + +int +rte_fbarray_find_next_used(const struct rte_fbarray *arr, int start) { + if (start >= arr->len || start < 0) + return -EINVAL; + + if (arr->count == 0) + return -1; + + return find_next(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, true); +} + +int +rte_fbarray_find_next_n_free(const struct rte_fbarray *arr, int start, int n) { + if (start >= arr->len || start < 0 || n > arr->len) + return -EINVAL; + + if (arr->len == arr->count || arr->len - arr->count < n) + return -ENOSPC; + + return find_next_n(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, n, false); +} + +int +rte_fbarray_find_next_n_used(const struct rte_fbarray *arr, int start, int n) { + if (start >= arr->len || start < 0 || n > arr->len) + return -EINVAL; + + if (arr->count < n) + return -ENOENT; + + return find_next_n(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, n, true); +} + +int +rte_fbarray_find_contig_free(const struct rte_fbarray *arr, int start) { + if (start >= arr->len || start < 0) + return -EINVAL; + + if (arr->len == arr->count) + return -ENOSPC; + + if (arr->count == 0) + return arr->len - start; + + return find_contig(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, false); +} + +int +rte_fbarray_find_contig_used(const struct rte_fbarray *arr, int start) { + if (start >= arr->len || start < 0) + return -EINVAL; + + if (arr->count == 0) + return -ENOENT; + + return find_contig(get_used_mask(arr->data, arr->elt_sz, arr->len), + start, true); +} + +int +rte_fbarray_find_idx(const struct rte_fbarray *arr, const 
void *elt) { + void *end; + + end = RTE_PTR_ADD(arr->data, arr->elt_sz * arr->len); + + if (elt < arr->data || elt >= end) + return -EINVAL; + return RTE_PTR_DIFF(elt, arr->data) / arr->elt_sz; +} + +void +rte_fbarray_dump_metadata(const struct rte_fbarray *arr, FILE *f) { + const struct used_mask *msk = get_used_mask(arr->data, arr->elt_sz, arr->len); + + fprintf(f, "File-backed array: %s\n", arr->name); + fprintf(f, "size: %i occupied: %i capacity: %i elt_sz: %i\n", + arr->len, arr->count, arr->capacity, arr->elt_sz); + if (!arr->data) { + fprintf(f, "not allocated\n"); + return; + } + + for (int i = 0; i < msk->n_masks; i++) { + fprintf(f, "msk idx %i: 0x%016lx\n", i, msk->data[i]); + } +} diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h index 8acbd99..10e7474 100644 --- a/lib/librte_eal/common/eal_filesystem.h +++ b/lib/librte_eal/common/eal_filesystem.h @@ -42,6 +42,7 @@ /** Path of rte config file. */ #define RUNTIME_CONFIG_FMT "%s/.%s_config" +#define FBARRAY_FMT "%s/%s_%s" #include #include @@ -67,6 +68,18 @@ eal_runtime_config_path(void) return buffer; } +static inline const char * +eal_get_fbarray_path(char *buffer, size_t buflen, const char *name) { + const char *directory = default_config_dir; + const char *home_dir = getenv("HOME"); + + if (getuid() != 0 && home_dir != NULL) + directory = home_dir; + snprintf(buffer, buflen - 1, FBARRAY_FMT, directory, + internal_config.hugefile_prefix, name); + return buffer; +} + /** Path of hugepage info file. */ #define HUGEPAGE_INFO_FMT "%s/.%s_hugepage_info" diff --git a/lib/librte_eal/common/include/rte_fbarray.h b/lib/librte_eal/common/include/rte_fbarray.h new file mode 100755 index 0000000..d06c1ac --- /dev/null +++ b/lib/librte_eal/common/include/rte_fbarray.h @@ -0,0 +1,98 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ */ + +#ifndef RTE_FBARRAY_H +#define RTE_FBARRAY_H + +#include +#include + +#define RTE_FBARRAY_NAME_LEN 64 + +struct rte_fbarray { + char name[RTE_FBARRAY_NAME_LEN]; /**< name associated with an array */ + int count; /**< number of entries stored */ + int len; /**< current length of the array */ + int capacity; /**< maximum length of the array */ + int elt_sz; /**< size of each element */ + void *data; /**< data pointer */ +}; + +// TODO: tmp? shmget? + +int +rte_fbarray_alloc(struct rte_fbarray *arr, const char *name, int cur_len, + int max_len, int elt_sz); + +int +rte_fbarray_attach(const struct rte_fbarray *arr); + +void +rte_fbarray_free(struct rte_fbarray *arr); + +int +rte_fbarray_resize(struct rte_fbarray *arr, int new_len); + +void * +rte_fbarray_get(const struct rte_fbarray *arr, int idx); + +int +rte_fbarray_find_idx(const struct rte_fbarray *arr, const void *elt); + +int +rte_fbarray_set_used(struct rte_fbarray *arr, int idx, bool used); + +int +rte_fbarray_is_used(const struct rte_fbarray *arr, int idx); + +int +rte_fbarray_find_next_free(const struct rte_fbarray *arr, int start); + +int +rte_fbarray_find_next_used(const struct rte_fbarray *arr, int start); + +int +rte_fbarray_find_next_n_free(const struct rte_fbarray *arr, int start, int n); + +int +rte_fbarray_find_next_n_used(const struct rte_fbarray *arr, int start, int n); + +int +rte_fbarray_find_contig_free(const struct rte_fbarray *arr, int start); + +int +rte_fbarray_find_contig_used(const struct rte_fbarray *arr, int start); + +void +rte_fbarray_dump_metadata(const struct rte_fbarray *arr, FILE *f); + +#endif // RTE_FBARRAY_H diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile index 5a7b8b2..782e1ad 100644 --- a/lib/librte_eal/linuxapp/eal/Makefile +++ b/lib/librte_eal/linuxapp/eal/Makefile @@ -86,6 +86,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_dev.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_options.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_thread.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_proc.c +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_fbarray.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_malloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += malloc_elem.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += malloc_heap.c From patchwork Tue Dec 19 11:14:31 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32455 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 613C61B16B; Tue, 19 Dec 2017 12:15:07 +0100 (CET) Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 562B11B01F for ; Tue, 19 Dec 2017 12:14:54 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="19495265" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga002.jf.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEpal003095; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com 
(localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpPN010200; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEprw010196; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:31 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 04/23] eal: move all locking to heap X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Down the line, we will need to do everything from the heap as any alloc or free may trigger alloc/free OS memory, which would involve growing/shrinking heap. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 16 ++-------------- lib/librte_eal/common/malloc_heap.c | 36 ++++++++++++++++++++++++++++++++++++ lib/librte_eal/common/malloc_heap.h | 6 ++++++ lib/librte_eal/common/rte_malloc.c | 4 ++-- 4 files changed, 46 insertions(+), 16 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 98bcd37..6b4f2a5 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -271,10 +271,6 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) int malloc_elem_free(struct malloc_elem *elem) { - if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) - return -1; - - rte_spinlock_lock(&(elem->heap->lock)); size_t sz = elem->size - sizeof(*elem) - MALLOC_ELEM_TRAILER_LEN; uint8_t *ptr = (uint8_t *)&elem[1]; struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); @@ -302,8 +298,6 @@ malloc_elem_free(struct malloc_elem *elem) memset(ptr, 0, sz); - rte_spinlock_unlock(&(elem->heap->lock)); - return 0; } @@ -320,11 +314,10 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) return 0; struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); - rte_spinlock_lock(&elem->heap->lock); if (next ->state != ELEM_FREE) - goto err_return; + return -1; if (elem->size + next->size < new_size) - goto err_return; + return -1; /* we now know the element fits, so remove from free list, * join the two @@ -339,10 +332,5 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) split_elem(elem, split_pt); malloc_elem_free_list_insert(split_pt); } - rte_spinlock_unlock(&elem->heap->lock); return 0; - -err_return: - rte_spinlock_unlock(&elem->heap->lock); - return -1; } diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 267a4c6..099e448 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -174,6 +174,42 @@ malloc_heap_alloc(struct malloc_heap *heap, return elem == NULL ? 
NULL : (void *)(&elem[1]); } +int +malloc_heap_free(struct malloc_elem *elem) { + struct malloc_heap *heap; + int ret; + + if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) + return -1; + + /* elem may be merged with previous element, so keep heap address */ + heap = elem->heap; + + rte_spinlock_lock(&(heap->lock)); + + ret = malloc_elem_free(elem); + + rte_spinlock_unlock(&(heap->lock)); + + return ret; +} + +int +malloc_heap_resize(struct malloc_elem *elem, size_t size) { + int ret; + + if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) + return -1; + + rte_spinlock_lock(&(elem->heap->lock)); + + ret = malloc_elem_resize(elem, size); + + rte_spinlock_unlock(&(elem->heap->lock)); + + return ret; +} + /* * Function to retrieve data for heap on given socket */ diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index 3ccbef0..3767ef3 100644 --- a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -57,6 +57,12 @@ malloc_heap_alloc(struct malloc_heap *heap, const char *type, size_t size, unsigned flags, size_t align, size_t bound); int +malloc_heap_free(struct malloc_elem *elem); + +int +malloc_heap_resize(struct malloc_elem *elem, size_t size); + +int malloc_heap_get_stats(const struct malloc_heap *heap, struct rte_malloc_socket_stats *socket_stats); diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index fe2278b..74b5417 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -58,7 +58,7 @@ void rte_free(void *addr) { if (addr == NULL) return; - if (malloc_elem_free(malloc_elem_from_data(addr)) < 0) + if (malloc_heap_free(malloc_elem_from_data(addr)) < 0) rte_panic("Fatal error: Invalid memory\n"); } @@ -169,7 +169,7 @@ rte_realloc(void *ptr, size_t size, unsigned align) size = RTE_CACHE_LINE_ROUNDUP(size), align = RTE_CACHE_LINE_ROUNDUP(align); /* check alignment matches first, and if ok, see if we can resize block */ if (RTE_PTR_ALIGN(ptr,align) == ptr && - malloc_elem_resize(elem, size) == 0) + malloc_heap_resize(elem, size) == 0) return ptr; /* either alignment is off, or we have no room to expand, From patchwork Tue Dec 19 11:14:32 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32453 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 93EF91B03F; Tue, 19 Dec 2017 12:15:04 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 50D831B01C for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="14827537" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga001.fm.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEprR003098; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEprO010208; Tue, 19 Dec 2017 11:14:51 GMT 
Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpcj010204; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:32 +0000 Message-Id: <384ff9cb344a35f47f30506070592076ea2cb48e.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 05/23] eal: protect malloc heap stats with a lock X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This does not change the public API, as this API is not meant to be called directly. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_heap.c | 7 ++++++- lib/librte_eal/common/malloc_heap.h | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 099e448..b3a1043 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -214,12 +214,14 @@ malloc_heap_resize(struct malloc_elem *elem, size_t size) { * Function to retrieve data for heap on given socket */ int -malloc_heap_get_stats(const struct malloc_heap *heap, +malloc_heap_get_stats(struct malloc_heap *heap, struct rte_malloc_socket_stats *socket_stats) { size_t idx; struct malloc_elem *elem; + rte_spinlock_lock(&(heap->lock)); + /* Initialise variables for heap */ socket_stats->free_count = 0; socket_stats->heap_freesz_bytes = 0; @@ -241,6 +243,9 @@ malloc_heap_get_stats(const struct malloc_heap *heap, socket_stats->heap_allocsz_bytes = (socket_stats->heap_totalsz_bytes - socket_stats->heap_freesz_bytes); socket_stats->alloc_count = heap->alloc_count; + + rte_spinlock_unlock(&(heap->lock)); + return 0; } diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index 3767ef3..df04dd8 100644 --- a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -63,7 +63,7 @@ int malloc_heap_resize(struct malloc_elem *elem, size_t size); int -malloc_heap_get_stats(const struct malloc_heap *heap, +malloc_heap_get_stats(struct malloc_heap *heap, struct rte_malloc_socket_stats *socket_stats); int From patchwork Tue Dec 19 11:14:33 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32468 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 93A1C1B1D8; Tue, 19 Dec 2017 12:15:24 +0100 (CET) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by dpdk.org (Postfix) with ESMTP id 911561B020 for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3584772" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga007.fm.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com 
(sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEphN003101; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpLZ010215; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpkk010211; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:33 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 06/23] eal: make malloc a doubly-linked list X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" As we are preparing for dynamic memory allocation, we need to be able to handle holes in our malloc heap, hence we're switching to doubly linked list, and prepare infrastructure to support it. Since our heap is now aware where are our first and last elements, there is no longer any need to have a dummy element at the end of each heap, so get rid of that as well. Instead, let insert/remove/ join/split operations handle end-of-list conditions automatically. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/include/rte_malloc_heap.h | 6 + lib/librte_eal/common/malloc_elem.c | 196 +++++++++++++++++++----- lib/librte_eal/common/malloc_elem.h | 7 +- lib/librte_eal/common/malloc_heap.c | 8 +- 4 files changed, 170 insertions(+), 47 deletions(-) diff --git a/lib/librte_eal/common/include/rte_malloc_heap.h b/lib/librte_eal/common/include/rte_malloc_heap.h index b270356..48a46c9 100644 --- a/lib/librte_eal/common/include/rte_malloc_heap.h +++ b/lib/librte_eal/common/include/rte_malloc_heap.h @@ -42,12 +42,18 @@ /* Number of free lists per heap, grouped by size. 
*/ #define RTE_HEAP_NUM_FREELISTS 13 +/* dummy definition, for pointers */ +struct malloc_elem; + /** * Structure to hold malloc heap */ struct malloc_heap { rte_spinlock_t lock; LIST_HEAD(, malloc_elem) free_head[RTE_HEAP_NUM_FREELISTS]; + struct malloc_elem *first; + struct malloc_elem *last; + unsigned alloc_count; size_t total_size; } __rte_cache_aligned; diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 6b4f2a5..7609a9b 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -60,6 +60,7 @@ malloc_elem_init(struct malloc_elem *elem, elem->heap = heap; elem->ms = ms; elem->prev = NULL; + elem->next = NULL; memset(&elem->free_list, 0, sizeof(elem->free_list)); elem->state = ELEM_FREE; elem->size = size; @@ -68,15 +69,56 @@ malloc_elem_init(struct malloc_elem *elem, set_trailer(elem); } -/* - * Initialize a dummy malloc_elem header for the end-of-memseg marker - */ void -malloc_elem_mkend(struct malloc_elem *elem, struct malloc_elem *prev) +malloc_elem_insert(struct malloc_elem *elem) { - malloc_elem_init(elem, prev->heap, prev->ms, 0); - elem->prev = prev; - elem->state = ELEM_BUSY; /* mark busy so its never merged */ + struct malloc_elem *prev_elem, *next_elem; + struct malloc_heap *heap = elem->heap; + + if (heap->first == NULL && heap->last == NULL) { + /* if empty heap */ + heap->first = elem; + heap->last = elem; + prev_elem = NULL; + next_elem = NULL; + } else if (elem < heap->first) { + /* if lower than start */ + prev_elem = NULL; + next_elem = heap->first; + heap->first = elem; + } else if (elem > heap->last) { + /* if higher than end */ + prev_elem = heap->last; + next_elem = NULL; + heap->last = elem; + } else { + /* the new memory is somewhere inbetween start and end */ + uint64_t dist_from_start, dist_from_end; + + dist_from_end = RTE_PTR_DIFF(heap->last, elem); + dist_from_start = RTE_PTR_DIFF(elem, heap->first); + + /* check which is closer, and find closest list entries */ + if (dist_from_start < dist_from_end) { + prev_elem = heap->first; + while (prev_elem->next < elem) + prev_elem = prev_elem->next; + next_elem = prev_elem->next; + } else { + next_elem = heap->last; + while (next_elem->prev > elem) + next_elem = next_elem->prev; + prev_elem = next_elem->prev; + } + } + + /* insert new element */ + elem->prev = prev_elem; + elem->next = next_elem; + if (prev_elem) + prev_elem->next = elem; + if (next_elem) + next_elem->prev = elem; } /* @@ -126,18 +168,55 @@ malloc_elem_can_hold(struct malloc_elem *elem, size_t size, unsigned align, static void split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt) { - struct malloc_elem *next_elem = RTE_PTR_ADD(elem, elem->size); + struct malloc_elem *next_elem = elem->next; const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem; const size_t new_elem_size = elem->size - old_elem_size; malloc_elem_init(split_pt, elem->heap, elem->ms, new_elem_size); split_pt->prev = elem; - next_elem->prev = split_pt; + split_pt->next = next_elem; + if (next_elem) + next_elem->prev = split_pt; + else + elem->heap->last = split_pt; + elem->next = split_pt; elem->size = old_elem_size; set_trailer(elem); } /* + * our malloc heap is a doubly linked list, so doubly remove our element. 
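+ * Unlinking also updates heap->first / heap->last when the element being
+ * removed sits at either end of the list.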
+ */ +static void __rte_unused +remove_elem(struct malloc_elem *elem) { + struct malloc_elem *next, *prev; + next = elem->next; + prev = elem->prev; + + if (next) + next->prev = prev; + else + elem->heap->last = prev; + if (prev) + prev->next = next; + else + elem->heap->first = next; + + elem->prev = NULL; + elem->next = NULL; +} + +static int +next_elem_is_adjacent(struct malloc_elem *elem) { + return elem->next == RTE_PTR_ADD(elem, elem->size); +} + +static int +prev_elem_is_adjacent(struct malloc_elem *elem) { + return elem == RTE_PTR_ADD(elem->prev, elem->prev->size); +} + +/* * Given an element size, compute its freelist index. * We free an element into the freelist containing similarly-sized elements. * We try to allocate elements starting with the freelist containing @@ -220,6 +299,9 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align, split_elem(elem, new_free_elem); malloc_elem_free_list_insert(new_free_elem); + + if (elem == elem->heap->last) + elem->heap->last = new_free_elem; } if (old_elem_size < MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { @@ -258,9 +340,61 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align, static inline void join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) { - struct malloc_elem *next = RTE_PTR_ADD(elem2, elem2->size); + struct malloc_elem *next = elem2->next; elem1->size += elem2->size; - next->prev = elem1; + if (next) + next->prev = elem1; + else + elem1->heap->last = elem1; + elem1->next = next; +} + +static struct malloc_elem * +elem_join_adjacent_free(struct malloc_elem *elem) { + /* + * check if next element exists, is adjacent and is free, if so join + * with it, need to remove from free list. + */ + if (elem->next != NULL && elem->next->state == ELEM_FREE && + next_elem_is_adjacent(elem)){ + void *erase; + + /* we will want to erase the trailer and header */ + erase = RTE_PTR_SUB(elem->next, MALLOC_ELEM_TRAILER_LEN); + + /* remove from free list, join to this one */ + elem_free_list_remove(elem->next); + join_elem(elem, elem->next); + + /* erase header and trailer */ + memset(erase, 0, MALLOC_ELEM_OVERHEAD); + } + + /* + * check if prev element exists, is adjacent and is free, if so join + * with it, need to remove from free list. 
+ */ + if (elem->prev != NULL && elem->prev->state == ELEM_FREE && + prev_elem_is_adjacent(elem)) { + struct malloc_elem *new_elem; + void *erase; + + /* we will want to erase trailer and header */ + erase = RTE_PTR_SUB(elem, MALLOC_ELEM_TRAILER_LEN); + + /* remove from free list, join to this one */ + elem_free_list_remove(elem->prev); + + new_elem = elem->prev; + join_elem(new_elem, elem); + + /* erase header and trailer */ + memset(erase, 0, MALLOC_ELEM_OVERHEAD); + + elem = new_elem; + } + + return elem; } /* @@ -271,32 +405,20 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) int malloc_elem_free(struct malloc_elem *elem) { - size_t sz = elem->size - sizeof(*elem) - MALLOC_ELEM_TRAILER_LEN; - uint8_t *ptr = (uint8_t *)&elem[1]; - struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); - if (next->state == ELEM_FREE){ - /* remove from free list, join to this one */ - elem_free_list_remove(next); - join_elem(elem, next); - sz += (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN); - } + void *ptr; + size_t data_len; + + ptr = RTE_PTR_ADD(elem, sizeof(*elem)); + data_len = elem->size - MALLOC_ELEM_OVERHEAD; + + elem = elem_join_adjacent_free(elem); - /* check if previous element is free, if so join with it and return, - * need to re-insert in free list, as that element's size is changing - */ - if (elem->prev != NULL && elem->prev->state == ELEM_FREE) { - elem_free_list_remove(elem->prev); - join_elem(elem->prev, elem); - sz += (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN); - ptr -= (sizeof(*elem) + MALLOC_ELEM_TRAILER_LEN); - elem = elem->prev; - } malloc_elem_free_list_insert(elem); /* decrease heap's count of allocated elements */ elem->heap->alloc_count--; - memset(ptr, 0, sz); + memset(ptr, 0, data_len); return 0; } @@ -309,21 +431,23 @@ int malloc_elem_resize(struct malloc_elem *elem, size_t size) { const size_t new_size = size + elem->pad + MALLOC_ELEM_OVERHEAD; + /* if we request a smaller size, then always return ok */ if (elem->size >= new_size) return 0; - struct malloc_elem *next = RTE_PTR_ADD(elem, elem->size); - if (next ->state != ELEM_FREE) + /* check if there is a next element, it's free and adjacent */ + if (!elem->next || elem->next->state != ELEM_FREE || + !next_elem_is_adjacent(elem)) return -1; - if (elem->size + next->size < new_size) + if (elem->size + elem->next->size < new_size) return -1; /* we now know the element fits, so remove from free list, * join the two */ - elem_free_list_remove(next); - join_elem(elem, next); + elem_free_list_remove(elem->next); + join_elem(elem, elem->next); if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD) { /* now we have a big block together. 
Lets cut it down a bit, by splitting */ diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index ce39129..b3d39c0 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -48,6 +48,7 @@ enum elem_state { struct malloc_elem { struct malloc_heap *heap; struct malloc_elem *volatile prev; /* points to prev elem in memseg */ + struct malloc_elem *volatile next; /* points to next elem in memseg */ LIST_ENTRY(malloc_elem) free_list; /* list of free elements in heap */ const struct rte_memseg *ms; volatile enum elem_state state; @@ -139,12 +140,8 @@ malloc_elem_init(struct malloc_elem *elem, const struct rte_memseg *ms, size_t size); -/* - * initialise a dummy malloc_elem header for the end-of-memseg marker - */ void -malloc_elem_mkend(struct malloc_elem *elem, - struct malloc_elem *prev_free); +malloc_elem_insert(struct malloc_elem *elem); /* * return true if the current malloc_elem can hold a block of data diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index b3a1043..1b35468 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -99,15 +99,11 @@ check_hugepage_sz(unsigned flags, uint64_t hugepage_sz) static void malloc_heap_add_memseg(struct malloc_heap *heap, struct rte_memseg *ms) { - /* allocate the memory block headers, one at end, one at start */ struct malloc_elem *start_elem = (struct malloc_elem *)ms->addr; - struct malloc_elem *end_elem = RTE_PTR_ADD(ms->addr, - ms->len - MALLOC_ELEM_OVERHEAD); - end_elem = RTE_PTR_ALIGN_FLOOR(end_elem, RTE_CACHE_LINE_SIZE); - const size_t elem_size = (uintptr_t)end_elem - (uintptr_t)start_elem; + const size_t elem_size = ms->len - MALLOC_ELEM_OVERHEAD; malloc_elem_init(start_elem, heap, ms, elem_size); - malloc_elem_mkend(end_elem, start_elem); + malloc_elem_insert(start_elem); malloc_elem_free_list_insert(start_elem); heap->total_size += elem_size; From patchwork Tue Dec 19 11:14:34 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32457 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id CAA231B1A3; Tue, 19 Dec 2017 12:15:09 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id AB87A1B020 for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="185440362" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga005.jf.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEpWL003104; Tue, 19 Dec 2017 11:14:51 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEpVL010223; Tue, 19 Dec 2017 11:14:51 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEpit010218; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, 
keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:34 +0000 Message-Id: <65cbd0ba86f5ab39b7f3567fe866a68ea04d7f71.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 07/23] eal: make malloc_elem_join_adjacent_free public X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" We need this function to join newly allocated segments with the heap. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/malloc_elem.c | 6 +++--- lib/librte_eal/common/malloc_elem.h | 3 +++ 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 7609a9b..782aaa7 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -349,8 +349,8 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2) elem1->next = next; } -static struct malloc_elem * -elem_join_adjacent_free(struct malloc_elem *elem) { +struct malloc_elem * +malloc_elem_join_adjacent_free(struct malloc_elem *elem) { /* * check if next element exists, is adjacent and is free, if so join * with it, need to remove from free list. @@ -411,7 +411,7 @@ malloc_elem_free(struct malloc_elem *elem) ptr = RTE_PTR_ADD(elem, sizeof(*elem)); data_len = elem->size - MALLOC_ELEM_OVERHEAD; - elem = elem_join_adjacent_free(elem); + elem = malloc_elem_join_adjacent_free(elem); malloc_elem_free_list_insert(elem); diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index b3d39c0..cf27b59 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -167,6 +167,9 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, int malloc_elem_free(struct malloc_elem *elem); +struct malloc_elem * +malloc_elem_join_adjacent_free(struct malloc_elem *elem); + /* * attempt to resize a malloc_elem by expanding into any free space * immediately after it in memory. 
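For illustration, here is a rough sketch of the intended use of the now-public function: a hypothetical caller (not part of this patch) that stitches a freshly mapped memory segment into an existing heap and coalesces it with any adjacent free elements. The function name and includes below are assumptions; the call sequence mirrors the existing heap code.

	/* Hypothetical caller sketch: add a new segment to a heap, then join it
	 * with any adjacent free elements and put the result on a free list. */
	#include "malloc_elem.h"
	#include "malloc_heap.h"

	static void
	sketch_heap_add_segment(struct malloc_heap *heap, const struct rte_memseg *ms)
	{
		struct malloc_elem *elem = ms->addr;
		const size_t elem_size = ms->len - MALLOC_ELEM_OVERHEAD;

		malloc_elem_init(elem, heap, ms, elem_size);
		malloc_elem_insert(elem);

		/* coalesce with free neighbours already present in the heap */
		elem = malloc_elem_join_adjacent_free(elem);
		malloc_elem_free_list_insert(elem);

		heap->total_size += elem_size;
	}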
From patchwork Tue Dec 19 11:14:35 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32454 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 29CB31B052; Tue, 19 Dec 2017 12:15:06 +0100 (CET) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by dpdk.org (Postfix) with ESMTP id 49DBA1B01B for ; Tue, 19 Dec 2017 12:14:54 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3913246" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga007.jf.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqks003107; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqB9010230; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEp1s010226; Tue, 19 Dec 2017 11:14:51 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:35 +0000 Message-Id: <8cf0f1b4a35b5f49fc0849dc06886a317e534eb9.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 08/23] eal: add "single file segments" command-line option X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" For now, this option does nothing, but it will be useful in dynamic memory allocation down the line. Currently, DPDK stores all pages as separate files in hugetlbfs. This option will allow storing all pages in one file (one file per socket, per page size). 
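As a hedged illustration of where this option is headed (nothing below is part of this patch; the helper name and file naming scheme are hypothetical), later allocation code could branch on the new internal_config field roughly like this:

	/* Hypothetical sketch: choose a hugepage backing file name depending on
	 * the new option. Only internal_config.single_file_segments comes from
	 * this patch; everything else here is illustrative. */
	#include <stdio.h>
	#include "eal_internal_cfg.h"

	static void
	sketch_backing_file_path(char *path, size_t len, const char *hugedir,
			int socket_id, int seg_idx)
	{
		if (internal_config.single_file_segments)
			/* one file per (page size, socket); segments become offsets */
			snprintf(path, len, "%s/seg_socket%d", hugedir, socket_id);
		else
			/* current behaviour: one file per hugepage */
			snprintf(path, len, "%s/seg_%d_%d", hugedir, socket_id, seg_idx);
	}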
Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_options.c | 4 ++++ lib/librte_eal/common/eal_internal_cfg.h | 3 +++ lib/librte_eal/common/eal_options.h | 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 1 + 4 files changed, 10 insertions(+) diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index 996a034..c3f7c41 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -98,6 +98,7 @@ eal_long_options[] = { {OPT_VDEV, 1, NULL, OPT_VDEV_NUM }, {OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM }, {OPT_VMWARE_TSC_MAP, 0, NULL, OPT_VMWARE_TSC_MAP_NUM }, + {OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM}, {0, 0, NULL, 0 } }; @@ -1158,6 +1159,9 @@ eal_parse_common_option(int opt, const char *optarg, } core_parsed = 1; break; + case OPT_SINGLE_FILE_SEGMENTS_NUM: + conf->single_file_segments = 1; + break; /* don't know what to do, leave this to caller */ default: diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index fa6ccbe..484a32e 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -76,6 +76,9 @@ struct internal_config { volatile unsigned force_sockets; volatile uint64_t socket_mem[RTE_MAX_NUMA_NODES]; /**< amount of memory per socket */ uintptr_t base_virtaddr; /**< base address to try and reserve memory from */ + volatile unsigned single_file_segments; + /**< true if storing all pages within single files (per-page-size, + * per-node). */ volatile int syslog_facility; /**< facility passed to openlog() */ /** default interrupt mode for VFIO */ volatile enum rte_intr_mode vfio_intr_mode; diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h index 30e6bb4..26a682a 100644 --- a/lib/librte_eal/common/eal_options.h +++ b/lib/librte_eal/common/eal_options.h @@ -83,6 +83,8 @@ enum { OPT_VFIO_INTR_NUM, #define OPT_VMWARE_TSC_MAP "vmware-tsc-map" OPT_VMWARE_TSC_MAP_NUM, +#define OPT_SINGLE_FILE_SEGMENTS "single-file-segments" + OPT_SINGLE_FILE_SEGMENTS_NUM, OPT_LONG_MAX_NUM }; diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 229eec9..2a3127f 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -366,6 +366,7 @@ eal_usage(const char *prgname) " --"OPT_BASE_VIRTADDR" Base virtual address\n" " --"OPT_CREATE_UIO_DEV" Create /dev/uioX (usually done by hotplug)\n" " --"OPT_VFIO_INTR" Interrupt mode for VFIO (legacy|msi|msix)\n" + " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if ( rte_application_usage_hook ) { From patchwork Tue Dec 19 11:14:36 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32458 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 125011B1A8; Tue, 19 Dec 2017 12:15:11 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id A51A41B019 for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 
2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="14827539" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga001.fm.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqiD003110; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqbR010237; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEqHC010233; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:36 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 09/23] eal: add "legacy memory" option X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This adds a "--legacy-mem" command-line switch. It will be used to go back to the old memory behavior, one where we can't dynamically allocate/free memory (the downside), but one where the user can get physically contiguous memory, like before (the upside). For now, nothing but the legacy behavior exists, non-legacy memory init sequence will be added later. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_options.c | 4 ++++ lib/librte_eal/common/eal_internal_cfg.h | 3 +++ lib/librte_eal/common/eal_options.h | 2 ++ lib/librte_eal/linuxapp/eal/eal.c | 1 + lib/librte_eal/linuxapp/eal/eal_memory.c | 22 ++++++++++++++++++---- 5 files changed, 28 insertions(+), 4 deletions(-) diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index c3f7c41..88ff35a 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -99,6 +99,7 @@ eal_long_options[] = { {OPT_VFIO_INTR, 1, NULL, OPT_VFIO_INTR_NUM }, {OPT_VMWARE_TSC_MAP, 0, NULL, OPT_VMWARE_TSC_MAP_NUM }, {OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM}, + {OPT_LEGACY_MEM, 0, NULL, OPT_LEGACY_MEM_NUM }, {0, 0, NULL, 0 } }; @@ -1162,6 +1163,9 @@ eal_parse_common_option(int opt, const char *optarg, case OPT_SINGLE_FILE_SEGMENTS_NUM: conf->single_file_segments = 1; break; + case OPT_LEGACY_MEM_NUM: + conf->legacy_mem = 1; + break; /* don't know what to do, leave this to caller */ default: diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h index 484a32e..62ab15b 100644 --- a/lib/librte_eal/common/eal_internal_cfg.h +++ b/lib/librte_eal/common/eal_internal_cfg.h @@ -79,6 +79,9 @@ struct internal_config { volatile unsigned single_file_segments; /**< true if storing all pages within single files (per-page-size, * per-node). */ + volatile unsigned legacy_mem; + /**< true to enable legacy memory behavior (no dynamic allocation, + * contiguous segments). 
*/ volatile int syslog_facility; /**< facility passed to openlog() */ /** default interrupt mode for VFIO */ volatile enum rte_intr_mode vfio_intr_mode; diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h index 26a682a..d09b034 100644 --- a/lib/librte_eal/common/eal_options.h +++ b/lib/librte_eal/common/eal_options.h @@ -85,6 +85,8 @@ enum { OPT_VMWARE_TSC_MAP_NUM, #define OPT_SINGLE_FILE_SEGMENTS "single-file-segments" OPT_SINGLE_FILE_SEGMENTS_NUM, +#define OPT_LEGACY_MEM "legacy-mem" + OPT_LEGACY_MEM_NUM, OPT_LONG_MAX_NUM }; diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 2a3127f..37ae8e0 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -367,6 +367,7 @@ eal_usage(const char *prgname) " --"OPT_CREATE_UIO_DEV" Create /dev/uioX (usually done by hotplug)\n" " --"OPT_VFIO_INTR" Interrupt mode for VFIO (legacy|msi|msix)\n" " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" + " --"OPT_LEGACY_MEM" Legacy memory mode (no dynamic allocation, contiguous segments)\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if ( rte_application_usage_hook ) { diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index dd18d98..5b18af9 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -940,8 +940,8 @@ huge_recover_sigbus(void) * 6. unmap the first mapping * 7. fill memsegs in configuration with contiguous zones */ -int -rte_eal_hugepage_init(void) +static int +eal_legacy_hugepage_init(void) { struct rte_mem_config *mcfg; struct hugepage_file *hugepage = NULL, *tmp_hp = NULL; @@ -1283,8 +1283,8 @@ getFileSize(int fd) * configuration and finds the hugepages which form that segment, mapping them * in order to form a contiguous block in the virtual memory space */ -int -rte_eal_hugepage_attach(void) +static int +eal_legacy_hugepage_attach(void) { const struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; struct hugepage_file *hp = NULL; @@ -1435,6 +1435,20 @@ rte_eal_hugepage_attach(void) } int +rte_eal_hugepage_init(void) { + if (internal_config.legacy_mem) + return eal_legacy_hugepage_init(); + return -1; +} + +int +rte_eal_hugepage_attach(void) { + if (internal_config.legacy_mem) + return eal_legacy_hugepage_attach(); + return -1; +} + +int rte_eal_using_phys_addrs(void) { return phys_addrs_available; From patchwork Tue Dec 19 11:14:37 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32464 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C16B31B1CD; Tue, 19 Dec 2017 12:15:17 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 4B6A11B01F for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="4054827" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga006.jf.intel.com with ESMTP; 19 Dec 2017 03:14:52 -0800 Received: from sivswdev01.ir.intel.com 
(sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqVT003113; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqGF010245; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEqTQ010240; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:37 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 10/23] eal: read hugepage counts from node-specific sysfs path X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" For non-legacy memory init mode, instead of looking at generic sysfs path, look at sysfs paths pertaining to each NUMA node for hugepage counts. Note that per-NUMA node path does not provide information regarding reserved pages, so we might not get the best info from these paths, but this saves us from the whole mapping/remapping business before we're actually able to tell which page is on which socket, because we no longer require our memory to be physically contiguous. Legacy memory init will not use this. Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_hugepage_info.c | 73 +++++++++++++++++++++++-- 1 file changed, 67 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c index 86e174f..a85c15a 100644 --- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c +++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c @@ -59,6 +59,7 @@ #include "eal_filesystem.h" static const char sys_dir_path[] = "/sys/kernel/mm/hugepages"; +static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node"; /* this function is only called from eal_hugepage_info_init which itself * is only called from a primary process */ @@ -99,6 +100,42 @@ get_num_hugepages(const char *subdir) return num_pages; } +static uint32_t +get_num_hugepages_on_node(const char *subdir, unsigned socket) { + char path[PATH_MAX], socketpath[PATH_MAX]; + DIR *socketdir; + long unsigned num_pages = 0; + const char *nr_hp_file = "free_hugepages"; + + snprintf(socketpath, sizeof(socketpath), "%s/node%u/hugepages", + sys_pages_numa_dir_path, socket); + + socketdir = opendir(socketpath); + if (socketdir) { + /* Keep calm and carry on */ + closedir(socketdir); + } else { + /* Can't find socket dir, so ignore it */ + return 0; + } + + snprintf(path, sizeof(path), "%s/%s/%s", + socketpath, subdir, nr_hp_file); + if (eal_parse_sysfs_value(path, &num_pages) < 0) + return 0; + + if (num_pages == 0) + RTE_LOG(WARNING, EAL, "No free hugepages reported in %s\n", + subdir); + + /* we want to return a uint32_t and more than this looks suspicious + * anyway ... 
*/ + if (num_pages > UINT32_MAX) + num_pages = UINT32_MAX; + + return num_pages; +} + static uint64_t get_default_hp_size(void) { @@ -277,7 +314,7 @@ eal_hugepage_info_init(void) { const char dirent_start_text[] = "hugepages-"; const size_t dirent_start_len = sizeof(dirent_start_text) - 1; - unsigned i, num_sizes = 0; + unsigned i, total_pages, num_sizes = 0; DIR *dir; struct dirent *dirent; @@ -331,9 +368,24 @@ eal_hugepage_info_init(void) if (clear_hugedir(hpi->hugedir) == -1) break; - /* for now, put all pages into socket 0, - * later they will be sorted */ - hpi->num_pages[0] = get_num_hugepages(dirent->d_name); + /* first, try to put all hugepages into relevant sockets, but + * if first attempts fails, fall back to collecting all pages + * in one socket and sorting them later */ + total_pages = 0; + /* we also don't want to do this for legacy init */ + if (!internal_config.legacy_mem) + for (i = 0; i < rte_num_sockets(); i++) { + unsigned num_pages = + get_num_hugepages_on_node( + dirent->d_name, i); + hpi->num_pages[i] = num_pages; + total_pages += num_pages; + } + /* we failed to sort memory from the get go, so fall + * back to old way */ + if (total_pages == 0) { + hpi->num_pages[0] = get_num_hugepages(dirent->d_name); + } #ifndef RTE_ARCH_64 /* for 32-bit systems, limit number of hugepages to @@ -357,10 +409,19 @@ eal_hugepage_info_init(void) sizeof(internal_config.hugepage_info[0]), compare_hpi); /* now we have all info, check we have at least one valid size */ - for (i = 0; i < num_sizes; i++) + for (i = 0; i < num_sizes; i++) { + /* pages may no longer all be on socket 0, so check all */ + unsigned j, num_pages = 0; + + for (j = 0; j < RTE_MAX_NUMA_NODES; j++) { + struct hugepage_info *hpi = + &internal_config.hugepage_info[i]; + num_pages += hpi->num_pages[j]; + } if (internal_config.hugepage_info[i].hugedir != NULL && - internal_config.hugepage_info[i].num_pages[0] > 0) + num_pages > 0) return 0; + } /* no valid hugepage mounts available, return error */ return -1; From patchwork Tue Dec 19 11:14:38 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32467 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id DC4D01B1E0; Tue, 19 Dec 2017 12:15:22 +0100 (CET) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by dpdk.org (Postfix) with ESMTP id 060C81B01B for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3913250" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga007.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqk7003117; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqqI010256; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEq1J010252; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: 
andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:38 +0000 Message-Id: <47525ef673993d1b0fa091c3b8b7305d5ccec671.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 11/23] eal: replace memseg with memseg lists X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Before, we were aggregating multiple pages into one memseg, so the number of memsegs was small. Now, each page gets its own memseg, so the list of memsegs is huge. To accommodate the new memseg list size and to keep the under-the-hood workings sane, the memseg list is now not just a single list, but multiple lists. To be precise, each hugepage size available on the system gets a memseg list per socket (so, for example, on a 2-socket system with 2M and 1G hugepages, we will get 4 memseg lists). In order to support dynamic memory allocation, we reserve all memory in advance. That is, we do an anonymous mmap() of the entire maximum size of memory per hugepage size (limited to either RTE_MAX_MEMSEG_PER_LIST pages or 128G worth of memory, whichever is smaller). The limit is arbitrary. So, for each hugepage size, we get (by default) up to 128G worth of memory per socket. The address space is claimed at the start, in eal_common_memory.c. The actual page allocation code is in eal_memalloc.c (Linux-only for now), and largely consists of moved EAL memory init code. Pages in the list are also indexed by address. That is, for non-legacy mode, in order to figure out where a page belongs, one can simply look at the base address of the memseg list. Similarly, figuring out the IOVA address of a memzone is a matter of finding the right memseg list, getting the offset and dividing it by the page size to get the appropriate memseg. For legacy mode, the old behavior of walking the memseg list remains. Due to the switch to fbarray, secondary processes are not currently supported nor tested. Also, one particular API call (dump physmem layout) no longer makes sense, not only because there can now be holes in the memseg list, but also because there are several memseg lists to choose from. In legacy mode, nothing is preallocated, and all memsegs are in a list like before, but each segment still resides in an appropriate memseg list. The rest of the changes are ripple effects from the memseg change: heap changes, compile fixes, and rewrites to support fbarray-backed memseg lists.
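To make the address-to-memseg lookup described above concrete, here is a minimal sketch (a simplified version of rte_mem_virt2memseg() from the diff below; it assumes the correct memseg list is already known, header names are assumptions, and bounds checks are omitted):

	/* Sketch: the page index within a memseg list is simply the offset of
	 * the address from the list's base VA, divided by the list's page size. */
	#include <rte_common.h>
	#include <rte_eal_memconfig.h>	/* struct rte_memseg_list */
	#include <rte_fbarray.h>
	#include <rte_memory.h>

	static const struct rte_memseg *
	sketch_virt2memseg(const void *addr, const struct rte_memseg_list *msl)
	{
		int ms_idx = RTE_PTR_DIFF(addr, msl->base_va) / msl->hugepage_sz;

		return rte_fbarray_get(&msl->memseg_arr, ms_idx);
	}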
Signed-off-by: Anatoly Burakov --- config/common_base | 3 +- drivers/bus/pci/linux/pci.c | 29 ++- drivers/net/virtio/virtio_user/vhost_kernel.c | 106 +++++--- lib/librte_eal/common/eal_common_memory.c | 245 ++++++++++++++++-- lib/librte_eal/common/eal_common_memzone.c | 5 +- lib/librte_eal/common/eal_hugepages.h | 1 + lib/librte_eal/common/include/rte_eal_memconfig.h | 22 +- lib/librte_eal/common/include/rte_memory.h | 16 ++ lib/librte_eal/common/malloc_elem.c | 8 +- lib/librte_eal/common/malloc_elem.h | 6 +- lib/librte_eal/common/malloc_heap.c | 88 +++++-- lib/librte_eal/common/rte_malloc.c | 20 +- lib/librte_eal/linuxapp/eal/eal.c | 21 +- lib/librte_eal/linuxapp/eal/eal_memory.c | 299 ++++++++++++++-------- lib/librte_eal/linuxapp/eal/eal_vfio.c | 162 ++++++++---- test/test/test_malloc.c | 29 ++- test/test/test_memory.c | 44 +++- test/test/test_memzone.c | 17 +- 18 files changed, 815 insertions(+), 306 deletions(-) diff --git a/config/common_base b/config/common_base index e74febe..9730d4c 100644 --- a/config/common_base +++ b/config/common_base @@ -90,7 +90,8 @@ CONFIG_RTE_CACHE_LINE_SIZE=64 CONFIG_RTE_LIBRTE_EAL=y CONFIG_RTE_MAX_LCORE=128 CONFIG_RTE_MAX_NUMA_NODES=8 -CONFIG_RTE_MAX_MEMSEG=256 +CONFIG_RTE_MAX_MEMSEG_LISTS=16 +CONFIG_RTE_MAX_MEMSEG_PER_LIST=32768 CONFIG_RTE_MAX_MEMZONE=2560 CONFIG_RTE_MAX_TAILQ=32 CONFIG_RTE_ENABLE_ASSERT=n diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c index 5da6728..6d3100f 100644 --- a/drivers/bus/pci/linux/pci.c +++ b/drivers/bus/pci/linux/pci.c @@ -148,19 +148,30 @@ rte_pci_unmap_device(struct rte_pci_device *dev) void * pci_find_max_end_va(void) { - const struct rte_memseg *seg = rte_eal_get_physmem_layout(); - const struct rte_memseg *last = seg; - unsigned i = 0; + void *cur_end, *max_end = NULL; + int i = 0; - for (i = 0; i < RTE_MAX_MEMSEG; i++, seg++) { - if (seg->addr == NULL) - break; + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + const struct rte_fbarray *arr = &msl->memseg_arr; - if (seg->addr > last->addr) - last = seg; + if (arr->capacity == 0) + continue; + /* + * we need to handle legacy mem case, so don't rely on page size + * to calculate max VA end + */ + while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, i); + cur_end = RTE_PTR_ADD(ms->addr, ms->len); + if (cur_end > max_end) + max_end = cur_end; + } } - return RTE_PTR_ADD(last->addr, last->len); + return max_end; } /* parse one line of the "resource" sysfs file (note that the 'line' diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c index 68d28b1..f3f1549 100644 --- a/drivers/net/virtio/virtio_user/vhost_kernel.c +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c @@ -99,6 +99,40 @@ static uint64_t vhost_req_user_to_kernel[] = { [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, }; +/* returns number of segments processed */ +static int +add_memory_region(struct vhost_memory_region *mr, const struct rte_fbarray *arr, + int reg_start_idx, int max) { + const struct rte_memseg *ms; + void *start_addr, *expected_addr; + uint64_t len; + int idx; + + idx = reg_start_idx; + len = 0; + start_addr = NULL; + expected_addr = NULL; + + /* we could've relied on page size, but we have to support legacy mem */ + while (idx < max){ + ms = rte_fbarray_get(arr, idx); + if (expected_addr == NULL) { + start_addr = ms->addr; + 
expected_addr = RTE_PTR_ADD(ms->addr, ms->len); + } else if (ms->addr != expected_addr) + break; + len += ms->len; + idx++; + } + + mr->guest_phys_addr = (uint64_t)(uintptr_t) start_addr; + mr->userspace_addr = (uint64_t)(uintptr_t) start_addr; + mr->memory_size = len; + mr->mmap_offset = 0; + + return idx; +} + /* By default, vhost kernel module allows 64 regions, but DPDK allows * 256 segments. As a relief, below function merges those virtually * adjacent memsegs into one region. @@ -106,8 +140,7 @@ static uint64_t vhost_req_user_to_kernel[] = { static struct vhost_memory_kernel * prepare_vhost_memory_kernel(void) { - uint32_t i, j, k = 0; - struct rte_memseg *seg; + uint32_t list_idx, region_nr = 0; struct vhost_memory_region *mr; struct vhost_memory_kernel *vm; @@ -117,52 +150,41 @@ prepare_vhost_memory_kernel(void) if (!vm) return NULL; - for (i = 0; i < RTE_MAX_MEMSEG; ++i) { - seg = &rte_eal_get_configuration()->mem_config->memseg[i]; - if (!seg->addr) - break; - - int new_region = 1; - - for (j = 0; j < k; ++j) { - mr = &vm->regions[j]; + for (list_idx = 0; list_idx < RTE_MAX_MEMSEG_LISTS; ++list_idx) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + const struct rte_memseg_list *msl = &mcfg->memsegs[list_idx]; + const struct rte_fbarray *arr = &msl->memseg_arr; + int reg_start_idx, search_idx; - if (mr->userspace_addr + mr->memory_size == - (uint64_t)(uintptr_t)seg->addr) { - mr->memory_size += seg->len; - new_region = 0; - break; - } - - if ((uint64_t)(uintptr_t)seg->addr + seg->len == - mr->userspace_addr) { - mr->guest_phys_addr = - (uint64_t)(uintptr_t)seg->addr; - mr->userspace_addr = - (uint64_t)(uintptr_t)seg->addr; - mr->memory_size += seg->len; - new_region = 0; - break; - } - } - - if (new_region == 0) + /* skip empty segment lists */ + if (arr->count == 0) continue; - mr = &vm->regions[k++]; - /* use vaddr here! */ - mr->guest_phys_addr = (uint64_t)(uintptr_t)seg->addr; - mr->userspace_addr = (uint64_t)(uintptr_t)seg->addr; - mr->memory_size = seg->len; - mr->mmap_offset = 0; - - if (k >= max_regions) { - free(vm); - return NULL; + search_idx = 0; + while ((reg_start_idx = rte_fbarray_find_next_used(arr, + search_idx)) >= 0) { + int reg_n_pages; + if (region_nr >= max_regions) { + free(vm); + return NULL; + } + mr = &vm->regions[region_nr++]; + + /* + * we know memseg starts at search_idx, check how many + * segments there are + */ + reg_n_pages = rte_fbarray_find_contig_used(arr, + search_idx); + + /* look at at most reg_n_pages of memsegs */ + search_idx = add_memory_region(mr, arr, reg_start_idx, + search_idx + reg_n_pages); } } - vm->nregions = k; + vm->nregions = region_nr; vm->padding = 0; return vm; } diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 96570a7..bdd465b 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -42,6 +42,7 @@ #include #include +#include #include #include #include @@ -58,6 +59,8 @@ * which is a multiple of hugepage size. 
*/ +#define MEMSEG_LIST_FMT "memseg-%luk-%i" + static uint64_t baseaddr_offset; void * @@ -117,6 +120,178 @@ eal_get_virtual_area(void *requested_addr, uint64_t *size, return addr; } +static uint64_t +get_mem_amount(uint64_t page_sz) { + uint64_t area_sz; + + // TODO: saner heuristics + /* limit to RTE_MAX_MEMSEG_PER_LIST pages or 128G worth of memory */ + area_sz = RTE_MIN(page_sz * RTE_MAX_MEMSEG_PER_LIST, 1ULL << 37); + + return rte_align64pow2(area_sz); +} + +static int +get_max_num_pages(uint64_t page_sz, uint64_t mem_amount) { + return mem_amount / page_sz; +} + +static int +get_min_num_pages(int max_pages) { + return RTE_MIN(256, max_pages); +} + +static int +alloc_memseg_list(struct rte_memseg_list *msl, uint64_t page_sz, + int socket_id) { + char name[RTE_FBARRAY_NAME_LEN]; + int min_pages, max_pages; + uint64_t mem_amount; + void *addr; + + if (!internal_config.legacy_mem) { + mem_amount = get_mem_amount(page_sz); + max_pages = get_max_num_pages(page_sz, mem_amount); + min_pages = get_min_num_pages(max_pages); + + // TODO: allow shrink? + addr = eal_get_virtual_area(NULL, &mem_amount, page_sz, 0); + if (addr == NULL) { + RTE_LOG(ERR, EAL, "Cannot reserve memory\n"); + return -1; + } + } else { + addr = NULL; + min_pages = 256; + max_pages = 256; + } + + snprintf(name, sizeof(name), MEMSEG_LIST_FMT, page_sz >> 10, socket_id); + if (rte_fbarray_alloc(&msl->memseg_arr, name, min_pages, max_pages, + sizeof(struct rte_memseg))) { + RTE_LOG(ERR, EAL, "Cannot allocate memseg list\n"); + return -1; + } + + msl->hugepage_sz = page_sz; + msl->socket_id = socket_id; + msl->base_va = addr; + + return 0; +} + +static int +memseg_init(void) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int socket_id, hpi_idx, msl_idx = 0; + struct rte_memseg_list *msl; + + if (rte_eal_process_type() == RTE_PROC_SECONDARY) { + RTE_LOG(ERR, EAL, "Secondary process not supported\n"); + return -1; + } + + /* create memseg lists */ + for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes; + hpi_idx++) { + struct hugepage_info *hpi; + uint64_t hugepage_sz; + + hpi = &internal_config.hugepage_info[hpi_idx]; + hugepage_sz = hpi->hugepage_sz; + + for (socket_id = 0; socket_id < (int) rte_num_sockets(); + socket_id++) { + if (msl_idx >= RTE_MAX_MEMSEG_LISTS) { + RTE_LOG(ERR, EAL, + "No more space in memseg lists\n"); + return -1; + } + msl = &mcfg->memsegs[msl_idx++]; + + if (alloc_memseg_list(msl, hugepage_sz, socket_id)) { + return -1; + } + } + } + return 0; +} + +static const struct rte_memseg * +virt2memseg(const void *addr, const struct rte_memseg_list *msl) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + const struct rte_fbarray *arr; + int msl_idx, ms_idx; + + /* first, find appropriate memseg list, if it wasn't specified */ + if (msl == NULL) { + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + void *start, *end; + msl = &mcfg->memsegs[msl_idx]; + + start = msl->base_va; + end = RTE_PTR_ADD(start, msl->hugepage_sz * + msl->memseg_arr.capacity); + if (addr >= start && addr < end) + break; + } + /* if we didn't find our memseg list */ + if (msl_idx == RTE_MAX_MEMSEG_LISTS) + return NULL; + } else { + /* a memseg list was specified, check if it's the right one */ + void *start, *end; + start = msl->base_va; + end = RTE_PTR_ADD(start, msl->hugepage_sz * + msl->memseg_arr.capacity); + + if (addr < start || addr >= end) + return NULL; + } + + /* now, calculate index */ + arr = &msl->memseg_arr; + ms_idx = 
RTE_PTR_DIFF(addr, msl->base_va) / msl->hugepage_sz; + return rte_fbarray_get(arr, ms_idx); +} + +static const struct rte_memseg * +virt2memseg_legacy(const void *addr) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + const struct rte_memseg_list *msl; + const struct rte_fbarray *arr; + int msl_idx, ms_idx; + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + msl = &mcfg->memsegs[msl_idx]; + arr = &msl->memseg_arr; + + ms_idx = 0; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) { + const struct rte_memseg *ms; + void *start, *end; + ms = rte_fbarray_get(arr, ms_idx); + start = ms->addr; + end = RTE_PTR_ADD(start, ms->len); + if (addr >= start && addr < end) + return ms; + ms_idx++; + } + } + return NULL; +} + +const struct rte_memseg * +rte_mem_virt2memseg(const void *addr, const struct rte_memseg_list *msl) { + /* for legacy memory, we just walk the list, like in the old days. */ + if (internal_config.legacy_mem) { + return virt2memseg_legacy(addr); + } else { + return virt2memseg(addr, msl); + } +} + /* * Return a pointer to a read-only table of struct rte_physmem_desc @@ -126,7 +301,9 @@ eal_get_virtual_area(void *requested_addr, uint64_t *size, const struct rte_memseg * rte_eal_get_physmem_layout(void) { - return rte_eal_get_configuration()->mem_config->memseg; + struct rte_fbarray *arr; + arr = &rte_eal_get_configuration()->mem_config->memsegs[0].memseg_arr; + return rte_fbarray_get(arr, 0); } @@ -141,11 +318,24 @@ rte_eal_get_physmem_size(void) /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if (mcfg->memseg[i].addr == NULL) - break; + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + + if (msl->memseg_arr.count == 0) + continue; + + /* for legacy mem mode, walk the memsegs */ + if (internal_config.legacy_mem) { + const struct rte_fbarray *arr = &msl->memseg_arr; + int ms_idx = 0; - total_len += mcfg->memseg[i].len; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx) >= 0)) { + const struct rte_memseg *ms = + rte_fbarray_get(arr, ms_idx); + total_len += ms->len; + } + } else + total_len += msl->hugepage_sz * msl->memseg_arr.count; } return total_len; @@ -161,21 +351,29 @@ rte_dump_physmem_layout(FILE *f) /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if (mcfg->memseg[i].addr == NULL) - break; - - fprintf(f, "Segment %u: IOVA:0x%"PRIx64", len:%zu, " - "virt:%p, socket_id:%"PRId32", " - "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", " - "nrank:%"PRIx32"\n", i, - mcfg->memseg[i].iova, - mcfg->memseg[i].len, - mcfg->memseg[i].addr, - mcfg->memseg[i].socket_id, - mcfg->memseg[i].hugepage_sz, - mcfg->memseg[i].nchannel, - mcfg->memseg[i].nrank); + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + const struct rte_fbarray *arr = &msl->memseg_arr; + int m_idx = 0; + + if (arr->count == 0) + continue; + + while ((m_idx = rte_fbarray_find_next_used(arr, m_idx)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, m_idx); + fprintf(f, "Page %u-%u: iova:0x%"PRIx64", len:%zu, " + "virt:%p, socket_id:%"PRId32", " + "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", " + "nrank:%"PRIx32"\n", i, m_idx, + ms->iova, + ms->len, + ms->addr, + ms->socket_id, + ms->hugepage_sz, + ms->nchannel, + ms->nrank); + m_idx++; + } } } @@ -220,9 +418,14 @@ 
rte_mem_lock_page(const void *virt) int rte_eal_memory_init(void) { + int retval; RTE_LOG(DEBUG, EAL, "Setting up physically contiguous memory...\n"); - const int retval = rte_eal_process_type() == RTE_PROC_PRIMARY ? + retval = memseg_init(); + if (retval < 0) + return -1; + + retval = rte_eal_process_type() == RTE_PROC_PRIMARY ? rte_eal_hugepage_init() : rte_eal_hugepage_attach(); if (retval < 0) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index ea072a2..f558ac2 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -254,10 +254,9 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, mz->iova = rte_malloc_virt2iova(mz_addr); mz->addr = mz_addr; mz->len = (requested_len == 0 ? elem->size : requested_len); - mz->hugepage_sz = elem->ms->hugepage_sz; - mz->socket_id = elem->ms->socket_id; + mz->hugepage_sz = elem->msl->hugepage_sz; + mz->socket_id = elem->msl->socket_id; mz->flags = 0; - mz->memseg_id = elem->ms - rte_eal_get_configuration()->mem_config->memseg; return mz; } diff --git a/lib/librte_eal/common/eal_hugepages.h b/lib/librte_eal/common/eal_hugepages.h index 68369f2..cf91009 100644 --- a/lib/librte_eal/common/eal_hugepages.h +++ b/lib/librte_eal/common/eal_hugepages.h @@ -52,6 +52,7 @@ struct hugepage_file { int socket_id; /**< NUMA socket ID */ int file_id; /**< the '%d' in HUGEFILE_FMT */ int memseg_id; /**< the memory segment to which page belongs */ + int memseg_list_id; /**< the memory segment list to which page belongs */ char filepath[MAX_HUGEPAGE_PATH]; /**< path to backing file on filesystem */ }; diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h index b9eee70..c9b57a4 100644 --- a/lib/librte_eal/common/include/rte_eal_memconfig.h +++ b/lib/librte_eal/common/include/rte_eal_memconfig.h @@ -40,12 +40,30 @@ #include #include #include +#include #ifdef __cplusplus extern "C" { #endif /** + * memseg list is a special case as we need to store a bunch of other data + * together with the array itself. + */ +struct rte_memseg_list { + RTE_STD_C11 + union { + void *base_va; + /**< Base virtual address for this memseg list. */ + uint64_t addr_64; + /**< Makes sure addr is always 64-bits */ + }; + int socket_id; /**< Socket ID for all memsegs in this list. */ + uint64_t hugepage_sz; /**< page size for all memsegs in this list. */ + struct rte_fbarray memseg_arr; +}; + +/** * the structure for the memory configuration for the RTE. * Used by the rte_config structure. It is separated out, as for multi-process * support, the memory details should be shared across instances @@ -71,9 +89,11 @@ struct rte_mem_config { uint32_t memzone_cnt; /**< Number of allocated memzones */ /* memory segments and zones */ - struct rte_memseg memseg[RTE_MAX_MEMSEG]; /**< Physmem descriptors. */ struct rte_memzone memzone[RTE_MAX_MEMZONE]; /**< Memzone descriptors. 
*/ + struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS]; + /**< list of dynamic arrays holding memsegs */ + struct rte_tailq_head tailq_head[RTE_MAX_TAILQ]; /**< Tailqs for objects */ /* Heaps of Malloc per socket */ diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h index 14aacea..f005716 100644 --- a/lib/librte_eal/common/include/rte_memory.h +++ b/lib/librte_eal/common/include/rte_memory.h @@ -50,6 +50,9 @@ extern "C" { #include +/* forward declaration for pointers */ +struct rte_memseg_list; + __extension__ enum rte_page_sizes { RTE_PGSIZE_4K = 1ULL << 12, @@ -158,6 +161,19 @@ phys_addr_t rte_mem_virt2phy(const void *virt); rte_iova_t rte_mem_virt2iova(const void *virt); /** + * Get memseg corresponding to virtual memory address. + * + * @param virt + * The virtual address. + * @param msl + * Memseg list in which to look for memsegs (can be NULL). + * @return + * Memseg to which this virtual address belongs to. + */ +const struct rte_memseg *rte_mem_virt2memseg(const void *virt, + const struct rte_memseg_list *msl); + +/** * Get the layout of the available physical memory. * * It can be useful for an application to have the full physical diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 782aaa7..ab09b94 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -54,11 +54,11 @@ * Initialize a general malloc_elem header structure */ void -malloc_elem_init(struct malloc_elem *elem, - struct malloc_heap *heap, const struct rte_memseg *ms, size_t size) +malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap, + const struct rte_memseg_list *msl, size_t size) { elem->heap = heap; - elem->ms = ms; + elem->msl = msl; elem->prev = NULL; elem->next = NULL; memset(&elem->free_list, 0, sizeof(elem->free_list)); @@ -172,7 +172,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt) const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem; const size_t new_elem_size = elem->size - old_elem_size; - malloc_elem_init(split_pt, elem->heap, elem->ms, new_elem_size); + malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size); split_pt->prev = elem; split_pt->next = next_elem; if (next_elem) diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index cf27b59..330bddc 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -34,7 +34,7 @@ #ifndef MALLOC_ELEM_H_ #define MALLOC_ELEM_H_ -#include +#include /* dummy definition of struct so we can use pointers to it in malloc_elem struct */ struct malloc_heap; @@ -50,7 +50,7 @@ struct malloc_elem { struct malloc_elem *volatile prev; /* points to prev elem in memseg */ struct malloc_elem *volatile next; /* points to next elem in memseg */ LIST_ENTRY(malloc_elem) free_list; /* list of free elements in heap */ - const struct rte_memseg *ms; + const struct rte_memseg_list *msl; volatile enum elem_state state; uint32_t pad; size_t size; @@ -137,7 +137,7 @@ malloc_elem_from_data(const void *data) void malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap, - const struct rte_memseg *ms, + const struct rte_memseg_list *msl, size_t size); void diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 1b35468..5fa21fe 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -50,6 +50,7 @@ #include #include +#include "eal_internal_cfg.h" #include 
"malloc_elem.h" #include "malloc_heap.h" @@ -91,22 +92,25 @@ check_hugepage_sz(unsigned flags, uint64_t hugepage_sz) } /* - * Expand the heap with a memseg. - * This reserves the zone and sets a dummy malloc_elem header at the end - * to prevent overflow. The rest of the zone is added to free list as a single - * large free block + * Expand the heap with a memory area. */ -static void -malloc_heap_add_memseg(struct malloc_heap *heap, struct rte_memseg *ms) +static struct malloc_elem * +malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl, + void *start, size_t len) { - struct malloc_elem *start_elem = (struct malloc_elem *)ms->addr; - const size_t elem_size = ms->len - MALLOC_ELEM_OVERHEAD; + struct malloc_elem *elem = start; + + malloc_elem_init(elem, heap, msl, len); + + malloc_elem_insert(elem); + + elem = malloc_elem_join_adjacent_free(elem); - malloc_elem_init(start_elem, heap, ms, elem_size); - malloc_elem_insert(start_elem); - malloc_elem_free_list_insert(start_elem); + malloc_elem_free_list_insert(elem); - heap->total_size += elem_size; + heap->total_size += len; + + return elem; } /* @@ -127,7 +131,7 @@ find_suitable_element(struct malloc_heap *heap, size_t size, for (elem = LIST_FIRST(&heap->free_head[idx]); !!elem; elem = LIST_NEXT(elem, free_list)) { if (malloc_elem_can_hold(elem, size, align, bound)) { - if (check_hugepage_sz(flags, elem->ms->hugepage_sz)) + if (check_hugepage_sz(flags, elem->msl->hugepage_sz)) return elem; if (alt_elem == NULL) alt_elem = elem; @@ -249,16 +253,62 @@ int rte_eal_malloc_heap_init(void) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - unsigned ms_cnt; - struct rte_memseg *ms; + int msl_idx; + struct rte_memseg_list *msl; if (mcfg == NULL) return -1; - for (ms = &mcfg->memseg[0], ms_cnt = 0; - (ms_cnt < RTE_MAX_MEMSEG) && (ms->len > 0); - ms_cnt++, ms++) { - malloc_heap_add_memseg(&mcfg->malloc_heaps[ms->socket_id], ms); + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + int start; + struct rte_fbarray *arr; + struct malloc_heap *heap; + + msl = &mcfg->memsegs[msl_idx]; + arr = &msl->memseg_arr; + heap = &mcfg->malloc_heaps[msl->socket_id]; + + if (arr->capacity == 0) + continue; + + /* for legacy mode, just walk the list */ + if (internal_config.legacy_mem) { + int ms_idx = 0; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx); + malloc_heap_add_memory(heap, msl, ms->addr, ms->len); + ms_idx++; + RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n", + msl->socket_id, ms->len >> 20ULL); + } + continue; + } + + /* find first segment */ + start = rte_fbarray_find_next_used(arr, 0); + + while (start >= 0) { + int contig_segs; + struct rte_memseg *start_seg; + size_t len, hugepage_sz = msl->hugepage_sz; + + /* find how many pages we can lump in together */ + contig_segs = rte_fbarray_find_contig_used(arr, start); + start_seg = rte_fbarray_get(arr, start); + len = contig_segs * hugepage_sz; + + /* + * we've found (hopefully) a bunch of contiguous + * segments, so add them to the heap. 
+ */ + malloc_heap_add_memory(heap, msl, start_seg->addr, len); + + RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n", + msl->socket_id, len >> 20ULL); + + start = rte_fbarray_find_next_used(arr, + start + contig_segs); + } } return 0; diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index 74b5417..92cd7d8 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -251,17 +251,21 @@ rte_malloc_set_limit(__rte_unused const char *type, rte_iova_t rte_malloc_virt2iova(const void *addr) { - rte_iova_t iova; + const struct rte_memseg *ms; const struct malloc_elem *elem = malloc_elem_from_data(addr); + if (elem == NULL) return RTE_BAD_IOVA; - if (elem->ms->iova == RTE_BAD_IOVA) - return RTE_BAD_IOVA; if (rte_eal_iova_mode() == RTE_IOVA_VA) - iova = (uintptr_t)addr; - else - iova = elem->ms->iova + - RTE_PTR_DIFF(addr, elem->ms->addr); - return iova; + return (uintptr_t) addr; + + ms = rte_mem_virt2memseg(addr, elem->msl); + if (ms == NULL) + return RTE_BAD_IOVA; + + if (ms->iova == RTE_BAD_IOVA) + return RTE_BAD_IOVA; + + return ms->iova + RTE_PTR_DIFF(addr, ms->addr); } diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c index 37ae8e0..a27536f 100644 --- a/lib/librte_eal/linuxapp/eal/eal.c +++ b/lib/librte_eal/linuxapp/eal/eal.c @@ -102,8 +102,8 @@ static int mem_cfg_fd = -1; static struct flock wr_lock = { .l_type = F_WRLCK, .l_whence = SEEK_SET, - .l_start = offsetof(struct rte_mem_config, memseg), - .l_len = sizeof(early_mem_config.memseg), + .l_start = offsetof(struct rte_mem_config, memsegs), + .l_len = sizeof(early_mem_config.memsegs), }; /* Address of global and public configuration */ @@ -661,17 +661,20 @@ eal_parse_args(int argc, char **argv) static void eal_check_mem_on_local_socket(void) { - const struct rte_memseg *ms; + const struct rte_memseg_list *msl; int i, socket_id; socket_id = rte_lcore_to_socket_id(rte_config.master_lcore); - ms = rte_eal_get_physmem_layout(); - - for (i = 0; i < RTE_MAX_MEMSEG; i++) - if (ms[i].socket_id == socket_id && - ms[i].len > 0) - return; + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + msl = &rte_eal_get_configuration()->mem_config->memsegs[i]; + if (msl->socket_id != socket_id) + continue; + /* for legacy memory, check if there's anything allocated */ + if (internal_config.legacy_mem && msl->memseg_arr.count == 0) + continue; + return; + } RTE_LOG(WARNING, EAL, "WARNING: Master core has no " "memory on local socket!\n"); diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 5b18af9..59f6889 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -929,6 +929,24 @@ huge_recover_sigbus(void) } } +static struct rte_memseg_list * +get_memseg_list(int socket, uint64_t page_sz) { + struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + int msl_idx; + + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + msl = &mcfg->memsegs[msl_idx]; + if (msl->hugepage_sz != page_sz) + continue; + if (msl->socket_id != socket) + continue; + return msl; + } + return NULL; +} + /* * Prepare physical memory mapping: fill configuration structure with * these infos, return 0 on success. 
@@ -946,11 +964,14 @@ eal_legacy_hugepage_init(void) struct rte_mem_config *mcfg; struct hugepage_file *hugepage = NULL, *tmp_hp = NULL; struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES]; + struct rte_fbarray *arr; + struct rte_memseg *ms; uint64_t memory[RTE_MAX_NUMA_NODES]; unsigned hp_offset; int i, j, new_memseg; + int ms_idx, msl_idx; int nr_hugefiles, nr_hugepages = 0; void *addr; @@ -963,6 +984,9 @@ eal_legacy_hugepage_init(void) /* hugetlbfs can be disabled */ if (internal_config.no_hugetlbfs) { + arr = &mcfg->memsegs[0].memseg_arr; + ms = rte_fbarray_get(arr, 0); + addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (addr == MAP_FAILED) { @@ -970,14 +994,15 @@ eal_legacy_hugepage_init(void) strerror(errno)); return -1; } + rte_fbarray_set_used(arr, 0, true); if (rte_eal_iova_mode() == RTE_IOVA_VA) - mcfg->memseg[0].iova = (uintptr_t)addr; + ms->iova = (uintptr_t)addr; else - mcfg->memseg[0].iova = RTE_BAD_IOVA; - mcfg->memseg[0].addr = addr; - mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K; - mcfg->memseg[0].len = internal_config.memory; - mcfg->memseg[0].socket_id = 0; + ms->iova = RTE_BAD_IOVA; + ms->addr = addr; + ms->hugepage_sz = RTE_PGSIZE_4K; + ms->len = internal_config.memory; + ms->socket_id = 0; return 0; } @@ -1218,27 +1243,59 @@ eal_legacy_hugepage_init(void) #endif if (new_memseg) { - j += 1; - if (j == RTE_MAX_MEMSEG) - break; + struct rte_memseg_list *msl; + int socket; + uint64_t page_sz; - mcfg->memseg[j].iova = hugepage[i].physaddr; - mcfg->memseg[j].addr = hugepage[i].final_va; - mcfg->memseg[j].len = hugepage[i].size; - mcfg->memseg[j].socket_id = hugepage[i].socket_id; - mcfg->memseg[j].hugepage_sz = hugepage[i].size; + socket = hugepage[i].socket_id; + page_sz = hugepage[i].size; + + if (page_sz == 0) + continue; + + /* figure out where to put this memseg */ + msl = get_memseg_list(socket, page_sz); + if (!msl) + rte_panic("Unknown socket or page sz: %i %lx\n", + socket, page_sz); + msl_idx = msl - &mcfg->memsegs[0]; + arr = &msl->memseg_arr; + /* + * we may run out of space, so check if we have enough + * and expand if necessary + */ + if (arr->count >= arr->len) { + int new_len = arr->len * 2; + new_len = RTE_MIN(new_len, arr->capacity); + if (rte_fbarray_resize(arr, new_len)) { + RTE_LOG(ERR, EAL, "Couldn't expand memseg list\n"); + break; + } + } + ms_idx = rte_fbarray_find_next_free(arr, 0); + ms = rte_fbarray_get(arr, ms_idx); + + ms->iova = hugepage[i].physaddr; + ms->addr = hugepage[i].final_va; + ms->len = page_sz; + ms->socket_id = socket; + ms->hugepage_sz = page_sz; + + /* segment may be empty */ + rte_fbarray_set_used(arr, ms_idx, true); } /* continuation of previous memseg */ else { #ifdef RTE_ARCH_PPC_64 /* Use the phy and virt address of the last page as segment * address for IBM Power architecture */ - mcfg->memseg[j].iova = hugepage[i].physaddr; - mcfg->memseg[j].addr = hugepage[i].final_va; + ms->iova = hugepage[i].physaddr; + ms->addr = hugepage[i].final_va; #endif - mcfg->memseg[j].len += mcfg->memseg[j].hugepage_sz; + ms->len += ms->hugepage_sz; } - hugepage[i].memseg_id = j; + hugepage[i].memseg_id = ms_idx; + hugepage[i].memseg_list_id = msl_idx; } if (i < nr_hugefiles) { @@ -1248,7 +1305,7 @@ eal_legacy_hugepage_init(void) "Please either increase it or request less amount " "of memory.\n", i, nr_hugefiles, RTE_STR(CONFIG_RTE_MAX_MEMSEG), - RTE_MAX_MEMSEG); + RTE_MAX_MEMSEG_PER_LIST); goto fail; } @@ -1289,8 +1346,9 @@ eal_legacy_hugepage_attach(void) const struct rte_mem_config *mcfg = 
rte_eal_get_configuration()->mem_config; struct hugepage_file *hp = NULL; unsigned num_hp = 0; - unsigned i, s = 0; /* s used to track the segment number */ - unsigned max_seg = RTE_MAX_MEMSEG; + unsigned i; + int ms_idx, msl_idx; + unsigned cur_seg, max_seg; off_t size = 0; int fd, fd_zero = -1, fd_hugepage = -1; @@ -1315,53 +1373,63 @@ eal_legacy_hugepage_attach(void) } /* map all segments into memory to make sure we get the addrs */ - for (s = 0; s < RTE_MAX_MEMSEG; ++s) { - void *base_addr; + max_seg = 0; + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx]; + const struct rte_fbarray *arr = &msl->memseg_arr; - /* - * the first memory segment with len==0 is the one that - * follows the last valid segment. - */ - if (mcfg->memseg[s].len == 0) - break; + ms_idx = 0; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx); + void *base_addr; - /* - * fdzero is mmapped to get a contiguous block of virtual - * addresses of the appropriate memseg size. - * use mmap to get identical addresses as the primary process. - */ - base_addr = mmap(mcfg->memseg[s].addr, mcfg->memseg[s].len, - PROT_READ, + ms = rte_fbarray_get(arr, ms_idx); + + /* + * the first memory segment with len==0 is the one that + * follows the last valid segment. + */ + if (ms->len == 0) + break; + + /* + * fdzero is mmapped to get a contiguous block of virtual + * addresses of the appropriate memseg size. + * use mmap to get identical addresses as the primary process. + */ + base_addr = mmap(ms->addr, ms->len, + PROT_READ, #ifdef RTE_ARCH_PPC_64 - MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, #else - MAP_PRIVATE, + MAP_PRIVATE, #endif - fd_zero, 0); - if (base_addr == MAP_FAILED || - base_addr != mcfg->memseg[s].addr) { - max_seg = s; - if (base_addr != MAP_FAILED) { - /* errno is stale, don't use */ - RTE_LOG(ERR, EAL, "Could not mmap %llu bytes " - "in /dev/zero at [%p], got [%p] - " - "please use '--base-virtaddr' option\n", - (unsigned long long)mcfg->memseg[s].len, - mcfg->memseg[s].addr, base_addr); - munmap(base_addr, mcfg->memseg[s].len); - } else { - RTE_LOG(ERR, EAL, "Could not mmap %llu bytes " - "in /dev/zero at [%p]: '%s'\n", - (unsigned long long)mcfg->memseg[s].len, - mcfg->memseg[s].addr, strerror(errno)); - } - if (aslr_enabled() > 0) { - RTE_LOG(ERR, EAL, "It is recommended to " - "disable ASLR in the kernel " - "and retry running both primary " - "and secondary processes\n"); + fd_zero, 0); + if (base_addr == MAP_FAILED || base_addr != ms->addr) { + if (base_addr != MAP_FAILED) { + /* errno is stale, don't use */ + RTE_LOG(ERR, EAL, "Could not mmap %llu bytes " + "in /dev/zero at [%p], got [%p] - " + "please use '--base-virtaddr' option\n", + (unsigned long long)ms->len, + ms->addr, base_addr); + munmap(base_addr, ms->len); + } else { + RTE_LOG(ERR, EAL, "Could not mmap %llu bytes " + "in /dev/zero at [%p]: '%s'\n", + (unsigned long long)ms->len, + ms->addr, strerror(errno)); + } + if (aslr_enabled() > 0) { + RTE_LOG(ERR, EAL, "It is recommended to " + "disable ASLR in the kernel " + "and retry running both primary " + "and secondary processes\n"); + } + goto error; } - goto error; + max_seg++; + ms_idx++; } } @@ -1375,46 +1443,54 @@ eal_legacy_hugepage_attach(void) num_hp = size / sizeof(struct hugepage_file); RTE_LOG(DEBUG, EAL, "Analysing %u files\n", num_hp); - s = 0; - while (s < RTE_MAX_MEMSEG && mcfg->memseg[s].len > 0){ 
- void *addr, *base_addr; - uintptr_t offset = 0; - size_t mapping_size; - /* - * free previously mapped memory so we can map the - * hugepages into the space - */ - base_addr = mcfg->memseg[s].addr; - munmap(base_addr, mcfg->memseg[s].len); - - /* find the hugepages for this segment and map them - * we don't need to worry about order, as the server sorted the - * entries before it did the second mmap of them */ - for (i = 0; i < num_hp && offset < mcfg->memseg[s].len; i++){ - if (hp[i].memseg_id == (int)s){ - fd = open(hp[i].filepath, O_RDWR); - if (fd < 0) { - RTE_LOG(ERR, EAL, "Could not open %s\n", - hp[i].filepath); - goto error; - } - mapping_size = hp[i].size; - addr = mmap(RTE_PTR_ADD(base_addr, offset), - mapping_size, PROT_READ | PROT_WRITE, - MAP_SHARED, fd, 0); - close(fd); /* close file both on success and on failure */ - if (addr == MAP_FAILED || - addr != RTE_PTR_ADD(base_addr, offset)) { - RTE_LOG(ERR, EAL, "Could not mmap %s\n", - hp[i].filepath); - goto error; + /* map all segments into memory to make sure we get the addrs */ + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx]; + const struct rte_fbarray *arr = &msl->memseg_arr; + + ms_idx = 0; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx); + void *addr, *base_addr; + uintptr_t offset = 0; + size_t mapping_size; + + ms = rte_fbarray_get(arr, ms_idx); + /* + * free previously mapped memory so we can map the + * hugepages into the space + */ + base_addr = ms->addr; + munmap(base_addr, ms->len); + + /* find the hugepages for this segment and map them + * we don't need to worry about order, as the server sorted the + * entries before it did the second mmap of them */ + for (i = 0; i < num_hp && offset < ms->len; i++){ + if (hp[i].memseg_id == ms_idx && + hp[i].memseg_list_id == msl_idx) { + fd = open(hp[i].filepath, O_RDWR); + if (fd < 0) { + RTE_LOG(ERR, EAL, "Could not open %s\n", + hp[i].filepath); + goto error; + } + mapping_size = hp[i].size; + addr = mmap(RTE_PTR_ADD(base_addr, offset), + mapping_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); + close(fd); /* close file both on success and on failure */ + if (addr == MAP_FAILED || + addr != RTE_PTR_ADD(base_addr, offset)) { + RTE_LOG(ERR, EAL, "Could not mmap %s\n", + hp[i].filepath); + goto error; + } + offset+=mapping_size; } - offset+=mapping_size; } - } - RTE_LOG(DEBUG, EAL, "Mapped segment %u of size 0x%llx\n", s, - (unsigned long long)mcfg->memseg[s].len); - s++; + RTE_LOG(DEBUG, EAL, "Mapped segment of size 0x%llx\n", + (unsigned long long)ms->len); } } /* unmap the hugepage config file, since we are done using it */ munmap(hp, size); @@ -1423,8 +1499,27 @@ eal_legacy_hugepage_attach(void) return 0; error: - for (i = 0; i < max_seg && mcfg->memseg[i].len > 0; i++) - munmap(mcfg->memseg[i].addr, mcfg->memseg[i].len); + /* map all segments into memory to make sure we get the addrs */ + cur_seg = 0; + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[msl_idx]; + const struct rte_fbarray *arr = &msl->memseg_arr; + + if (cur_seg >= max_seg) + break; + + ms_idx = 0; + while ((ms_idx = rte_fbarray_find_next_used(arr, ms_idx)) >= 0) { + struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx); + + if (cur_seg >= max_seg) + break; + ms = rte_fbarray_get(arr, i); + munmap(ms->addr, ms->len); + + cur_seg++; + } + } if (hp != NULL && hp != MAP_FAILED) 
munmap(hp, size); if (fd_zero >= 0) diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c index 58f0123..09dfc68 100644 --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c @@ -696,33 +696,52 @@ vfio_get_group_no(const char *sysfs_base, static int vfio_type1_dma_map(int vfio_container_fd) { - const struct rte_memseg *ms = rte_eal_get_physmem_layout(); int i, ret; /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ - for (i = 0; i < RTE_MAX_MEMSEG; i++) { + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { struct vfio_iommu_type1_dma_map dma_map; + const struct rte_memseg_list *msl; + const struct rte_fbarray *arr; + int ms_idx, next_idx; - if (ms[i].addr == NULL) - break; + msl = &rte_eal_get_configuration()->mem_config->memsegs[i]; + arr = &msl->memseg_arr; - memset(&dma_map, 0, sizeof(dma_map)); - dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); - dma_map.vaddr = ms[i].addr_64; - dma_map.size = ms[i].len; - if (rte_eal_iova_mode() == RTE_IOVA_VA) - dma_map.iova = dma_map.vaddr; - else - dma_map.iova = ms[i].iova; - dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + /* skip empty memseg lists */ + if (arr->count == 0) + continue; - ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); + next_idx = 0; - if (ret) { - RTE_LOG(ERR, EAL, " cannot set up DMA remapping, " - "error %i (%s)\n", errno, - strerror(errno)); - return -1; + // TODO: don't bother with physical addresses? + while ((ms_idx = rte_fbarray_find_next_used(arr, + next_idx) >= 0)) { + uint64_t addr, len, hw_addr; + const struct rte_memseg *ms; + next_idx = ms_idx + 1; + + ms = rte_fbarray_get(arr, ms_idx); + + addr = ms->addr_64; + len = ms->hugepage_sz; + hw_addr = ms->iova; + + memset(&dma_map, 0, sizeof(dma_map)); + dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); + dma_map.vaddr = addr; + dma_map.size = len; + dma_map.iova = hw_addr; + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + + ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); + + if (ret) { + RTE_LOG(ERR, EAL, " cannot set up DMA remapping, " + "error %i (%s)\n", errno, + strerror(errno)); + return -1; + } } } @@ -732,8 +751,8 @@ vfio_type1_dma_map(int vfio_container_fd) static int vfio_spapr_dma_map(int vfio_container_fd) { - const struct rte_memseg *ms = rte_eal_get_physmem_layout(); int i, ret; + uint64_t hugepage_sz = 0; struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg), @@ -767,17 +786,31 @@ vfio_spapr_dma_map(int vfio_container_fd) } /* create DMA window from 0 to max(phys_addr + len) */ - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if (ms[i].addr == NULL) - break; - - create.window_size = RTE_MAX(create.window_size, - ms[i].iova + ms[i].len); + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + const struct rte_fbarray *arr = &msl->memseg_arr; + int idx, next_idx; + + if (msl->base_va == NULL) + continue; + if (msl->memseg_arr.count == 0) + continue; + + next_idx = 0; + while ((idx = rte_fbarray_find_next_used(arr, next_idx)) >= 0) { + const struct rte_memseg *ms = rte_fbarray_get(arr, idx); + hugepage_sz = RTE_MAX(hugepage_sz, ms->hugepage_sz); + create.window_size = RTE_MAX(create.window_size, + ms[i].iova + ms[i].len); + next_idx = idx + 1; + } } /* sPAPR requires window size to be a power of 2 */ create.window_size = rte_align64pow2(create.window_size); - 
create.page_shift = __builtin_ctzll(ms->hugepage_sz); + create.page_shift = __builtin_ctzll(hugepage_sz); create.levels = 1; ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); @@ -793,41 +826,60 @@ vfio_spapr_dma_map(int vfio_container_fd) } /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ - for (i = 0; i < RTE_MAX_MEMSEG; i++) { + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { struct vfio_iommu_type1_dma_map dma_map; + const struct rte_memseg_list *msl; + const struct rte_fbarray *arr; + int ms_idx, next_idx; - if (ms[i].addr == NULL) - break; + msl = &rte_eal_get_configuration()->mem_config->memsegs[i]; + arr = &msl->memseg_arr; - reg.vaddr = (uintptr_t) ms[i].addr; - reg.size = ms[i].len; - ret = ioctl(vfio_container_fd, - VFIO_IOMMU_SPAPR_REGISTER_MEMORY, ®); - if (ret) { - RTE_LOG(ERR, EAL, " cannot register vaddr for IOMMU, " - "error %i (%s)\n", errno, strerror(errno)); - return -1; - } + /* skip empty memseg lists */ + if (arr->count == 0) + continue; - memset(&dma_map, 0, sizeof(dma_map)); - dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); - dma_map.vaddr = ms[i].addr_64; - dma_map.size = ms[i].len; - if (rte_eal_iova_mode() == RTE_IOVA_VA) - dma_map.iova = dma_map.vaddr; - else - dma_map.iova = ms[i].iova; - dma_map.flags = VFIO_DMA_MAP_FLAG_READ | - VFIO_DMA_MAP_FLAG_WRITE; + next_idx = 0; - ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); + while ((ms_idx = rte_fbarray_find_next_used(arr, + next_idx) >= 0)) { + uint64_t addr, len, hw_addr; + const struct rte_memseg *ms; + next_idx = ms_idx + 1; - if (ret) { - RTE_LOG(ERR, EAL, " cannot set up DMA remapping, " - "error %i (%s)\n", errno, strerror(errno)); - return -1; - } + ms = rte_fbarray_get(arr, ms_idx); + + addr = ms->addr_64; + len = ms->hugepage_sz; + hw_addr = ms->iova; + reg.vaddr = (uintptr_t) addr; + reg.size = len; + ret = ioctl(vfio_container_fd, + VFIO_IOMMU_SPAPR_REGISTER_MEMORY, ®); + if (ret) { + RTE_LOG(ERR, EAL, " cannot register vaddr for IOMMU, error %i (%s)\n", + errno, strerror(errno)); + return -1; + } + + memset(&dma_map, 0, sizeof(dma_map)); + dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); + dma_map.vaddr = addr; + dma_map.size = len; + dma_map.iova = hw_addr; + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | + VFIO_DMA_MAP_FLAG_WRITE; + + ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); + + if (ret) { + RTE_LOG(ERR, EAL, " cannot set up DMA remapping, " + "error %i (%s)\n", errno, + strerror(errno)); + return -1; + } + } } return 0; diff --git a/test/test/test_malloc.c b/test/test/test_malloc.c index 4572caf..ae24c33 100644 --- a/test/test/test_malloc.c +++ b/test/test/test_malloc.c @@ -41,6 +41,7 @@ #include #include +#include #include #include #include @@ -734,15 +735,23 @@ test_malloc_bad_params(void) return -1; } -/* Check if memory is available on a specific socket */ +/* Check if memory is avilable on a specific socket */ static int is_mem_on_socket(int32_t socket) { - const struct rte_memseg *ms = rte_eal_get_physmem_layout(); + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; unsigned i; - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if (socket == ms[i].socket_id) + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = + &mcfg->memsegs[i]; + const struct rte_fbarray *arr = &msl->memseg_arr; + + if (msl->socket_id != socket) + continue; + + if (arr->count) return 1; } return 0; @@ -755,16 +764,8 @@ is_mem_on_socket(int32_t socket) static int32_t addr_to_socket(void * addr) 
{ - const struct rte_memseg *ms = rte_eal_get_physmem_layout(); - unsigned i; - - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if ((ms[i].addr <= addr) && - ((uintptr_t)addr < - ((uintptr_t)ms[i].addr + (uintptr_t)ms[i].len))) - return ms[i].socket_id; - } - return -1; + const struct rte_memseg *ms = rte_mem_virt2memseg(addr, NULL); + return ms == NULL ? -1 : ms->socket_id; } /* Test using rte_[c|m|zm]alloc_socket() on a specific socket */ diff --git a/test/test/test_memory.c b/test/test/test_memory.c index 921bdc8..0d877c8 100644 --- a/test/test/test_memory.c +++ b/test/test/test_memory.c @@ -34,8 +34,11 @@ #include #include +#include +#include #include #include +#include #include "test.h" @@ -54,10 +57,12 @@ static int test_memory(void) { + const struct rte_memzone *mz = NULL; uint64_t s; unsigned i; size_t j; - const struct rte_memseg *mem; + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; /* * dump the mapped memory: the python-expect script checks @@ -69,20 +74,43 @@ test_memory(void) /* check that memory size is != 0 */ s = rte_eal_get_physmem_size(); if (s == 0) { - printf("No memory detected\n"); - return -1; + printf("No memory detected, attempting to allocate\n"); + mz = rte_memzone_reserve("tmp", 1000, SOCKET_ID_ANY, 0); + + if (!mz) { + printf("Failed to allocate a memzone\n"); + return -1; + } } /* try to read memory (should not segfault) */ - mem = rte_eal_get_physmem_layout(); - for (i = 0; i < RTE_MAX_MEMSEG && mem[i].addr != NULL ; i++) { + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + const struct rte_fbarray *arr = &msl->memseg_arr; + int search_idx, cur_idx; + + if (arr->count == 0) + continue; + + search_idx = 0; - /* check memory */ - for (j = 0; j= 0) { + const struct rte_memseg *ms; + + ms = rte_fbarray_get(arr, cur_idx); + + /* check memory */ + for (j = 0; j < ms->len; j++) { + *((volatile uint8_t *) ms->addr + j); + } + search_idx = cur_idx + 1; } } + if (mz) + rte_memzone_free(mz); + return 0; } diff --git a/test/test/test_memzone.c b/test/test/test_memzone.c index 1cf235a..47af721 100644 --- a/test/test/test_memzone.c +++ b/test/test/test_memzone.c @@ -132,22 +132,25 @@ static int test_memzone_reserve_flags(void) { const struct rte_memzone *mz; - const struct rte_memseg *ms; int hugepage_2MB_avail = 0; int hugepage_1GB_avail = 0; int hugepage_16MB_avail = 0; int hugepage_16GB_avail = 0; const size_t size = 100; int i = 0; - ms = rte_eal_get_physmem_layout(); - for (i = 0; i < RTE_MAX_MEMSEG; i++) { - if (ms[i].hugepage_sz == RTE_PGSIZE_2M) + + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl = &mcfg->memsegs[i]; + + if (msl->hugepage_sz == RTE_PGSIZE_2M) hugepage_2MB_avail = 1; - if (ms[i].hugepage_sz == RTE_PGSIZE_1G) + if (msl->hugepage_sz == RTE_PGSIZE_1G) hugepage_1GB_avail = 1; - if (ms[i].hugepage_sz == RTE_PGSIZE_16M) + if (msl->hugepage_sz == RTE_PGSIZE_16M) hugepage_16MB_avail = 1; - if (ms[i].hugepage_sz == RTE_PGSIZE_16G) + if (msl->hugepage_sz == RTE_PGSIZE_16G) hugepage_16GB_avail = 1; } /* Display the availability of 2MB ,1GB, 16MB, 16GB pages */ From patchwork Tue Dec 19 11:14:39 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32461 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost 
[127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 6927B1B1BD; Tue, 19 Dec 2017 12:15:14 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id D9DAB1B01C for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3023084" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga002.fm.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqFG003120; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqUd010264; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEqE7010259; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:39 +0000 Message-Id: <5db1054689acb30ec0139f06dd2c43845dea68cd.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 12/23] eal: add support for dynamic memory allocation X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Nothing uses that code yet. The bulk of it is copied from old memory allocation stuff (eal_memory.c). We provide an API to allocate either one page or multiple pages, guaranteeing that we'll get contiguous VA for all of the pages that we requested. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_memalloc.h | 47 ++++ lib/librte_eal/linuxapp/eal/Makefile | 2 + lib/librte_eal/linuxapp/eal/eal_memalloc.c | 416 +++++++++++++++++++++++++++++ 3 files changed, 465 insertions(+) create mode 100755 lib/librte_eal/common/eal_memalloc.h create mode 100755 lib/librte_eal/linuxapp/eal/eal_memalloc.c diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h new file mode 100755 index 0000000..59fd330 --- /dev/null +++ b/lib/librte_eal/common/eal_memalloc.h @@ -0,0 +1,47 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef EAL_MEMALLOC_H +#define EAL_MEMALLOC_H + +#include + +#include + +struct rte_memseg * +eal_memalloc_alloc_page(uint64_t size, int socket); + +int +eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, + int socket, bool exact); + +#endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile index 782e1ad..88f10e9 100644 --- a/lib/librte_eal/linuxapp/eal/Makefile +++ b/lib/librte_eal/linuxapp/eal/Makefile @@ -62,6 +62,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_thread.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_log.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vfio_mp_sync.c +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_debug.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_lcore.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_timer.c @@ -105,6 +106,7 @@ CFLAGS_eal_interrupts.o := -D_GNU_SOURCE CFLAGS_eal_vfio_mp_sync.o := -D_GNU_SOURCE CFLAGS_eal_timer.o := -D_GNU_SOURCE CFLAGS_eal_lcore.o := -D_GNU_SOURCE +CFLAGS_eal_memalloc.o := -D_GNU_SOURCE CFLAGS_eal_thread.o := -D_GNU_SOURCE CFLAGS_eal_log.o := -D_GNU_SOURCE CFLAGS_eal_common_log.o := -D_GNU_SOURCE diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c new file mode 100755 index 0000000..527c2f6 --- /dev/null +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -0,0 +1,416 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#define _FILE_OFFSET_BITS 64 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES +#include +#include +#endif + +#include +#include +#include +#include +#include + +#include "eal_filesystem.h" +#include "eal_internal_cfg.h" +#include "eal_memalloc.h" + +static sigjmp_buf huge_jmpenv; + +static void __rte_unused huge_sigbus_handler(int signo __rte_unused) +{ + siglongjmp(huge_jmpenv, 1); +} + +/* Put setjmp into a wrap method to avoid compiling error. Any non-volatile, + * non-static local variable in the stack frame calling sigsetjmp might be + * clobbered by a call to longjmp. + */ +static int __rte_unused huge_wrap_sigsetjmp(void) +{ + return sigsetjmp(huge_jmpenv, 1); +} + +static struct sigaction huge_action_old; +static int huge_need_recover; + +static void __rte_unused +huge_register_sigbus(void) +{ + sigset_t mask; + struct sigaction action; + + sigemptyset(&mask); + sigaddset(&mask, SIGBUS); + action.sa_flags = 0; + action.sa_mask = mask; + action.sa_handler = huge_sigbus_handler; + + huge_need_recover = !sigaction(SIGBUS, &action, &huge_action_old); +} + +static void __rte_unused +huge_recover_sigbus(void) +{ + if (huge_need_recover) { + sigaction(SIGBUS, &huge_action_old, NULL); + huge_need_recover = 0; + } +} + +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES +static bool +prepare_numa(int *oldpolicy, struct bitmask *oldmask, int socket_id) { + bool have_numa = true; + + /* Check if kernel supports NUMA. */ + if (numa_available() != 0) { + RTE_LOG(DEBUG, EAL, "NUMA is not supported.\n"); + have_numa = false; + } + + if (have_numa) { + RTE_LOG(DEBUG, EAL, "Trying to obtain current memory policy.\n"); + if (get_mempolicy(oldpolicy, oldmask->maskp, + oldmask->size + 1, 0, 0) < 0) { + RTE_LOG(ERR, EAL, + "Failed to get current mempolicy: %s. 
" + "Assuming MPOL_DEFAULT.\n", strerror(errno)); + oldpolicy = MPOL_DEFAULT; + } + RTE_LOG(DEBUG, EAL, + "Setting policy MPOL_PREFERRED for socket %d\n", + socket_id); + numa_set_preferred(socket_id); + } + return have_numa; +} + +static void +resotre_numa(int *oldpolicy, struct bitmask *oldmask) { + RTE_LOG(DEBUG, EAL, + "Restoring previous memory policy: %d\n", *oldpolicy); + if (oldpolicy == MPOL_DEFAULT) { + numa_set_localalloc(); + } else if (set_mempolicy(*oldpolicy, oldmask->maskp, + oldmask->size + 1) < 0) { + RTE_LOG(ERR, EAL, "Failed to restore mempolicy: %s\n", + strerror(errno)); + numa_set_localalloc(); + } + numa_free_cpumask(oldmask); +} +#endif + +static int +alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, + struct hugepage_info *hi, unsigned list_idx, unsigned seg_idx) { + int cur_socket_id = 0; + uint64_t fa_offset; + char path[PATH_MAX]; + int ret = 0; + + if (internal_config.single_file_segments) { + eal_get_hugefile_path(path, sizeof(path), hi->hugedir, list_idx); + } else { + eal_get_hugefile_path(path, sizeof(path), hi->hugedir, + list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); + } + + /* try to create hugepage file */ + int fd = open(path, O_CREAT | O_RDWR, 0600); + if (fd < 0) { + RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__, + strerror(errno)); + goto fname; + } + if (internal_config.single_file_segments) { + fa_offset = seg_idx * size; + if (fallocate(fd, 0, fa_offset, size)) { + RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n", + __func__, strerror(errno)); + goto opened; + } + } else { + if (ftruncate(fd, size) < 0) { + RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n", + __func__, strerror(errno)); + goto opened; + } + fa_offset = 0; + } + + /* map the segment, and populate page tables, + * the kernel fills this segment with zeros */ + void *va = mmap(addr, size, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd, fa_offset); + if (va == MAP_FAILED) { + RTE_LOG(DEBUG, EAL, "%s(): mmap() failed: %s\n", __func__, + strerror(errno)); + goto resized; + } + if (va != addr) { + RTE_LOG(DEBUG, EAL, "%s(): wrong mmap() address\n", __func__); + goto mapped; + } + + rte_iova_t iova = rte_mem_virt2iova(addr); + if (iova == RTE_BAD_PHYS_ADDR) { + RTE_LOG(DEBUG, EAL, "%s(): can't get IOVA addr\n", + __func__); + goto mapped; + } + +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES + move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0); + + if (cur_socket_id != socket_id) { + RTE_LOG(DEBUG, EAL, + "%s(): allocation happened on wrong socket (wanted %d, got %d)\n", + __func__, socket_id, cur_socket_id); + goto mapped; + } +#endif + + /* In linux, hugetlb limitations, like cgroup, are + * enforced at fault time instead of mmap(), even + * with the option of MAP_POPULATE. Kernel will send + * a SIGBUS signal. To avoid to be killed, save stack + * environment here, if SIGBUS happens, we can jump + * back here. 
+ */ + if (huge_wrap_sigsetjmp()) { + RTE_LOG(DEBUG, EAL, "SIGBUS: Cannot mmap more hugepages of size %uMB\n", + (unsigned)(size / 0x100000)); + goto mapped; + } + *(int *)addr = *(int *) addr; + + close(fd); + + ms->addr = addr; + ms->hugepage_sz = size; + ms->len = size; + ms->nchannel = rte_memory_get_nchannel(); + ms->nrank = rte_memory_get_nrank(); + ms->iova = iova; + ms->socket_id = socket_id; + + goto out; + +mapped: + munmap(addr, size); +resized: + if (internal_config.single_file_segments) + fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + fa_offset, size); + else { + unlink(path); + } +opened: + close(fd); +fname: + /* anything but goto out is an error */ + ret = -1; +out: + return ret; +} + +int +eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, + uint64_t size, int socket, bool exact) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl = NULL; + void *addr; + unsigned msl_idx; + int cur_idx, next_idx, end_idx, i, ret = 0; +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES + bool have_numa; + int oldpolicy; + struct bitmask *oldmask = numa_allocate_nodemask(); +#endif + struct hugepage_info *hi = NULL; + + /* dynamic allocation not supported in legacy mode */ + if (internal_config.legacy_mem) + return -1; + + for (i = 0; i < (int) RTE_DIM(internal_config.hugepage_info); i++) { + if (size == + internal_config.hugepage_info[i].hugepage_sz) { + hi = &internal_config.hugepage_info[i]; + break; + } + } + if (!hi) { + RTE_LOG(ERR, EAL, "%s(): can't find relevant hugepage_info entry\n", + __func__); + return -1; + } + + /* find our memseg list */ + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + struct rte_memseg_list *cur_msl = &mcfg->memsegs[msl_idx]; + + if (cur_msl->hugepage_sz != size) { + continue; + } + if (cur_msl->socket_id != socket) { + continue; + } + msl = cur_msl; + break; + } + if (!msl) { + RTE_LOG(ERR, EAL, "%s(): couldn't find suitable memseg_list\n", + __func__); + return -1; + } + + /* first, try finding space in already existing list */ + cur_idx = rte_fbarray_find_next_n_free(&msl->memseg_arr, 0, n); + + if (cur_idx < 0) { + int old_len = msl->memseg_arr.len; + int space = 0; + int new_len = old_len; + + /* grow new len until we can either fit n or can't grow */ + while (new_len < msl->memseg_arr.capacity && + (space < n)) { + new_len = RTE_MIN(new_len * 2, msl->memseg_arr.capacity); + space = new_len - old_len; + } + + /* check if we can expand the list */ + if (old_len == new_len) { + /* can't expand, the list is full */ + RTE_LOG(ERR, EAL, "%s(): no space in memseg list\n", + __func__); + return -1; + } + + if (rte_fbarray_resize(&msl->memseg_arr, new_len)) { + RTE_LOG(ERR, EAL, "%s(): can't resize memseg list\n", + __func__); + return -1; + } + + /* + * we could conceivably end up with free space at the end of the + * list that wasn't enough to cover everything but can cover + * some of it, so start at (old_len - n) if possible. 
+ */ + next_idx = RTE_MAX(0, old_len - n); + + cur_idx = rte_fbarray_find_next_n_free(&msl->memseg_arr, + next_idx, n); + + if (cur_idx < 0) { + /* still no space, bail out */ + RTE_LOG(ERR, EAL, "%s(): no space in memseg list\n", + __func__); + return -1; + } + } + + end_idx = cur_idx + n; + +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES + have_numa = prepare_numa(&oldpolicy, oldmask, socket); +#endif + + for (i = 0; cur_idx < end_idx; cur_idx++, i++) { + struct rte_memseg *cur; + + cur = rte_fbarray_get(&msl->memseg_arr, cur_idx); + addr = RTE_PTR_ADD(msl->base_va, + cur_idx * msl->hugepage_sz); + + if (alloc_page(cur, addr, size, socket, hi, msl_idx, cur_idx)) { + RTE_LOG(DEBUG, EAL, "attempted to allocate %i pages, but only %i were allocated\n", + n, i); + + /* if exact number of pages wasn't requested, stop */ + if (!exact) { + ret = i; + goto restore_numa; + } + if (ms) + memset(ms, 0, sizeof(struct rte_memseg*) * n); + ret = -1; + goto restore_numa; + } + if (ms) + ms[i] = cur; + + rte_fbarray_set_used(&msl->memseg_arr, cur_idx, true); + } + ret = n; + +restore_numa: +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES + if (have_numa) + resotre_numa(&oldpolicy, oldmask); +#endif + return ret; +} + +struct rte_memseg * +eal_memalloc_alloc_page(uint64_t size, int socket) { + struct rte_memseg *ms; + if (eal_memalloc_alloc_page_bulk(&ms, 1, size, socket, true) < 0) + return NULL; + return ms; +} From patchwork Tue Dec 19 11:14:40 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32463 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 399A91B1C9; Tue, 19 Dec 2017 12:15:16 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 0A9811B01E for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="13553690" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga003.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqb5003123; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqnG010271; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEqli010267; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:40 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 13/23] eal: make use of dynamic memory allocation for init X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Add a new (non-legacy) memory init 
path for EAL. It uses the new dynamic allocation facilities, although it's only being run at startup. If no -m or --socket-mem switches were specified, the new init will not allocate anything, whereas if those switches were passed, appropriate amounts of pages would be requested, just like for legacy init. Since rte_malloc support for dynamic allocation comes in later patches, running DPDK without --socket-mem or -m switches will fail in this patch. Also, allocated pages will be physically discontiguous (or rather, they're not guaranteed to be physically contiguous - they may still be, by accident) unless IOVA_AS_VA mode is used. Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_memory.c | 60 ++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index 59f6889..7cc4a55 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -68,6 +68,7 @@ #include #include "eal_private.h" +#include "eal_memalloc.h" #include "eal_internal_cfg.h" #include "eal_filesystem.h" #include "eal_hugepages.h" @@ -1322,6 +1323,61 @@ eal_legacy_hugepage_init(void) return -1; } +static int +eal_hugepage_init(void) { + struct hugepage_info used_hp[MAX_HUGEPAGE_SIZES]; + uint64_t memory[RTE_MAX_NUMA_NODES]; + int hp_sz_idx, socket_id; + + test_phys_addrs_available(); + + memset(used_hp, 0, sizeof(used_hp)); + + for (hp_sz_idx = 0; + hp_sz_idx < (int) internal_config.num_hugepage_sizes; + hp_sz_idx++) { + /* meanwhile, also initialize used_hp hugepage sizes in used_hp */ + struct hugepage_info *hpi; + hpi = &internal_config.hugepage_info[hp_sz_idx]; + used_hp[hp_sz_idx].hugepage_sz = hpi->hugepage_sz; + } + + /* make a copy of socket_mem, needed for balanced allocation. 
*/ + for (hp_sz_idx = 0; hp_sz_idx < RTE_MAX_NUMA_NODES; hp_sz_idx++) + memory[hp_sz_idx] = internal_config.socket_mem[hp_sz_idx]; + + /* calculate final number of pages */ + if (calc_num_pages_per_socket(memory, + internal_config.hugepage_info, used_hp, + internal_config.num_hugepage_sizes) < 0) + return -1; + + for (int hp_sz_idx = 0; + hp_sz_idx < (int) internal_config.num_hugepage_sizes; + hp_sz_idx++) { + for (socket_id = 0; socket_id < RTE_MAX_NUMA_NODES; + socket_id++) { + struct hugepage_info *hpi = &used_hp[hp_sz_idx]; + unsigned num_pages = hpi->num_pages[socket_id]; + int num_pages_alloc; + + if (num_pages == 0) + continue; + + RTE_LOG(DEBUG, EAL, "Allocating %u pages of size %luM on socket %i\n", + num_pages, hpi->hugepage_sz >> 20, socket_id); + + num_pages_alloc = eal_memalloc_alloc_page_bulk(NULL, + num_pages, + hpi->hugepage_sz, socket_id, + true); + if (num_pages_alloc < 0) + return -1; + } + } + return 0; +} + /* * uses fstat to report the size of a file on disk */ @@ -1533,6 +1589,8 @@ int rte_eal_hugepage_init(void) { if (internal_config.legacy_mem) return eal_legacy_hugepage_init(); + else + return eal_hugepage_init(); return -1; } @@ -1540,6 +1598,8 @@ int rte_eal_hugepage_attach(void) { if (internal_config.legacy_mem) return eal_legacy_hugepage_attach(); + else + RTE_LOG(ERR, EAL, "Secondary processes aren't supported yet\n"); return -1; } From patchwork Tue Dec 19 11:14:41 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32460 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 4131A1B1B8; Tue, 19 Dec 2017 12:15:13 +0100 (CET) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by dpdk.org (Postfix) with ESMTP id C2D831B021 for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga101.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="13553692" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga003.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEqtv003126; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEq7c010280; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEq1i010276; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:41 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 14/23] eal: add support for dynamic unmapping of pages X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This isn't used anywhere yet, but the support is 
now there. Also, adding cleanup to allocation procedures, so that if we fail to allocate everything we asked for, we can free all of it back. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_memalloc.h | 3 + lib/librte_eal/linuxapp/eal/eal_memalloc.c | 131 ++++++++++++++++++++++++++++- 2 files changed, 133 insertions(+), 1 deletion(-) diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index 59fd330..47e4367 100755 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -44,4 +44,7 @@ int eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int socket, bool exact); +int +eal_memalloc_free_page(struct rte_memseg *ms); + #endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 527c2f6..13172a0 100755 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -109,6 +109,18 @@ huge_recover_sigbus(void) } } +/* + * uses fstat to report the size of a file on disk + */ +static bool +is_zero_length(int fd) +{ + struct stat st; + if (fstat(fd, &st) < 0) + return false; + return st.st_blocks == 0; +} + #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES static bool prepare_numa(int *oldpolicy, struct bitmask *oldmask, int socket_id) { @@ -267,6 +279,61 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, return ret; } +static int +free_page(struct rte_memseg *ms, struct hugepage_info *hi, unsigned list_idx, + unsigned seg_idx) { + uint64_t fa_offset; + char path[PATH_MAX]; + int fd; + + fa_offset = seg_idx * ms->hugepage_sz; + + if (internal_config.single_file_segments) { + eal_get_hugefile_path(path, sizeof(path), hi->hugedir, list_idx); + } else { + eal_get_hugefile_path(path, sizeof(path), hi->hugedir, + list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); + } + + munmap(ms->addr, ms->hugepage_sz); + + // TODO: race condition? 
+ + if (mmap(ms->addr, ms->hugepage_sz, PROT_READ, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == + MAP_FAILED) { + RTE_LOG(DEBUG, EAL, "couldn't unmap page\n"); + return -1; + } + + if (internal_config.single_file_segments) { + /* now, truncate or remove the original file */ + fd = open(path, O_RDWR, 0600); + if (fd < 0) { + RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__, + strerror(errno)); + // TODO: proper error handling + return -1; + } + + if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + fa_offset, ms->hugepage_sz)) { + RTE_LOG(DEBUG, EAL, "Page deallocation failed: %s\n", + strerror(errno)); + } + if (is_zero_length(fd)) { + unlink(path); + } + close(fd); + } else { + unlink(path); + } + + memset(ms, 0, sizeof(*ms)); + + return 0; +} + int eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int socket, bool exact) { @@ -274,7 +341,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, struct rte_memseg_list *msl = NULL; void *addr; unsigned msl_idx; - int cur_idx, next_idx, end_idx, i, ret = 0; + int cur_idx, next_idx, start_idx, end_idx, i, j, ret = 0; #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES bool have_numa; int oldpolicy; @@ -366,6 +433,7 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, } end_idx = cur_idx + n; + start_idx = cur_idx; #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES have_numa = prepare_numa(&oldpolicy, oldmask, socket); @@ -387,6 +455,20 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, ret = i; goto restore_numa; } + RTE_LOG(DEBUG, EAL, "exact amount of pages was requested, so returning %i allocated pages\n", + i); + + /* clean up */ + for (j = start_idx; j < cur_idx; j++) { + struct rte_memseg *tmp; + struct rte_fbarray *arr = &msl->memseg_arr; + + tmp = rte_fbarray_get(arr, j); + if (free_page(tmp, hi, msl_idx, start_idx + j)) + rte_panic("Cannot free page\n"); + + rte_fbarray_set_used(arr, j, false); + } if (ms) memset(ms, 0, sizeof(struct rte_memseg*) * n); ret = -1; @@ -414,3 +496,50 @@ eal_memalloc_alloc_page(uint64_t size, int socket) { return NULL; return ms; } + +int +eal_memalloc_free_page(struct rte_memseg *ms) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl = NULL; + unsigned msl_idx, seg_idx; + struct hugepage_info *hi = NULL; + + /* dynamic free not supported in legacy mode */ + if (internal_config.legacy_mem) + return -1; + + for (int i = 0; i < (int) RTE_DIM(internal_config.hugepage_info); i++) { + if (ms->hugepage_sz == + internal_config.hugepage_info[i].hugepage_sz) { + hi = &internal_config.hugepage_info[i]; + break; + } + } + if (!hi) { + RTE_LOG(ERR, EAL, "Can't find relevant hugepage_info entry\n"); + return -1; + } + + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + uintptr_t start_addr, end_addr; + struct rte_memseg_list *cur = &mcfg->memsegs[msl_idx]; + + start_addr = (uintptr_t) cur->base_va; + end_addr = start_addr + + cur->memseg_arr.capacity * cur->hugepage_sz; + + if ((uintptr_t) ms->addr < start_addr || + (uintptr_t) ms->addr >= end_addr) { + continue; + } + msl = cur; + seg_idx = RTE_PTR_DIFF(ms->addr, start_addr) / ms->hugepage_sz; + break; + } + if (!msl) { + RTE_LOG(ERR, EAL, "Couldn't find memseg list\n"); + return -1; + } + rte_fbarray_set_used(&msl->memseg_arr, seg_idx, false); + return free_page(ms, hi, msl_idx, seg_idx); +} From patchwork Tue Dec 19 11:14:42 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: 
"Burakov, Anatoly" X-Patchwork-Id: 32459 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 1D61C1B1B1; Tue, 19 Dec 2017 12:15:12 +0100 (CET) Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id E226B1B022 for ; Tue, 19 Dec 2017 12:14:55 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="19495274" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga002.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEq4n003132; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqO8010288; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEq8o010283; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:42 +0000 Message-Id: <23b3239ed0ad52d78d2c3a1fdb8dd19be69ecd51.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 15/23] eal: add API to check if memory is physically contiguous X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This will be helpful down the line when we implement support for allocating physically contiguous memory. We can no longer guarantee physically contiguous memory unless we're in IOVA_AS_VA mode, but we can certainly try and see if we succeed. In addition, this would be useful for e.g. PMD's who may allocate chunks that are smaller than the pagesize, but they must not cross the page boundary, in which case we will be able to accommodate that request. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memalloc.c | 79 +++++++++++++++++++++++++++++ lib/librte_eal/common/eal_memalloc.h | 5 ++ lib/librte_eal/linuxapp/eal/Makefile | 1 + 3 files changed, 85 insertions(+) create mode 100755 lib/librte_eal/common/eal_common_memalloc.c diff --git a/lib/librte_eal/common/eal_common_memalloc.c b/lib/librte_eal/common/eal_common_memalloc.c new file mode 100755 index 0000000..395753a --- /dev/null +++ b/lib/librte_eal/common/eal_common_memalloc.c @@ -0,0 +1,79 @@ +/*- + * BSD LICENSE + * + * Copyright(c) 2017 Intel Corporation. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. 
+ * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * * Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#include +#include +#include +#include +#include + +#include "eal_private.h" +#include "eal_internal_cfg.h" +#include "eal_memalloc.h" + +// TODO: secondary +// TODO: 32-bit + +bool +eal_memalloc_is_contig(const struct rte_memseg_list *msl, void *start, + size_t len) { + const struct rte_memseg *ms; + uint64_t page_sz; + void *end; + int start_page, end_page, cur_page; + rte_iova_t expected; + + /* for legacy memory, it's always contiguous */ + if (internal_config.legacy_mem) + return true; + + /* figure out how many pages we need to fit in current data */ + page_sz = msl->hugepage_sz; + end = RTE_PTR_ADD(start, len); + + start_page = RTE_PTR_DIFF(start, msl->base_va) / page_sz; + end_page = RTE_PTR_DIFF(end, msl->base_va) / page_sz; + + /* now, look for contiguous memory */ + ms = rte_fbarray_get(&msl->memseg_arr, start_page); + expected = ms->iova + page_sz; + + for (cur_page = start_page + 1; cur_page < end_page; + cur_page++, expected += page_sz) { + ms = rte_fbarray_get(&msl->memseg_arr, cur_page); + + if (ms->iova != expected) + return false; + } + + return true; +} diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index 47e4367..04f9b72 100755 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -36,6 +36,7 @@ #include #include +#include struct rte_memseg * eal_memalloc_alloc_page(uint64_t size, int socket); @@ -47,4 +48,8 @@ eal_memalloc_alloc_page_bulk(struct rte_memseg **ms, int n, uint64_t size, int eal_memalloc_free_page(struct rte_memseg *ms); +bool +eal_memalloc_is_contig(const struct rte_memseg_list *msl, void *start, + size_t len); + #endif // EAL_MEMALLOC_H diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile index 88f10e9..c1fc557 100644 --- a/lib/librte_eal/linuxapp/eal/Makefile +++ b/lib/librte_eal/linuxapp/eal/Makefile @@ -75,6 +75,7 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_timer.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memzone.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_log.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_launch.c +SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memalloc.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_memory.c SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_tailqs.c 
SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_errno.c From patchwork Tue Dec 19 11:14:43 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32469 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 703741B20D; Tue, 19 Dec 2017 12:15:26 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id BF8F81B01C for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="185440369" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga005.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEq5l003135; Tue, 19 Dec 2017 11:14:52 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEq7I010295; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEq8M010291; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:43 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 16/23] eal: enable dynamic memory allocation/free on malloc/free X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This set of changes enables rte_malloc to allocate and free memory as needed. The way it works is, first malloc checks if there is enough memory already allocated to satisfy user's request. If there isn't, we try and allocate more memory. The reverse happens with free - we free an element, check its size (including free element merging due to adjacency) and see if it's bigger than hugepage size and that its start and end span a hugepage or more. Then we remove the area from malloc heap (adjusting element lengths where appropriate), and deallocate the page. For legacy mode, dynamic alloc/free is disabled. It is worth noting that memseg lists are being sorted by page size, and that we try our best to satisfy user's request. That is, if the user requests an element from a 2MB page memory, we will check if we can satisfy that request from existing memory, if not we try and allocate more 2MB pages. If that fails and user also specified a "size is hint" flag, we then check other page sizes and try to allocate from there. If that fails too, then, depending on flags, we may try allocating from other sockets. In other words, we try our best to give the user what they asked for, but going to other sockets is last resort - first we try to allocate more memory on the same socket. 
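For illustration, a minimal caller-side sketch of what this means for applications (not part of this patch set; it assumes a standard EAL-linked program, rte_socket_id() from rte_lcore.h, and the rte_malloc_socket()/rte_free() prototypes as they appear in this series):

/* Illustrative sketch only: with dynamic allocation enabled,
 * rte_malloc_socket() first tries existing heap memory on the given
 * socket, then asks the allocator to map more hugepages; rte_free()
 * may return whole hugepages to the system when a freed element
 * spans at least one full page. Neither happens in --legacy-mem mode. */
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_malloc.h>

int main(int argc, char **argv)
{
        if (rte_eal_init(argc, argv) < 0)
                return 1;

        /* 16MB on the caller's NUMA socket; if the heap cannot satisfy
         * this from already-mapped memory, it is expanded on demand. */
        void *buf = rte_malloc_socket("example", 16 << 20, 0,
                        rte_socket_id());
        if (buf == NULL) {
                printf("allocation failed\n");
                return 1;
        }

        /* Freeing a sufficiently large element may shrink the heap and
         * hand hugepages back to the system. */
        rte_free(buf);
        return 0;
}
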
Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 19 +- lib/librte_eal/common/malloc_elem.c | 96 +++++++++- lib/librte_eal/common/malloc_elem.h | 8 +- lib/librte_eal/common/malloc_heap.c | 280 +++++++++++++++++++++++++++-- lib/librte_eal/common/malloc_heap.h | 4 +- lib/librte_eal/common/rte_malloc.c | 24 +-- 6 files changed, 373 insertions(+), 58 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index f558ac2..c571145 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -132,7 +132,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, struct rte_memzone *mz; struct rte_mem_config *mcfg; size_t requested_len; - int socket, i; + int socket; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; @@ -216,21 +216,8 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, socket = socket_id; /* allocate memory on heap */ - void *mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[socket], NULL, - requested_len, flags, align, bound); - - if ((mz_addr == NULL) && (socket_id == SOCKET_ID_ANY)) { - /* try other heaps */ - for (i = 0; i < RTE_MAX_NUMA_NODES; i++) { - if (socket == i) - continue; - - mz_addr = malloc_heap_alloc(&mcfg->malloc_heaps[i], - NULL, requested_len, flags, align, bound); - if (mz_addr != NULL) - break; - } - } + void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket, flags, + align, bound); if (mz_addr == NULL) { rte_errno = ENOMEM; diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index ab09b94..48ac604 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -269,8 +269,8 @@ malloc_elem_free_list_insert(struct malloc_elem *elem) /* * Remove the specified element from its heap's free list. */ -static void -elem_free_list_remove(struct malloc_elem *elem) +void +malloc_elem_free_list_remove(struct malloc_elem *elem) { LIST_REMOVE(elem, free_list); } @@ -290,7 +290,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align, const size_t trailer_size = elem->size - old_elem_size - size - MALLOC_ELEM_OVERHEAD; - elem_free_list_remove(elem); + malloc_elem_free_list_remove(elem); if (trailer_size > MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { /* split it, too much free space after elem */ @@ -363,7 +363,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) { erase = RTE_PTR_SUB(elem->next, MALLOC_ELEM_TRAILER_LEN); /* remove from free list, join to this one */ - elem_free_list_remove(elem->next); + malloc_elem_free_list_remove(elem->next); join_elem(elem, elem->next); /* erase header and trailer */ @@ -383,7 +383,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) { erase = RTE_PTR_SUB(elem, MALLOC_ELEM_TRAILER_LEN); /* remove from free list, join to this one */ - elem_free_list_remove(elem->prev); + malloc_elem_free_list_remove(elem->prev); new_elem = elem->prev; join_elem(new_elem, elem); @@ -402,7 +402,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem) { * blocks either immediately before or immediately after newly freed block * are also free, the blocks are merged together. 
*/ -int +struct malloc_elem * malloc_elem_free(struct malloc_elem *elem) { void *ptr; @@ -420,7 +420,87 @@ malloc_elem_free(struct malloc_elem *elem) memset(ptr, 0, data_len); - return 0; + return elem; +} + +/* assume all checks were already done */ +void +malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len) { + size_t len_before, len_after; + struct malloc_elem *prev, *next; + void *end, *elem_end; + + end = RTE_PTR_ADD(start, len); + elem_end = RTE_PTR_ADD(elem, elem->size); + len_before = RTE_PTR_DIFF(start, elem); + len_after = RTE_PTR_DIFF(elem_end, end); + + prev = elem->prev; + next = elem->next; + + if (len_after >= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { + /* split after */ + struct malloc_elem *split_after = end; + + split_elem(elem, split_after); + + next = split_after; + + malloc_elem_free_list_insert(split_after); + } else if (len_after >= MALLOC_ELEM_HEADER_LEN) { + struct malloc_elem *pad_elem = end; + + /* shrink current element */ + elem->size -= len_after; + memset(pad_elem, 0, sizeof(*pad_elem)); + + /* copy next element's data to our pad */ + memcpy(pad_elem, next, sizeof(*pad_elem)); + + /* pad next element */ + next->state = ELEM_PAD; + next->pad = len_after; + + /* next element is busy, would've been merged otherwise */ + pad_elem->pad = len_after; + pad_elem->size += len_after; + } else if (len_after > 0) { + rte_panic("Unaligned element, heap is probably corrupt\n"); + } + + if (len_before >= MALLOC_ELEM_OVERHEAD + MIN_DATA_SIZE) { + /* split before */ + struct malloc_elem *split_before = start; + + split_elem(elem, split_before); + + prev = elem; + elem = split_before; + + malloc_elem_free_list_insert(prev); + } else if (len_before > 0) { + /* + * unlike with elements after current, here we don't need to + * pad elements, but rather just increase the size of previous + * element, copy the old header and and set up trailer. + */ + void *trailer = RTE_PTR_ADD(prev, + prev->size - MALLOC_ELEM_TRAILER_LEN); + struct malloc_elem *new_elem = start; + + memcpy(new_elem, elem, sizeof(*elem)); + new_elem->size -= len_before; + + prev->size += len_before; + set_trailer(prev); + + elem = new_elem; + + /* erase old trailer */ + memset(trailer, 0, MALLOC_ELEM_TRAILER_LEN); + } + + remove_elem(elem); } /* @@ -446,7 +526,7 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) /* we now know the element fits, so remove from free list, * join the two */ - elem_free_list_remove(elem->next); + malloc_elem_free_list_remove(elem->next); join_elem(elem, elem->next); if (elem->size - new_size >= MIN_DATA_SIZE + MALLOC_ELEM_OVERHEAD) { diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index 330bddc..b47c55e 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -164,7 +164,7 @@ malloc_elem_alloc(struct malloc_elem *elem, size_t size, * blocks either immediately before or immediately after newly freed block * are also free, the blocks are merged together. */ -int +struct malloc_elem * malloc_elem_free(struct malloc_elem *elem); struct malloc_elem * @@ -177,6 +177,12 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem); int malloc_elem_resize(struct malloc_elem *elem, size_t size); +void +malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len); + +void +malloc_elem_free_list_remove(struct malloc_elem *elem); + /* * Given an element size, compute its freelist index. 
*/ diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 5fa21fe..0d61704 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -49,8 +49,10 @@ #include #include #include +#include #include "eal_internal_cfg.h" +#include "eal_memalloc.h" #include "malloc_elem.h" #include "malloc_heap.h" @@ -151,46 +153,304 @@ find_suitable_element(struct malloc_heap *heap, size_t size, * scan fails. Once the new memseg is added, it re-scans and should return * the new element after releasing the lock. */ -void * -malloc_heap_alloc(struct malloc_heap *heap, - const char *type __attribute__((unused)), size_t size, unsigned flags, - size_t align, size_t bound) +static void * +heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size, + unsigned flags, size_t align, size_t bound) { struct malloc_elem *elem; size = RTE_CACHE_LINE_ROUNDUP(size); align = RTE_CACHE_LINE_ROUNDUP(align); - rte_spinlock_lock(&heap->lock); - elem = find_suitable_element(heap, size, flags, align, bound); if (elem != NULL) { elem = malloc_elem_alloc(elem, size, align, bound); + /* increase heap's count of allocated elements */ heap->alloc_count++; } - rte_spinlock_unlock(&heap->lock); return elem == NULL ? NULL : (void *)(&elem[1]); } +static void * +try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl, + const char *type, size_t size, int socket, unsigned flags, + size_t align, size_t bound) { + struct malloc_elem *elem; + struct rte_memseg **ms; + size_t map_len; + void *map_addr; + int i, n_pages, allocd_pages; + void *ret; + + align = RTE_MAX(align, MALLOC_ELEM_HEADER_LEN); + map_len = RTE_ALIGN_CEIL(align + size + MALLOC_ELEM_TRAILER_LEN, + msl->hugepage_sz); + + n_pages = map_len / msl->hugepage_sz; + + /* we can't know in advance how many pages we'll need, so malloc */ + ms = malloc(sizeof(*ms) * n_pages); + + allocd_pages = eal_memalloc_alloc_page_bulk(ms, n_pages, + msl->hugepage_sz, socket, true); + + /* make sure we've allocated our pages... */ + if (allocd_pages != n_pages) + goto free_ms; + + map_addr = ms[0]->addr; + + /* add newly minted memsegs to malloc heap */ + elem = malloc_heap_add_memory(heap, msl, map_addr, map_len); + + RTE_LOG(DEBUG, EAL, "Heap on socket %d was expanded by %zdMB\n", + msl->socket_id, map_len >> 20ULL); + + /* try once more, as now we have allocated new memory */ + ret = heap_alloc(heap, type, size, flags, + align == 0 ? 
1 : align, bound); + + if (ret == NULL) + goto free_elem; + + free(ms); + return ret; + +free_elem: + malloc_elem_free_list_remove(elem); + malloc_elem_hide_region(elem, map_addr, map_len); + heap->total_size -= map_len; + + RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate, so shrinking heap on socket %d by %zdMB\n", + __func__, socket, map_len >> 20ULL); + + for (i = 0; i < n_pages; i++) { + eal_memalloc_free_page(ms[i]); + } +free_ms: + free(ms); + return NULL; +} + +static int +compare_pagesz(const void *a, const void *b) { + const struct rte_memseg_list *msla = a; + const struct rte_memseg_list *mslb = b; + + if (msla->hugepage_sz < mslb->hugepage_sz) + return 1; + if (msla->hugepage_sz > mslb->hugepage_sz) + return -1; + return 0; +} + +/* this will try lower page sizes first */ +static void * +heap_alloc_on_socket(const char *type, size_t size, int socket, + unsigned flags, size_t align, size_t bound) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct malloc_heap *heap = &mcfg->malloc_heaps[socket]; + struct rte_memseg_list *requested_msls[RTE_MAX_MEMSEG_LISTS]; + struct rte_memseg_list *other_msls[RTE_MAX_MEMSEG_LISTS]; + int i, n_other_msls = 0, n_requested_msls = 0; + bool size_hint = (flags & RTE_MEMZONE_SIZE_HINT_ONLY) > 0; + unsigned size_flags = flags & ~RTE_MEMZONE_SIZE_HINT_ONLY; + void *ret; + + rte_spinlock_lock(&(heap->lock)); + + /* for legacy mode, try once and with all flags */ + if (internal_config.legacy_mem) { + ret = heap_alloc(heap, type, size, flags, + align == 0 ? 1 : align, bound); + goto alloc_unlock; + } + + /* + * we do not pass the size hint here, because even if allocation fails, + * we may still be able to allocate memory from appropriate page sizes, + * we just need to request more memory first. + */ + ret = heap_alloc(heap, type, size, size_flags, align == 0 ? 1 : align, + bound); + if (ret != NULL) + goto alloc_unlock; + + memset(requested_msls, 0, sizeof(requested_msls)); + memset(other_msls, 0, sizeof(other_msls)); + + /* + * go through memseg list and take note of all the page sizes available, + * and if any of them were specifically requested by the user. + */ + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + struct rte_memseg_list *msl = &mcfg->memsegs[i]; + + if (msl->socket_id != socket) + continue; + + if (msl->base_va == NULL) + continue; + + /* if pages of specific size were requested */ + if (size_flags != 0 && check_hugepage_sz(size_flags, + msl->hugepage_sz)) { + requested_msls[n_requested_msls++] = msl; + } else if (size_flags == 0 || size_hint) { + other_msls[n_other_msls++] = msl; + } + } + + /* sort the lists, smallest first */ + qsort(requested_msls, n_requested_msls, sizeof(requested_msls[0]), + compare_pagesz); + qsort(other_msls, n_other_msls, sizeof(other_msls[0]), + compare_pagesz); + + for (i = 0; i < n_requested_msls; i++) { + struct rte_memseg_list *msl = requested_msls[i]; + + /* + * do not pass the size hint here, as user expects other page + * sizes first, before resorting to best effort allocation. + */ + ret = try_expand_heap(heap, msl, type, size, socket, size_flags, + align, bound); + if (ret != NULL) + goto alloc_unlock; + } + if (n_other_msls == 0) + goto alloc_unlock; + + /* now, try reserving with size hint */ + ret = heap_alloc(heap, type, size, flags, align == 0 ? 
1 : align, + bound); + if (ret != NULL) + goto alloc_unlock; + + /* + * we still couldn't reserve memory, so try expanding heap with other + * page sizes, if there are any + */ + for (i = 0; i < n_other_msls; i++) { + struct rte_memseg_list *msl = other_msls[i]; + + ret = try_expand_heap(heap, msl, type, size, socket, flags, + align, bound); + if (ret != NULL) + goto alloc_unlock; + } +alloc_unlock: + rte_spinlock_unlock(&(heap->lock)); + return ret; +} + +void * +malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags, + size_t align, size_t bound) { + int socket, i; + void *ret; + + /* return NULL if size is 0 or alignment is not power-of-2 */ + if (size == 0 || (align && !rte_is_power_of_2(align))) + return NULL; + + if (!rte_eal_has_hugepages()) + socket_arg = SOCKET_ID_ANY; + + if (socket_arg == SOCKET_ID_ANY) + socket = malloc_get_numa_socket(); + else + socket = socket_arg; + + /* Check socket parameter */ + if (socket >= RTE_MAX_NUMA_NODES) + return NULL; + + // TODO: add warning for alignments bigger than page size if not VFIO + + ret = heap_alloc_on_socket(type, size, socket, flags, align, bound); + if (ret != NULL || socket_arg != SOCKET_ID_ANY) + return ret; + + /* try other heaps */ + for (i = 0; i < (int) rte_num_sockets(); i++) { + if (i == socket) + continue; + ret = heap_alloc_on_socket(type, size, socket, flags, + align, bound); + if (ret != NULL) + return ret; + } + return NULL; +} + int malloc_heap_free(struct malloc_elem *elem) { struct malloc_heap *heap; - int ret; + void *start, *aligned_start, *end, *aligned_end; + size_t len, aligned_len; + const struct rte_memseg_list *msl; + int n_pages, page_idx, max_page_idx, ret; if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) return -1; /* elem may be merged with previous element, so keep heap address */ heap = elem->heap; + msl = elem->msl; rte_spinlock_lock(&(heap->lock)); - ret = malloc_elem_free(elem); + elem = malloc_elem_free(elem); - rte_spinlock_unlock(&(heap->lock)); + /* anything after this is a bonus */ + ret = 0; + + /* ...of which we can't avail if we are in legacy mode */ + if (internal_config.legacy_mem) + goto free_unlock; + + /* check if we can free any memory back to the system */ + if (elem->size < msl->hugepage_sz) + goto free_unlock; + + /* probably, but let's make sure, as we may not be using up full page */ + start = elem; + len = elem->size; + aligned_start = RTE_PTR_ALIGN_CEIL(start, msl->hugepage_sz); + end = RTE_PTR_ADD(elem, len); + aligned_end = RTE_PTR_ALIGN_FLOOR(end, msl->hugepage_sz); + aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start); + + /* can't free anything */ + if (aligned_len < msl->hugepage_sz) + goto free_unlock; + + malloc_elem_free_list_remove(elem); + + malloc_elem_hide_region(elem, (void*) aligned_start, aligned_len); + + /* we don't really care if we fail to deallocate memory */ + n_pages = aligned_len / msl->hugepage_sz; + page_idx = RTE_PTR_DIFF(aligned_start, msl->base_va) / msl->hugepage_sz; + max_page_idx = page_idx + n_pages; + + for (; page_idx < max_page_idx; page_idx++) { + struct rte_memseg *ms; + + ms = rte_fbarray_get(&msl->memseg_arr, page_idx); + eal_memalloc_free_page(ms); + heap->total_size -= msl->hugepage_sz; + } + + RTE_LOG(DEBUG, EAL, "Heap on socket %d was shrunk by %zdMB\n", + msl->socket_id, aligned_len >> 20ULL); +free_unlock: + rte_spinlock_unlock(&(heap->lock)); return ret; } diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index df04dd8..3fcd14f 100644 --- 
a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -53,8 +53,8 @@ malloc_get_numa_socket(void) } void * -malloc_heap_alloc(struct malloc_heap *heap, const char *type, size_t size, - unsigned flags, size_t align, size_t bound); +malloc_heap_alloc(const char *type, size_t size, int socket, unsigned flags, + size_t align, size_t bound); int malloc_heap_free(struct malloc_elem *elem); diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index 92cd7d8..dc3199a 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -68,9 +68,7 @@ void rte_free(void *addr) void * rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) { - struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - int socket, i; - void *ret; + int socket; /* return NULL if size is 0 or alignment is not power-of-2 */ if (size == 0 || (align && !rte_is_power_of_2(align))) @@ -88,24 +86,8 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) if (socket >= RTE_MAX_NUMA_NODES) return NULL; - ret = malloc_heap_alloc(&mcfg->malloc_heaps[socket], type, - size, 0, align == 0 ? 1 : align, 0); - if (ret != NULL || socket_arg != SOCKET_ID_ANY) - return ret; - - /* try other heaps */ - for (i = 0; i < RTE_MAX_NUMA_NODES; i++) { - /* we already tried this one */ - if (i == socket) - continue; - - ret = malloc_heap_alloc(&mcfg->malloc_heaps[i], type, - size, 0, align == 0 ? 1 : align, 0); - if (ret != NULL) - return ret; - } - - return NULL; + return malloc_heap_alloc(type, size, socket_arg, 0, + align == 0 ? 1 : align, 0); } /* From patchwork Tue Dec 19 11:14:44 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32470 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 4ED2B1B211; Tue, 19 Dec 2017 12:15:28 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 9FA531B022 for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3709900" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga008.fm.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBErYk003138; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBEqsb010302; Tue, 19 Dec 2017 11:14:52 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEqF7010298; Tue, 19 Dec 2017 11:14:52 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:44 +0000 Message-Id: <5affaa884964f8cd1025e991caf58c67783a52ab.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: 
References: Subject: [dpdk-dev] [RFC v2 17/23] eal: add backend support for contiguous memory allocation X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" No major changes, just add some checks in a few key places, and a new parameter to pass around. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 16 +++-- lib/librte_eal/common/malloc_elem.c | 105 +++++++++++++++++++++++------ lib/librte_eal/common/malloc_elem.h | 6 +- lib/librte_eal/common/malloc_heap.c | 54 +++++++++------ lib/librte_eal/common/malloc_heap.h | 6 +- lib/librte_eal/common/rte_malloc.c | 38 +++++++---- 6 files changed, 158 insertions(+), 67 deletions(-) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index c571145..542ae90 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -127,7 +127,8 @@ find_heap_max_free_elem(int *s, unsigned align) static const struct rte_memzone * memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, - int socket_id, unsigned flags, unsigned align, unsigned bound) + int socket_id, unsigned flags, unsigned align, unsigned bound, + bool contig) { struct rte_memzone *mz; struct rte_mem_config *mcfg; @@ -217,7 +218,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, /* allocate memory on heap */ void *mz_addr = malloc_heap_alloc(NULL, requested_len, socket, flags, - align, bound); + align, bound, contig); if (mz_addr == NULL) { rte_errno = ENOMEM; @@ -251,7 +252,7 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, static const struct rte_memzone * rte_memzone_reserve_thread_safe(const char *name, size_t len, int socket_id, unsigned flags, unsigned align, - unsigned bound) + unsigned bound, bool contig) { struct rte_mem_config *mcfg; const struct rte_memzone *mz = NULL; @@ -262,7 +263,7 @@ rte_memzone_reserve_thread_safe(const char *name, size_t len, rte_rwlock_write_lock(&mcfg->mlock); mz = memzone_reserve_aligned_thread_unsafe( - name, len, socket_id, flags, align, bound); + name, len, socket_id, flags, align, bound, contig); rte_rwlock_write_unlock(&mcfg->mlock); @@ -279,7 +280,7 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id, unsigned flags, unsigned align, unsigned bound) { return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, - align, bound); + align, bound, false); } /* @@ -291,7 +292,7 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id, unsigned flags, unsigned align) { return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, - align, 0); + align, 0, false); } /* @@ -303,7 +304,8 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id, unsigned flags) { return rte_memzone_reserve_thread_safe(name, len, socket_id, - flags, RTE_CACHE_LINE_SIZE, 0); + flags, RTE_CACHE_LINE_SIZE, 0, + false); } int diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c index 48ac604..a7d7cef 100644 --- a/lib/librte_eal/common/malloc_elem.c +++ b/lib/librte_eal/common/malloc_elem.c @@ -45,6 +45,7 @@ #include #include +#include "eal_memalloc.h" #include "malloc_elem.h" #include "malloc_heap.h" @@ -122,32 +123,83 @@ malloc_elem_insert(struct malloc_elem *elem) } /* + * Attempt to find enough physically contiguous memory in this block to 
store + * our data. Assume that element has at least enough space to fit in the data, + * so we just check the page addresses. + */ +static bool +elem_check_phys_contig(const struct rte_memseg_list *msl, void *start, + size_t size) { + uint64_t page_sz; + void *aligned_start, *end, *aligned_end; + size_t aligned_len; + + /* figure out how many pages we need to fit in current data */ + page_sz = msl->hugepage_sz; + aligned_start = RTE_PTR_ALIGN_FLOOR(start, page_sz); + end = RTE_PTR_ADD(start, size); + aligned_end = RTE_PTR_ALIGN_CEIL(end, page_sz); + + aligned_len = RTE_PTR_DIFF(aligned_end, aligned_start); + + return eal_memalloc_is_contig(msl, aligned_start, aligned_len); +} + +/* * calculate the starting point of where data of the requested size * and alignment would fit in the current element. If the data doesn't * fit, return NULL. */ static void * elem_start_pt(struct malloc_elem *elem, size_t size, unsigned align, - size_t bound) + size_t bound, bool contig) { - const size_t bmask = ~(bound - 1); - uintptr_t end_pt = (uintptr_t)elem + - elem->size - MALLOC_ELEM_TRAILER_LEN; - uintptr_t new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align); - uintptr_t new_elem_start; - - /* check boundary */ - if ((new_data_start & bmask) != ((end_pt - 1) & bmask)) { - end_pt = RTE_ALIGN_FLOOR(end_pt, bound); - new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align); - if (((end_pt - 1) & bmask) != (new_data_start & bmask)) - return NULL; - } + size_t elem_size = elem->size; - new_elem_start = new_data_start - MALLOC_ELEM_HEADER_LEN; + /* + * we're allocating from the end, so adjust the size of element by page + * size each time + */ + while (elem_size >= size) { + const size_t bmask = ~(bound - 1); + uintptr_t end_pt = (uintptr_t)elem + + elem_size - MALLOC_ELEM_TRAILER_LEN; + uintptr_t new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align); + uintptr_t new_elem_start; + + /* check boundary */ + if ((new_data_start & bmask) != ((end_pt - 1) & bmask)) { + end_pt = RTE_ALIGN_FLOOR(end_pt, bound); + new_data_start = RTE_ALIGN_FLOOR((end_pt - size), align); + end_pt = new_data_start + size; + + if (((end_pt - 1) & bmask) != (new_data_start & bmask)) + return NULL; + } - /* if the new start point is before the exist start, it won't fit */ - return (new_elem_start < (uintptr_t)elem) ? NULL : (void *)new_elem_start; + new_elem_start = new_data_start - MALLOC_ELEM_HEADER_LEN; + + /* if the new start point is before the exist start, it won't fit */ + if (new_elem_start < (uintptr_t)elem) + return NULL; + + if (contig) { + size_t new_data_size = end_pt - new_data_start; + + /* + * if physical contiguousness was requested and we + * couldn't fit all data into one physically contiguous + * block, try again with lower addresses. 
+ */ + if (!elem_check_phys_contig(elem->msl, + (void*) new_data_start, new_data_size)) { + elem_size -= align; + continue; + } + } + return (void *) new_elem_start; + } + return NULL; } /* @@ -156,9 +208,9 @@ elem_start_pt(struct malloc_elem *elem, size_t size, unsigned align, */ int malloc_elem_can_hold(struct malloc_elem *elem, size_t size, unsigned align, - size_t bound) + size_t bound, bool contig) { - return elem_start_pt(elem, size, align, bound) != NULL; + return elem_start_pt(elem, size, align, bound, contig) != NULL; } /* @@ -283,9 +335,10 @@ malloc_elem_free_list_remove(struct malloc_elem *elem) */ struct malloc_elem * malloc_elem_alloc(struct malloc_elem *elem, size_t size, unsigned align, - size_t bound) + size_t bound, bool contig) { - struct malloc_elem *new_elem = elem_start_pt(elem, size, align, bound); + struct malloc_elem *new_elem = elem_start_pt(elem, size, align, bound, + contig); const size_t old_elem_size = (uintptr_t)new_elem - (uintptr_t)elem; const size_t trailer_size = elem->size - old_elem_size - size - MALLOC_ELEM_OVERHEAD; @@ -508,9 +561,11 @@ malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len) { * immediately after it in memory. */ int -malloc_elem_resize(struct malloc_elem *elem, size_t size) +malloc_elem_resize(struct malloc_elem *elem, size_t size, bool contig) { const size_t new_size = size + elem->pad + MALLOC_ELEM_OVERHEAD; + const size_t new_data_size = new_size - MALLOC_ELEM_OVERHEAD; + void *data_ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN); /* if we request a smaller size, then always return ok */ if (elem->size >= new_size) @@ -523,6 +578,12 @@ malloc_elem_resize(struct malloc_elem *elem, size_t size) if (elem->size + elem->next->size < new_size) return -1; + /* if physical contiguousness was requested, check that as well */ + if (contig && !elem_check_phys_contig(elem->msl, + data_ptr, new_data_size)) { + return -1; + } + /* we now know the element fits, so remove from free list, * join the two */ diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h index b47c55e..02d6bd7 100644 --- a/lib/librte_eal/common/malloc_elem.h +++ b/lib/librte_eal/common/malloc_elem.h @@ -149,7 +149,7 @@ malloc_elem_insert(struct malloc_elem *elem); */ int malloc_elem_can_hold(struct malloc_elem *elem, size_t size, - unsigned align, size_t bound); + unsigned align, size_t bound, bool contig); /* * reserve a block of data in an existing malloc_elem. If the malloc_elem @@ -157,7 +157,7 @@ malloc_elem_can_hold(struct malloc_elem *elem, size_t size, */ struct malloc_elem * malloc_elem_alloc(struct malloc_elem *elem, size_t size, - unsigned align, size_t bound); + unsigned align, size_t bound, bool contig); /* * free a malloc_elem block by adding it to the free list. If the @@ -175,7 +175,7 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem); * immediately after it in memory. 
*/ int -malloc_elem_resize(struct malloc_elem *elem, size_t size); +malloc_elem_resize(struct malloc_elem *elem, size_t size, bool contig); void malloc_elem_hide_region(struct malloc_elem *elem, void *start, size_t len); diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c index 0d61704..427f7c6 100644 --- a/lib/librte_eal/common/malloc_heap.c +++ b/lib/librte_eal/common/malloc_heap.c @@ -123,7 +123,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl, */ static struct malloc_elem * find_suitable_element(struct malloc_heap *heap, size_t size, - unsigned flags, size_t align, size_t bound) + unsigned flags, size_t align, size_t bound, bool contig) { size_t idx; struct malloc_elem *elem, *alt_elem = NULL; @@ -132,7 +132,8 @@ find_suitable_element(struct malloc_heap *heap, size_t size, idx < RTE_HEAP_NUM_FREELISTS; idx++) { for (elem = LIST_FIRST(&heap->free_head[idx]); !!elem; elem = LIST_NEXT(elem, free_list)) { - if (malloc_elem_can_hold(elem, size, align, bound)) { + if (malloc_elem_can_hold(elem, size, align, bound, + contig)) { if (check_hugepage_sz(flags, elem->msl->hugepage_sz)) return elem; if (alt_elem == NULL) @@ -155,16 +156,16 @@ find_suitable_element(struct malloc_heap *heap, size_t size, */ static void * heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size, - unsigned flags, size_t align, size_t bound) + unsigned flags, size_t align, size_t bound, bool contig) { struct malloc_elem *elem; size = RTE_CACHE_LINE_ROUNDUP(size); align = RTE_CACHE_LINE_ROUNDUP(align); - elem = find_suitable_element(heap, size, flags, align, bound); + elem = find_suitable_element(heap, size, flags, align, bound, contig); if (elem != NULL) { - elem = malloc_elem_alloc(elem, size, align, bound); + elem = malloc_elem_alloc(elem, size, align, bound, contig); /* increase heap's count of allocated elements */ heap->alloc_count++; @@ -176,13 +177,13 @@ heap_alloc(struct malloc_heap *heap, const char *type __rte_unused, size_t size, static void * try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl, const char *type, size_t size, int socket, unsigned flags, - size_t align, size_t bound) { + size_t align, size_t bound, bool contig) { struct malloc_elem *elem; struct rte_memseg **ms; - size_t map_len; + size_t map_len, data_start_offset; void *map_addr; int i, n_pages, allocd_pages; - void *ret; + void *ret, *data_start; align = RTE_MAX(align, MALLOC_ELEM_HEADER_LEN); map_len = RTE_ALIGN_CEIL(align + size + MALLOC_ELEM_TRAILER_LEN, @@ -200,6 +201,16 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl, if (allocd_pages != n_pages) goto free_ms; + /* check if we wanted contiguous memory but didn't get it */ + data_start_offset = RTE_ALIGN(MALLOC_ELEM_HEADER_LEN, align); + data_start = RTE_PTR_ADD(ms[0]->addr, data_start_offset); + if (contig && !eal_memalloc_is_contig(msl, data_start, + n_pages * msl->hugepage_sz)) { + RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate physically contiguous space\n", + __func__); + goto free_pages; + } + map_addr = ms[0]->addr; /* add newly minted memsegs to malloc heap */ @@ -210,7 +221,7 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl, /* try once more, as now we have allocated new memory */ ret = heap_alloc(heap, type, size, flags, - align == 0 ? 1 : align, bound); + align == 0 ? 
1 : align, bound, contig); if (ret == NULL) goto free_elem; @@ -225,7 +236,7 @@ try_expand_heap(struct malloc_heap *heap, struct rte_memseg_list *msl, RTE_LOG(DEBUG, EAL, "%s(): couldn't allocate, so shrinking heap on socket %d by %zdMB\n", __func__, socket, map_len >> 20ULL); - +free_pages: for (i = 0; i < n_pages; i++) { eal_memalloc_free_page(ms[i]); } @@ -249,7 +260,7 @@ compare_pagesz(const void *a, const void *b) { /* this will try lower page sizes first */ static void * heap_alloc_on_socket(const char *type, size_t size, int socket, - unsigned flags, size_t align, size_t bound) { + unsigned flags, size_t align, size_t bound, bool contig) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; struct malloc_heap *heap = &mcfg->malloc_heaps[socket]; struct rte_memseg_list *requested_msls[RTE_MAX_MEMSEG_LISTS]; @@ -264,7 +275,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, /* for legacy mode, try once and with all flags */ if (internal_config.legacy_mem) { ret = heap_alloc(heap, type, size, flags, - align == 0 ? 1 : align, bound); + align == 0 ? 1 : align, bound, contig); goto alloc_unlock; } @@ -274,7 +285,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, * we just need to request more memory first. */ ret = heap_alloc(heap, type, size, size_flags, align == 0 ? 1 : align, - bound); + bound, contig); if (ret != NULL) goto alloc_unlock; @@ -317,7 +328,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, * sizes first, before resorting to best effort allocation. */ ret = try_expand_heap(heap, msl, type, size, socket, size_flags, - align, bound); + align, bound, contig); if (ret != NULL) goto alloc_unlock; } @@ -326,7 +337,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, /* now, try reserving with size hint */ ret = heap_alloc(heap, type, size, flags, align == 0 ? 
1 : align, - bound); + bound, contig); if (ret != NULL) goto alloc_unlock; @@ -338,7 +349,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, struct rte_memseg_list *msl = other_msls[i]; ret = try_expand_heap(heap, msl, type, size, socket, flags, - align, bound); + align, bound, contig); if (ret != NULL) goto alloc_unlock; } @@ -349,7 +360,7 @@ heap_alloc_on_socket(const char *type, size_t size, int socket, void * malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags, - size_t align, size_t bound) { + size_t align, size_t bound, bool contig) { int socket, i; void *ret; @@ -371,7 +382,8 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags, // TODO: add warning for alignments bigger than page size if not VFIO - ret = heap_alloc_on_socket(type, size, socket, flags, align, bound); + ret = heap_alloc_on_socket(type, size, socket, flags, align, bound, + contig); if (ret != NULL || socket_arg != SOCKET_ID_ANY) return ret; @@ -380,7 +392,7 @@ malloc_heap_alloc(const char *type, size_t size, int socket_arg, unsigned flags, if (i == socket) continue; ret = heap_alloc_on_socket(type, size, socket, flags, - align, bound); + align, bound, contig); if (ret != NULL) return ret; } @@ -455,7 +467,7 @@ malloc_heap_free(struct malloc_elem *elem) { } int -malloc_heap_resize(struct malloc_elem *elem, size_t size) { +malloc_heap_resize(struct malloc_elem *elem, size_t size, bool contig) { int ret; if (!malloc_elem_cookies_ok(elem) || elem->state != ELEM_BUSY) @@ -463,7 +475,7 @@ malloc_heap_resize(struct malloc_elem *elem, size_t size) { rte_spinlock_lock(&(elem->heap->lock)); - ret = malloc_elem_resize(elem, size); + ret = malloc_elem_resize(elem, size, contig); rte_spinlock_unlock(&(elem->heap->lock)); diff --git a/lib/librte_eal/common/malloc_heap.h b/lib/librte_eal/common/malloc_heap.h index 3fcd14f..e95b526 100644 --- a/lib/librte_eal/common/malloc_heap.h +++ b/lib/librte_eal/common/malloc_heap.h @@ -34,6 +34,8 @@ #ifndef MALLOC_HEAP_H_ #define MALLOC_HEAP_H_ +#include + #include #include @@ -54,13 +56,13 @@ malloc_get_numa_socket(void) void * malloc_heap_alloc(const char *type, size_t size, int socket, unsigned flags, - size_t align, size_t bound); + size_t align, size_t bound, bool contig); int malloc_heap_free(struct malloc_elem *elem); int -malloc_heap_resize(struct malloc_elem *elem, size_t size); +malloc_heap_resize(struct malloc_elem *elem, size_t size, bool contig); int malloc_heap_get_stats(struct malloc_heap *heap, diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index dc3199a..623725e 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -62,12 +62,9 @@ void rte_free(void *addr) rte_panic("Fatal error: Invalid memory\n"); } -/* - * Allocate memory on specified heap. - */ -void * -rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) -{ +static void * +malloc_socket(const char *type, size_t size, unsigned align, int socket_arg, + bool contig) { int socket; /* return NULL if size is 0 or alignment is not power-of-2 */ @@ -86,8 +83,16 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) if (socket >= RTE_MAX_NUMA_NODES) return NULL; - return malloc_heap_alloc(type, size, socket_arg, 0, - align == 0 ? 1 : align, 0); + return malloc_heap_alloc(type, size, socket_arg, 0, align, 0, contig); +} + +/* + * Allocate memory on specified heap. 
+ */ +void * +rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) +{ + return malloc_socket(type, size, align, socket_arg, false); } /* @@ -138,8 +143,8 @@ rte_calloc(const char *type, size_t num, size_t size, unsigned align) /* * Resize allocated memory. */ -void * -rte_realloc(void *ptr, size_t size, unsigned align) +static void * +do_realloc(void *ptr, size_t size, unsigned align, bool contig) { if (ptr == NULL) return rte_malloc(NULL, size, align); @@ -151,12 +156,12 @@ rte_realloc(void *ptr, size_t size, unsigned align) size = RTE_CACHE_LINE_ROUNDUP(size), align = RTE_CACHE_LINE_ROUNDUP(align); /* check alignment matches first, and if ok, see if we can resize block */ if (RTE_PTR_ALIGN(ptr,align) == ptr && - malloc_heap_resize(elem, size) == 0) + malloc_heap_resize(elem, size, contig) == 0) return ptr; /* either alignment is off, or we have no room to expand, * so move data. */ - void *new_ptr = rte_malloc(NULL, size, align); + void *new_ptr = malloc_socket(NULL, size, align, SOCKET_ID_ANY, contig); if (new_ptr == NULL) return NULL; const unsigned old_size = elem->size - MALLOC_ELEM_OVERHEAD; @@ -166,6 +171,15 @@ rte_realloc(void *ptr, size_t size, unsigned align) return new_ptr; } +/* + * Resize allocated memory. + */ +void * +rte_realloc(void *ptr, size_t size, unsigned align) +{ + return do_realloc(ptr, size, align, false); +} + int rte_malloc_validate(const void *ptr, size_t *size) { From patchwork Tue Dec 19 11:14:45 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32465 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 6C51D1B1D6; Tue, 19 Dec 2017 12:15:19 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 611471D7 for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="14827542" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga001.fm.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEr3D003141; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErJN010309; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBErlh010305; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:45 +0000 Message-Id: <65d0cc5505897c256559e5788fea00777d85699a.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 18/23] eal: add rte_malloc support for allocating contiguous memory X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This adds a new set of _contig API's to rte_malloc. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/include/rte_malloc.h | 181 +++++++++++++++++++++++++++++ lib/librte_eal/common/rte_malloc.c | 63 ++++++++++ 2 files changed, 244 insertions(+) diff --git a/lib/librte_eal/common/include/rte_malloc.h b/lib/librte_eal/common/include/rte_malloc.h index 5d4c11a..c132d33 100644 --- a/lib/librte_eal/common/include/rte_malloc.h +++ b/lib/librte_eal/common/include/rte_malloc.h @@ -242,6 +242,187 @@ void * rte_calloc_socket(const char *type, size_t num, size_t size, unsigned align, int socket); /** + * This function allocates memory from the huge-page area of memory. The memory + * is not cleared. In NUMA systems, the memory allocated resides on the same + * NUMA socket as the core that calls this function. + * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param size + * Size (in bytes) to be allocated. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_malloc_contig(const char *type, size_t size, unsigned align); + +/** + * Allocate zero'ed memory from the heap. + * + * Equivalent to rte_malloc() except that the memory zone is + * initialised with zeros. In NUMA systems, the memory allocated resides on the + * same NUMA socket as the core that calls this function. + * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param size + * Size (in bytes) to be allocated. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must obviously be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_zmalloc_contig(const char *type, size_t size, unsigned align); + +/** + * Replacement function for calloc(), using huge-page memory. Memory area is + * initialised with zeros. In NUMA systems, the memory allocated resides on the + * same NUMA socket as the core that calls this function. + * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param num + * Number of elements to be allocated. + * @param size + * Size (in bytes) of a single element. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must obviously be a power of two. 
(Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_calloc_contig(const char *type, size_t num, size_t size, unsigned align); + +/** + * Replacement function for realloc(), using huge-page memory. Reserved area + * memory is resized, preserving contents. In NUMA systems, the new area + * resides on the same NUMA socket as the old area. + * + * @param ptr + * Pointer to already allocated memory + * @param size + * Size (in bytes) of new area. If this is 0, memory is freed. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must obviously be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the reallocated memory. + */ +void * +rte_realloc_contig(void *ptr, size_t size, unsigned align); + +/** + * This function allocates memory from the huge-page area of memory. The memory + * is not cleared. + * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param size + * Size (in bytes) to be allocated. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @param socket + * NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function + * will behave the same as rte_malloc(). + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_malloc_socket_contig(const char *type, size_t size, unsigned align, int socket); + +/** + * Allocate zero'ed memory from the heap. + * + * Equivalent to rte_malloc() except that the memory zone is + * initialised with zeros. + * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param size + * Size (in bytes) to be allocated. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must obviously be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @param socket + * NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function + * will behave the same as rte_zmalloc(). + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_zmalloc_socket_contig(const char *type, size_t size, unsigned align, int socket); + +/** + * Replacement function for calloc(), using huge-page memory. Memory area is + * initialised with zeros. 
+ * + * @param type + * A string identifying the type of allocated objects (useful for debug + * purposes, such as identifying the cause of a memory leak). Can be NULL. + * @param num + * Number of elements to be allocated. + * @param size + * Size (in bytes) of a single element. + * @param align + * If 0, the return is a pointer that is suitably aligned for any kind of + * variable (in the same manner as malloc()). + * Otherwise, the return is a pointer that is a multiple of *align*. In + * this case, it must obviously be a power of two. (Minimum alignment is the + * cacheline size, i.e. 64-bytes) + * @param socket + * NUMA socket to allocate memory on. If SOCKET_ID_ANY is used, this function + * will behave the same as rte_calloc(). + * @return + * - NULL on error. Not enough memory, or invalid arguments (size is 0, + * align is not a power of two). + * - Otherwise, the pointer to the allocated object. + */ +void * +rte_calloc_socket_contig(const char *type, size_t num, size_t size, unsigned align, int socket); + +/** * Frees the memory space pointed to by the provided pointer. * * This pointer must have been returned by a previous call to diff --git a/lib/librte_eal/common/rte_malloc.c b/lib/librte_eal/common/rte_malloc.c index 623725e..e8ad085 100644 --- a/lib/librte_eal/common/rte_malloc.c +++ b/lib/librte_eal/common/rte_malloc.c @@ -96,6 +96,15 @@ rte_malloc_socket(const char *type, size_t size, unsigned align, int socket_arg) } /* + * Allocate memory on specified heap. + */ +void * +rte_malloc_socket_contig(const char *type, size_t size, unsigned align, int socket_arg) +{ + return malloc_socket(type, size, align, socket_arg, true); +} + +/* * Allocate memory on default heap. */ void * @@ -105,6 +114,15 @@ rte_malloc(const char *type, size_t size, unsigned align) } /* + * Allocate memory on default heap. + */ +void * +rte_malloc_contig(const char *type, size_t size, unsigned align) +{ + return rte_malloc_socket_contig(type, size, align, SOCKET_ID_ANY); +} + +/* * Allocate zero'd memory on specified heap. */ void * @@ -114,6 +132,15 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket) } /* + * Allocate zero'd memory on specified heap. + */ +void * +rte_zmalloc_socket_contig(const char *type, size_t size, unsigned align, int socket) +{ + return rte_malloc_socket_contig(type, size, align, socket); +} + +/* * Allocate zero'd memory on default heap. */ void * @@ -123,6 +150,15 @@ rte_zmalloc(const char *type, size_t size, unsigned align) } /* + * Allocate zero'd memory on default heap. + */ +void * +rte_zmalloc_contig(const char *type, size_t size, unsigned align) +{ + return rte_zmalloc_socket_contig(type, size, align, SOCKET_ID_ANY); +} + +/* * Allocate zero'd memory on specified heap. */ void * @@ -132,6 +168,15 @@ rte_calloc_socket(const char *type, size_t num, size_t size, unsigned align, int } /* + * Allocate zero'd physically contiguous memory on specified heap. + */ +void * +rte_calloc_socket_contig(const char *type, size_t num, size_t size, unsigned align, int socket) +{ + return rte_zmalloc_socket_contig(type, num * size, align, socket); +} + +/* * Allocate zero'd memory on default heap. */ void * @@ -141,6 +186,15 @@ rte_calloc(const char *type, size_t num, size_t size, unsigned align) } /* + * Allocate zero'd physically contiguous memory on default heap. + */ +void * +rte_calloc_contig(const char *type, size_t num, size_t size, unsigned align) +{ + return rte_zmalloc_contig(type, num * size, align); +} + +/* * Resize allocated memory. 
*/ static void * @@ -180,6 +234,15 @@ rte_realloc(void *ptr, size_t size, unsigned align) return do_realloc(ptr, size, align, false); } +/* + * Resize allocated physically contiguous memory. + */ +void * +rte_realloc_contig(void *ptr, size_t size, unsigned align) +{ + return do_realloc(ptr, size, align, true); +} + int rte_malloc_validate(const void *ptr, size_t *size) { From patchwork Tue Dec 19 11:14:46 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32471 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 154031B21C; Tue, 19 Dec 2017 12:15:30 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id D677E1B017 for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="185440372" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga005.jf.intel.com with ESMTP; 19 Dec 2017 03:14:53 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBErFd003144; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErjP010316; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEr7G010312; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:46 +0000 Message-Id: <6b44fff0a623ad3343ae51b5aef3da9fb6e102fa.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 19/23] eal: enable reserving physically contiguous memzones X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" This adds a new set of _contig API's to rte_memzone. Signed-off-by: Anatoly Burakov --- lib/librte_eal/common/eal_common_memzone.c | 44 ++++++++ lib/librte_eal/common/include/rte_memzone.h | 158 ++++++++++++++++++++++++++++ 2 files changed, 202 insertions(+) diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index 542ae90..a9a4bef 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -200,6 +200,12 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, socket_id = SOCKET_ID_ANY; if (len == 0) { + /* len == 0 is only allowed for non-contiguous zones */ + // TODO: technically, we can make it work, is it worth it? 
+ if (contig) { + rte_errno = EINVAL; + return NULL; + } if (bound != 0) requested_len = bound; else { @@ -285,6 +291,19 @@ rte_memzone_reserve_bounded(const char *name, size_t len, int socket_id, /* * Return a pointer to a correctly filled memzone descriptor (with a + * specified alignment and boundary). If the allocation cannot be done, + * return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_bounded_contig(const char *name, size_t len, int socket_id, + unsigned flags, unsigned align, unsigned bound) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, + align, bound, true); +} + +/* + * Return a pointer to a correctly filled memzone descriptor (with a * specified alignment). If the allocation cannot be done, return NULL. */ const struct rte_memzone * @@ -296,6 +315,18 @@ rte_memzone_reserve_aligned(const char *name, size_t len, int socket_id, } /* + * Return a pointer to a correctly filled memzone descriptor (with a + * specified alignment). If the allocation cannot be done, return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_aligned_contig(const char *name, size_t len, int socket_id, + unsigned flags, unsigned align) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, flags, + align, 0, true); +} + +/* * Return a pointer to a correctly filled memzone descriptor. If the * allocation cannot be done, return NULL. */ @@ -308,6 +339,19 @@ rte_memzone_reserve(const char *name, size_t len, int socket_id, false); } +/* + * Return a pointer to a correctly filled memzone descriptor. If the + * allocation cannot be done, return NULL. + */ +const struct rte_memzone * +rte_memzone_reserve_contig(const char *name, size_t len, int socket_id, + unsigned flags) +{ + return rte_memzone_reserve_thread_safe(name, len, socket_id, + flags, RTE_CACHE_LINE_SIZE, 0, + true); +} + int rte_memzone_free(const struct rte_memzone *mz) { diff --git a/lib/librte_eal/common/include/rte_memzone.h b/lib/librte_eal/common/include/rte_memzone.h index 6f0ba18..237fd31 100644 --- a/lib/librte_eal/common/include/rte_memzone.h +++ b/lib/librte_eal/common/include/rte_memzone.h @@ -257,6 +257,164 @@ const struct rte_memzone *rte_memzone_reserve_bounded(const char *name, unsigned flags, unsigned align, unsigned bound); /** + * Reserve a portion of physical memory. + * + * This function reserves some memory and returns a pointer to a + * correctly filled memzone descriptor. If the allocation cannot be + * done, return NULL. + * + * @param name + * The name of the memzone. If it already exists, the function will + * fail and return NULL. + * @param len + * The size of the memory to be reserved. If it + * is 0, the biggest contiguous zone will be reserved. + * @param socket_id + * The socket identifier in the case of + * NUMA. The value can be SOCKET_ID_ANY if there is no NUMA + * constraint for the reserved zone. + * @param flags + * The flags parameter is used to request memzones to be + * taken from specifically sized hugepages. + * - RTE_MEMZONE_2MB - Reserved from 2MB pages + * - RTE_MEMZONE_1GB - Reserved from 1GB pages + * - RTE_MEMZONE_16MB - Reserved from 16MB pages + * - RTE_MEMZONE_16GB - Reserved from 16GB pages + * - RTE_MEMZONE_256KB - Reserved from 256KB pages + * - RTE_MEMZONE_256MB - Reserved from 256MB pages + * - RTE_MEMZONE_512MB - Reserved from 512MB pages + * - RTE_MEMZONE_4GB - Reserved from 4GB pages + * - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if + * the requested page size is unavailable. 
+ * If this flag is not set, the function + * will return error on an unavailable size + * request. + * @return + * A pointer to a correctly-filled read-only memzone descriptor, or NULL + * on error. + * On error case, rte_errno will be set appropriately: + * - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure + * - E_RTE_SECONDARY - function was called from a secondary process instance + * - ENOSPC - the maximum number of memzones has already been allocated + * - EEXIST - a memzone with the same name already exists + * - ENOMEM - no appropriate memory area found in which to create memzone + * - EINVAL - invalid parameters + */ +const struct rte_memzone *rte_memzone_reserve_contig(const char *name, + size_t len, int socket_id, + unsigned flags); + +/** + * Reserve a portion of physical memory with alignment on a specified + * boundary. + * + * This function reserves some memory with alignment on a specified + * boundary, and returns a pointer to a correctly filled memzone + * descriptor. If the allocation cannot be done or if the alignment + * is not a power of 2, returns NULL. + * + * @param name + * The name of the memzone. If it already exists, the function will + * fail and return NULL. + * @param len + * The size of the memory to be reserved. If it + * is 0, the biggest contiguous zone will be reserved. + * @param socket_id + * The socket identifier in the case of + * NUMA. The value can be SOCKET_ID_ANY if there is no NUMA + * constraint for the reserved zone. + * @param flags + * The flags parameter is used to request memzones to be + * taken from specifically sized hugepages. + * - RTE_MEMZONE_2MB - Reserved from 2MB pages + * - RTE_MEMZONE_1GB - Reserved from 1GB pages + * - RTE_MEMZONE_16MB - Reserved from 16MB pages + * - RTE_MEMZONE_16GB - Reserved from 16GB pages + * - RTE_MEMZONE_256KB - Reserved from 256KB pages + * - RTE_MEMZONE_256MB - Reserved from 256MB pages + * - RTE_MEMZONE_512MB - Reserved from 512MB pages + * - RTE_MEMZONE_4GB - Reserved from 4GB pages + * - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if + * the requested page size is unavailable. + * If this flag is not set, the function + * will return error on an unavailable size + * request. + * @param align + * Alignment for resulting memzone. Must be a power of 2. + * @return + * A pointer to a correctly-filled read-only memzone descriptor, or NULL + * on error. + * On error case, rte_errno will be set appropriately: + * - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure + * - E_RTE_SECONDARY - function was called from a secondary process instance + * - ENOSPC - the maximum number of memzones has already been allocated + * - EEXIST - a memzone with the same name already exists + * - ENOMEM - no appropriate memory area found in which to create memzone + * - EINVAL - invalid parameters + */ +const struct rte_memzone *rte_memzone_reserve_aligned_contig(const char *name, + size_t len, int socket_id, + unsigned flags, unsigned align); + +/** + * Reserve a portion of physical memory with specified alignment and + * boundary. + * + * This function reserves some memory with specified alignment and + * boundary, and returns a pointer to a correctly filled memzone + * descriptor. If the allocation cannot be done or if the alignment + * or boundary are not a power of 2, returns NULL. + * Memory buffer is reserved in a way, that it wouldn't cross specified + * boundary. That implies that requested length should be less or equal + * then boundary. 
+ * + * @param name + * The name of the memzone. If it already exists, the function will + * fail and return NULL. + * @param len + * The size of the memory to be reserved. If it + * is 0, the biggest contiguous zone will be reserved. + * @param socket_id + * The socket identifier in the case of + * NUMA. The value can be SOCKET_ID_ANY if there is no NUMA + * constraint for the reserved zone. + * @param flags + * The flags parameter is used to request memzones to be + * taken from specifically sized hugepages. + * - RTE_MEMZONE_2MB - Reserved from 2MB pages + * - RTE_MEMZONE_1GB - Reserved from 1GB pages + * - RTE_MEMZONE_16MB - Reserved from 16MB pages + * - RTE_MEMZONE_16GB - Reserved from 16GB pages + * - RTE_MEMZONE_256KB - Reserved from 256KB pages + * - RTE_MEMZONE_256MB - Reserved from 256MB pages + * - RTE_MEMZONE_512MB - Reserved from 512MB pages + * - RTE_MEMZONE_4GB - Reserved from 4GB pages + * - RTE_MEMZONE_SIZE_HINT_ONLY - Allow alternative page size to be used if + * the requested page size is unavailable. + * If this flag is not set, the function + * will return error on an unavailable size + * request. + * @param align + * Alignment for resulting memzone. Must be a power of 2. + * @param bound + * Boundary for resulting memzone. Must be a power of 2 or zero. + * Zero value implies no boundary condition. + * @return + * A pointer to a correctly-filled read-only memzone descriptor, or NULL + * on error. + * On error case, rte_errno will be set appropriately: + * - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure + * - E_RTE_SECONDARY - function was called from a secondary process instance + * - ENOSPC - the maximum number of memzones has already been allocated + * - EEXIST - a memzone with the same name already exists + * - ENOMEM - no appropriate memory area found in which to create memzone + * - EINVAL - invalid parameters + */ +const struct rte_memzone *rte_memzone_reserve_bounded_contig(const char *name, + size_t len, int socket_id, + unsigned flags, unsigned align, unsigned bound); + +/** * Free a memzone. 
* * @param mz From patchwork Tue Dec 19 11:14:47 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32472 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 5F5431B208; Tue, 19 Dec 2017 12:15:31 +0100 (CET) Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by dpdk.org (Postfix) with ESMTP id 18F171B01E for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="188049082" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga005.fm.intel.com with ESMTP; 19 Dec 2017 03:14:54 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBErag003147; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErW3010327; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBErBC010319; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:47 +0000 Message-Id: <69a29e4ac2822d0c4b1f6c599b428977b2b25505.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 20/23] eal: make memzones use rte_fbarray X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" We greatly expand the memzone list, which makes some operations faster. Plus, it's already there, so we might as well use it. As part of this commit, a potential memory leak is fixed (previously, when we allocated a memzone but there was no room in the config, it was not freed back), and there's a compile fix for the ENA driver.
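Note that the public memzone API is not affected by this change; only the internal storage moves from the fixed memzone[] array to an rte_fbarray. As a purely illustrative sketch (not part of this patch), application code that walks memzones keeps working unchanged:

#include <stdio.h>
#include <rte_memzone.h>

/* callback invoked by rte_memzone_walk() for each allocated memzone */
static void
print_memzone(const struct rte_memzone *mz, void *arg)
{
	FILE *f = arg;

	fprintf(f, "memzone %s: len=0x%zx socket=%d\n",
		mz->name, mz->len, (int)mz->socket_id);
}

static void
dump_all_memzones(void)
{
	/* iterates the (now fbarray-backed) memzone list under the read lock */
	rte_memzone_walk(print_memzone, stdout);
}
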
Signed-off-by: Anatoly Burakov --- config/common_base | 2 +- drivers/net/ena/ena_ethdev.c | 10 +- lib/librte_eal/common/eal_common_memzone.c | 168 ++++++++++++++++------ lib/librte_eal/common/include/rte_eal_memconfig.h | 4 +- 4 files changed, 137 insertions(+), 47 deletions(-) diff --git a/config/common_base b/config/common_base index 9730d4c..cce464d 100644 --- a/config/common_base +++ b/config/common_base @@ -92,7 +92,7 @@ CONFIG_RTE_MAX_LCORE=128 CONFIG_RTE_MAX_NUMA_NODES=8 CONFIG_RTE_MAX_MEMSEG_LISTS=16 CONFIG_RTE_MAX_MEMSEG_PER_LIST=32768 -CONFIG_RTE_MAX_MEMZONE=2560 +CONFIG_RTE_MAX_MEMZONE=32768 CONFIG_RTE_MAX_TAILQ=32 CONFIG_RTE_ENABLE_ASSERT=n CONFIG_RTE_LOG_LEVEL=RTE_LOG_INFO diff --git a/drivers/net/ena/ena_ethdev.c b/drivers/net/ena/ena_ethdev.c index 22db895..aa37cad 100644 --- a/drivers/net/ena/ena_ethdev.c +++ b/drivers/net/ena/ena_ethdev.c @@ -249,11 +249,15 @@ static const struct eth_dev_ops ena_dev_ops = { static inline int ena_cpu_to_node(int cpu) { struct rte_config *config = rte_eal_get_configuration(); + const struct rte_fbarray *arr = &config->mem_config->memzones; + const struct rte_memzone *mz; - if (likely(cpu < RTE_MAX_MEMZONE)) - return config->mem_config->memzone[cpu].socket_id; + if (unlikely(cpu >= RTE_MAX_MEMZONE)) + return NUMA_NO_NODE; - return NUMA_NO_NODE; + mz = rte_fbarray_get(arr, cpu); + + return mz->socket_id; } static inline void ena_rx_mbuf_prepare(struct rte_mbuf *mbuf, diff --git a/lib/librte_eal/common/eal_common_memzone.c b/lib/librte_eal/common/eal_common_memzone.c index a9a4bef..58a4f25 100644 --- a/lib/librte_eal/common/eal_common_memzone.c +++ b/lib/librte_eal/common/eal_common_memzone.c @@ -58,20 +58,23 @@ static inline const struct rte_memzone * memzone_lookup_thread_unsafe(const char *name) { const struct rte_mem_config *mcfg; + const struct rte_fbarray *arr; const struct rte_memzone *mz; - unsigned i = 0; + int i = 0; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; /* * the algorithm is not optimal (linear), but there are few * zones and this function should be called at init only */ - for (i = 0; i < RTE_MAX_MEMZONE; i++) { - mz = &mcfg->memzone[i]; - if (mz->addr != NULL && !strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE)) - return &mcfg->memzone[i]; + while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) { + mz = rte_fbarray_get(arr, i++); + if (mz->addr != NULL && + !strncmp(name, mz->name, RTE_MEMZONE_NAMESIZE)) + return mz; } return NULL; @@ -81,17 +84,44 @@ static inline struct rte_memzone * get_next_free_memzone(void) { struct rte_mem_config *mcfg; - unsigned i = 0; + struct rte_fbarray *arr; + int i = 0; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; + + i = rte_fbarray_find_next_free(arr, 0); + if (i < 0) { + /* no space in config, so try expanding the list */ + int old_len = arr->len; + int new_len = old_len * 2; + new_len = RTE_MIN(new_len, arr->capacity); + + if (old_len == new_len) { + /* can't expand, the list is full */ + RTE_LOG(ERR, EAL, "%s(): no space in memzone list\n", + __func__); + return NULL; + } - for (i = 0; i < RTE_MAX_MEMZONE; i++) { - if (mcfg->memzone[i].addr == NULL) - return &mcfg->memzone[i]; - } + if (rte_fbarray_resize(arr, new_len)) { + RTE_LOG(ERR, EAL, "%s(): can't resize memzone list\n", + __func__); + return NULL; + } - return NULL; + /* ensure we have free space */ + i = rte_fbarray_find_next_free(arr, old_len); + + if (i < 0) { + RTE_LOG(ERR, EAL, "%s(): Cannot 
find room in config!\n", + __func__); + return NULL; + } + } + rte_fbarray_set_used(arr, i, true); + return rte_fbarray_get(arr, i); } /* This function will return the greatest free block if a heap has been @@ -132,14 +162,16 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, { struct rte_memzone *mz; struct rte_mem_config *mcfg; + struct rte_fbarray *arr; size_t requested_len; int socket; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; /* no more room in config */ - if (mcfg->memzone_cnt >= RTE_MAX_MEMZONE) { + if (arr->count >= arr->capacity) { RTE_LOG(ERR, EAL, "%s(): No more room in config\n", __func__); rte_errno = ENOSPC; return NULL; @@ -231,19 +263,19 @@ memzone_reserve_aligned_thread_unsafe(const char *name, size_t len, return NULL; } - const struct malloc_elem *elem = malloc_elem_from_data(mz_addr); + struct malloc_elem *elem = malloc_elem_from_data(mz_addr); /* fill the zone in config */ mz = get_next_free_memzone(); if (mz == NULL) { - RTE_LOG(ERR, EAL, "%s(): Cannot find free memzone but there is room " - "in config!\n", __func__); + RTE_LOG(ERR, EAL, "%s(): Cannot find free memzone but there is room in config!\n", + __func__); rte_errno = ENOSPC; + malloc_heap_free(elem); return NULL; } - mcfg->memzone_cnt++; snprintf(mz->name, sizeof(mz->name), "%s", name); mz->iova = rte_malloc_virt2iova(mz_addr); mz->addr = mz_addr; @@ -356,6 +388,8 @@ int rte_memzone_free(const struct rte_memzone *mz) { struct rte_mem_config *mcfg; + struct rte_fbarray *arr; + struct rte_memzone *found_mz; int ret = 0; void *addr; unsigned idx; @@ -364,21 +398,22 @@ rte_memzone_free(const struct rte_memzone *mz) return -EINVAL; mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; rte_rwlock_write_lock(&mcfg->mlock); - idx = ((uintptr_t)mz - (uintptr_t)mcfg->memzone); - idx = idx / sizeof(struct rte_memzone); + idx = rte_fbarray_find_idx(arr, mz); + found_mz = rte_fbarray_get(arr, idx); - addr = mcfg->memzone[idx].addr; + addr = found_mz->addr; if (addr == NULL) ret = -EINVAL; - else if (mcfg->memzone_cnt == 0) { + else if (arr->count == 0) { rte_panic("%s(): memzone address not NULL but memzone_cnt is 0!\n", __func__); } else { - memset(&mcfg->memzone[idx], 0, sizeof(mcfg->memzone[idx])); - mcfg->memzone_cnt--; + memset(found_mz, 0, sizeof(*found_mz)); + rte_fbarray_set_used(arr, idx, false); } rte_rwlock_write_unlock(&mcfg->mlock); @@ -412,25 +447,71 @@ rte_memzone_lookup(const char *name) void rte_memzone_dump(FILE *f) { + struct rte_fbarray *arr; struct rte_mem_config *mcfg; - unsigned i = 0; + int i = 0; /* get pointer to global configuration */ mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; rte_rwlock_read_lock(&mcfg->mlock); /* dump all zones */ - for (i=0; imemzone[i].addr == NULL) - break; - fprintf(f, "Zone %u: name:<%s>, IO:0x%"PRIx64", len:0x%zx" + while ((i = rte_fbarray_find_next_used(arr, i)) >= 0) { + void *cur_addr, *mz_end; + struct rte_memzone *mz; + struct rte_memseg_list *msl = NULL; + struct rte_memseg *ms; + int msl_idx, ms_idx; + + mz = rte_fbarray_get(arr, i); + + /* + * memzones can span multiple physical pages, so dump addresses + * of all physical pages this memzone spans. 
+ */ + + fprintf(f, "Zone %u: name:<%s>, len:0x%zx" ", virt:%p, socket_id:%"PRId32", flags:%"PRIx32"\n", i, - mcfg->memzone[i].name, - mcfg->memzone[i].iova, - mcfg->memzone[i].len, - mcfg->memzone[i].addr, - mcfg->memzone[i].socket_id, - mcfg->memzone[i].flags); + mz->name, + mz->len, + mz->addr, + mz->socket_id, + mz->flags); + + /* get pointer to appropriate memseg list */ + for (msl_idx = 0; msl_idx < RTE_MAX_MEMSEG_LISTS; msl_idx++) { + if (mcfg->memsegs[msl_idx].hugepage_sz != mz->hugepage_sz) + continue; + if (mcfg->memsegs[msl_idx].socket_id != mz->socket_id) + continue; + msl = &mcfg->memsegs[msl_idx]; + break; + } + if (!msl) { + RTE_LOG(DEBUG, EAL, "Skipping bad memzone\n"); + continue; + } + + cur_addr = RTE_PTR_ALIGN_FLOOR(mz->addr, mz->hugepage_sz); + mz_end = RTE_PTR_ADD(cur_addr, mz->len); + + ms_idx = RTE_PTR_DIFF(mz->addr, msl->base_va) / + msl->hugepage_sz; + ms = rte_fbarray_get(&msl->memseg_arr, ms_idx); + + fprintf(f, "physical pages used:\n"); + do { + fprintf(f, " addr: %p iova: 0x%" PRIx64 " len: 0x%" PRIx64 " len: 0x%" PRIx64 "\n", + cur_addr, ms->iova, ms->len, ms->hugepage_sz); + + /* advance VA to next page */ + cur_addr = RTE_PTR_ADD(cur_addr, ms->hugepage_sz); + + /* memzones occupy contiguous segments */ + ++ms; + } while (cur_addr < mz_end); + i++; } rte_rwlock_read_unlock(&mcfg->mlock); } @@ -459,9 +540,11 @@ rte_eal_memzone_init(void) rte_rwlock_write_lock(&mcfg->mlock); - /* delete all zones */ - mcfg->memzone_cnt = 0; - memset(mcfg->memzone, 0, sizeof(mcfg->memzone)); + if (rte_fbarray_alloc(&mcfg->memzones, "memzone", 256, + RTE_MAX_MEMZONE, sizeof(struct rte_memzone))) { + RTE_LOG(ERR, EAL, "Cannot allocate memzone list\n"); + return -1; + } rte_rwlock_write_unlock(&mcfg->mlock); @@ -473,14 +556,19 @@ void rte_memzone_walk(void (*func)(const struct rte_memzone *, void *), void *arg) { struct rte_mem_config *mcfg; - unsigned i; + struct rte_fbarray *arr; + int i; mcfg = rte_eal_get_configuration()->mem_config; + arr = &mcfg->memzones; + + i = 0; rte_rwlock_read_lock(&mcfg->mlock); - for (i=0; imemzone[i].addr != NULL) - (*func)(&mcfg->memzone[i], arg); + while ((i = rte_fbarray_find_next_used(arr, i)) > 0) { + struct rte_memzone *mz = rte_fbarray_get(arr, i); + (*func)(mz, arg); + i++; } rte_rwlock_read_unlock(&mcfg->mlock); } diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h index c9b57a4..8f4cc34 100644 --- a/lib/librte_eal/common/include/rte_eal_memconfig.h +++ b/lib/librte_eal/common/include/rte_eal_memconfig.h @@ -86,10 +86,8 @@ struct rte_mem_config { rte_rwlock_t qlock; /**< used for tailq operation for thread safe. */ rte_rwlock_t mplock; /**< only used by mempool LIB for thread-safe. */ - uint32_t memzone_cnt; /**< Number of allocated memzones */ - /* memory segments and zones */ - struct rte_memzone memzone[RTE_MAX_MEMZONE]; /**< Memzone descriptors. */ + struct rte_fbarray memzones; /**< Memzone descriptors. 
*/ struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS]; /**< list of dynamic arrays holding memsegs */ From patchwork Tue Dec 19 11:14:48 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32466 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 2B1691B1B0; Tue, 19 Dec 2017 12:15:21 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 9028A1B019 for ; Tue, 19 Dec 2017 12:14:56 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:55 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="3023086" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga002.fm.intel.com with ESMTP; 19 Dec 2017 03:14:54 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBErgw003151; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErAp010334; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBErkJ010330; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:48 +0000 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 21/23] mempool: add support for the new memory allocation methods X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" If a user has specified that the zone should have contiguous memory, use the new _contig allocation API's instead of normal ones. Otherwise, account for the fact that unless we're in IOVA_AS_VA mode, we cannot guarantee that the pages would be physically contiguous, so we calculate the memzone size and alignments as if we were getting the smallest page size available. Signed-off-by: Anatoly Burakov --- lib/librte_mempool/rte_mempool.c | 84 +++++++++++++++++++++++++++++++++++----- 1 file changed, 75 insertions(+), 9 deletions(-) diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c index d50dba4..4b9ab22 100644 --- a/lib/librte_mempool/rte_mempool.c +++ b/lib/librte_mempool/rte_mempool.c @@ -127,6 +127,26 @@ static unsigned optimize_object_size(unsigned obj_size) return new_obj_size * RTE_MEMPOOL_ALIGN; } +static size_t +get_min_page_size(void) { + const struct rte_mem_config *mcfg = + rte_eal_get_configuration()->mem_config; + int i; + size_t min_pagesz = SIZE_MAX; + + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + const struct rte_memseg_list *msl = &mcfg->memsegs[i]; + + if (msl->base_va == NULL) + continue; + + if (msl->hugepage_sz < min_pagesz) + min_pagesz = msl->hugepage_sz; + } + + return min_pagesz == SIZE_MAX ? 
(size_t) getpagesize() : min_pagesz; +} + static void mempool_add_elem(struct rte_mempool *mp, void *obj, rte_iova_t iova) { @@ -568,6 +588,7 @@ rte_mempool_populate_default(struct rte_mempool *mp) unsigned mz_id, n; unsigned int mp_flags; int ret; + bool force_contig, no_contig; /* mempool must not be populated */ if (mp->nb_mem_chunks != 0) @@ -582,10 +603,46 @@ rte_mempool_populate_default(struct rte_mempool *mp) /* update mempool capabilities */ mp->flags |= mp_flags; - if (rte_eal_has_hugepages()) { - pg_shift = 0; /* not needed, zone is physically contiguous */ + no_contig = mp->flags & MEMPOOL_F_NO_PHYS_CONTIG; + force_contig = mp->flags & MEMPOOL_F_CAPA_PHYS_CONTIG; + + /* + * there are several considerations for page size and page shift here. + * + * if we don't need our mempools to have physically contiguous objects, + * then just set page shift and page size to 0, because the user has + * indicated that there's no need to care about anything. + * + * if we do need contiguous objects, there is also an option to reserve + * the entire mempool memory as one contiguous block of memory, in + * which case the page shift and alignment wouldn't matter as well. + * + * if we require contiguous objects, but not necessarily the entire + * mempool reserved space to be contiguous, then there are two options. + * + * if our IO addresses are virtual, not actual physical (IOVA as VA + * case), then no page shift needed - our memory allocation will give us + * contiguous physical memory as far as the hardware is concerned, so + * act as if we're getting contiguous memory. + * + * if our IO addresses are physical, we may get memory from bigger + * pages, or we might get memory from smaller pages, and how much of it + * we require depends on whether we want bigger or smaller pages. + * However, requesting each and every memory size is too much work, so + * what we'll do instead is walk through the page sizes available, pick + * the smallest one and set up page shift to match that one. We will be + * wasting some space this way, but it's much nicer than looping around + * trying to reserve each and every page size. + */ + + if (no_contig || force_contig || rte_eal_iova_mode() == RTE_IOVA_VA) { pg_sz = 0; + pg_shift = 0; align = RTE_CACHE_LINE_SIZE; + } else if (rte_eal_has_hugepages()) { + pg_sz = get_min_page_size(); + pg_shift = rte_bsf32(pg_sz); + align = pg_sz; } else { pg_sz = getpagesize(); pg_shift = rte_bsf32(pg_sz); @@ -604,23 +661,32 @@ rte_mempool_populate_default(struct rte_mempool *mp) goto fail; } - mz = rte_memzone_reserve_aligned(mz_name, size, - mp->socket_id, mz_flags, align); - /* not enough memory, retry with the biggest zone we have */ - if (mz == NULL) - mz = rte_memzone_reserve_aligned(mz_name, 0, + if (force_contig) { + /* + * if contiguous memory for entire mempool memory was + * requested, don't try reserving again if we fail. 
+ */ + mz = rte_memzone_reserve_aligned_contig(mz_name, size, + mp->socket_id, mz_flags, align); + } else { + mz = rte_memzone_reserve_aligned(mz_name, size, mp->socket_id, mz_flags, align); + /* not enough memory, retry with the biggest zone we have */ + if (mz == NULL) + mz = rte_memzone_reserve_aligned(mz_name, 0, + mp->socket_id, mz_flags, align); + } if (mz == NULL) { ret = -rte_errno; goto fail; } - if (mp->flags & MEMPOOL_F_NO_PHYS_CONTIG) + if (no_contig) iova = RTE_BAD_IOVA; else iova = mz->iova; - if (rte_eal_has_hugepages()) + if (rte_eal_has_hugepages() && force_contig) ret = rte_mempool_populate_iova(mp, mz->addr, iova, mz->len, rte_mempool_memchunk_mz_free, From patchwork Tue Dec 19 11:14:49 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32473 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id BDDA81B22B; Tue, 19 Dec 2017 12:15:32 +0100 (CET) Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 2FEFA1B01F for ; Tue, 19 Dec 2017 12:14:57 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:56 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="160064549" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga004.jf.intel.com with ESMTP; 19 Dec 2017 03:14:54 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBErtE003154; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErVr010341; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBEriT010337; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net, Pawel Wodkowski Date: Tue, 19 Dec 2017 11:14:49 +0000 Message-Id: <2d5b9ea71a32658efa24766933d07b5d2e29f25f.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 22/23] vfio: allow to map other memory regions X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Currently it is not possible to use memory that is not owned by DPDK to perform DMA. This scenario arises in vhost applications (such as SPDK), where the guest sends its own memory table. To fill this gap, provide an API to allow registering arbitrary addresses in the VFIO container.
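To make the intended usage concrete, below is a minimal sketch (illustration only, not part of the diff) of registering an externally allocated region with the VFIO container through the rte_iommu_dma_map()/rte_iommu_dma_unmap() calls added by this patch. The caller is assumed to already know the region's virtual address, IOVA and length (e.g. from a guest memory table); the prototypes are repeated here because the RFC does not show which public header will export them:

#include <stdint.h>

/* added by this patch; the exporting public header is an assumption */
int rte_iommu_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
int rte_iommu_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);

static int
register_external_region(void *addr, uint64_t iova, uint64_t len)
{
	uint64_t vaddr = (uint64_t)(uintptr_t)addr;

	/* make the externally owned memory visible to the IOMMU */
	if (rte_iommu_dma_map(vaddr, iova, len) < 0)
		return -1;

	/* ... the device may now DMA to/from [iova, iova + len) ... */

	/* drop the mapping when the region is torn down */
	return rte_iommu_dma_unmap(vaddr, iova, len);
}
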
Signed-off-by: Pawel Wodkowski Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_vfio.c | 150 ++++++++++++++++++++++++++++----- lib/librte_eal/linuxapp/eal/eal_vfio.h | 11 +++ 2 files changed, 140 insertions(+), 21 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c index 09dfc68..15d28ad 100644 --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c @@ -40,6 +40,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_vfio.h" @@ -51,17 +52,35 @@ static struct vfio_config vfio_cfg; static int vfio_type1_dma_map(int); +static int vfio_type1_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int); static int vfio_spapr_dma_map(int); static int vfio_noiommu_dma_map(int); +static int vfio_noiommu_dma_mem_map(int, uint64_t, uint64_t, uint64_t, int); /* IOMMU types we support */ static const struct vfio_iommu_type iommu_types[] = { /* x86 IOMMU, otherwise known as type 1 */ - { RTE_VFIO_TYPE1, "Type 1", &vfio_type1_dma_map}, + { + .type_id = RTE_VFIO_TYPE1, + .name = "Type 1", + .dma_map_func = &vfio_type1_dma_map, + .dma_user_map_func = &vfio_type1_dma_mem_map + }, /* ppc64 IOMMU, otherwise known as spapr */ - { RTE_VFIO_SPAPR, "sPAPR", &vfio_spapr_dma_map}, + { + .type_id = RTE_VFIO_SPAPR, + .name = "sPAPR", + .dma_map_func = &vfio_spapr_dma_map, + .dma_user_map_func = NULL + // TODO: work with PPC64 people on enabling this, window size! + }, /* IOMMU-less mode */ - { RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map}, + { + .type_id = RTE_VFIO_NOIOMMU, + .name = "No-IOMMU", + .dma_map_func = &vfio_noiommu_dma_map, + .dma_user_map_func = &vfio_noiommu_dma_mem_map + }, }; int @@ -362,9 +381,10 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr, */ if (internal_config.process_type == RTE_PROC_PRIMARY && vfio_cfg.vfio_active_groups == 1) { + const struct vfio_iommu_type *t; + /* select an IOMMU type which we will be using */ - const struct vfio_iommu_type *t = - vfio_set_iommu_type(vfio_cfg.vfio_container_fd); + t = vfio_set_iommu_type(vfio_cfg.vfio_container_fd); if (!t) { RTE_LOG(ERR, EAL, " %s failed to select IOMMU type\n", @@ -382,6 +402,8 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr, clear_group(vfio_group_fd); return -1; } + + vfio_cfg.vfio_iommu_type = t; } } @@ -694,13 +716,52 @@ vfio_get_group_no(const char *sysfs_base, } static int +vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova, + uint64_t len, int do_map) +{ + struct vfio_iommu_type1_dma_map dma_map; + struct vfio_iommu_type1_dma_unmap dma_unmap; + int ret; + + if (do_map != 0) { + memset(&dma_map, 0, sizeof(dma_map)); + dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); + dma_map.vaddr = vaddr; + dma_map.size = len; + dma_map.iova = iova; + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + + ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); + if (ret) { + RTE_LOG(ERR, EAL, " cannot set up DMA remapping, error %i (%s)\n", + errno, strerror(errno)); + return -1; + } + + } else { + memset(&dma_unmap, 0, sizeof(dma_unmap)); + dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap); + dma_unmap.size = len; + dma_unmap.iova = iova; + + ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap); + if (ret) { + RTE_LOG(ERR, EAL, " cannot clear DMA remapping, error %i (%s)\n", + errno, strerror(errno)); + return -1; + } + } + + return 0; +} + +static int vfio_type1_dma_map(int vfio_container_fd) { 
- int i, ret; + int i; /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { - struct vfio_iommu_type1_dma_map dma_map; const struct rte_memseg_list *msl; const struct rte_fbarray *arr; int ms_idx, next_idx; @@ -727,21 +788,9 @@ vfio_type1_dma_map(int vfio_container_fd) len = ms->hugepage_sz; hw_addr = ms->iova; - memset(&dma_map, 0, sizeof(dma_map)); - dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); - dma_map.vaddr = addr; - dma_map.size = len; - dma_map.iova = hw_addr; - dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; - - ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); - - if (ret) { - RTE_LOG(ERR, EAL, " cannot set up DMA remapping, " - "error %i (%s)\n", errno, - strerror(errno)); + if (vfio_type1_dma_mem_map(vfio_container_fd, addr, + hw_addr, len, 1)) return -1; - } } } @@ -892,6 +941,49 @@ vfio_noiommu_dma_map(int __rte_unused vfio_container_fd) return 0; } +static int +vfio_noiommu_dma_mem_map(int __rte_unused vfio_container_fd, + uint64_t __rte_unused vaddr, + uint64_t __rte_unused iova, uint64_t __rte_unused len, + int __rte_unused do_map) +{ + /* No-IOMMU mode does not need DMA mapping */ + return 0; +} + +static int +vfio_dma_mem_map(uint64_t vaddr, uint64_t iova, uint64_t len, int do_map) +{ + const struct vfio_iommu_type *t = vfio_cfg.vfio_iommu_type; + + if (!t) { + RTE_LOG(ERR, EAL, " VFIO support not initialized\n"); + return -1; + } + + if (!t->dma_user_map_func) { + RTE_LOG(ERR, EAL, + " VFIO custom DMA region maping not supported by IOMMU %s\n", + t->name); + return -1; + } + + return t->dma_user_map_func(vfio_cfg.vfio_container_fd, vaddr, iova, + len, do_map); +} + +int +rte_iommu_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len) +{ + return vfio_dma_mem_map(vaddr, iova, len, 1); +} + +int +rte_iommu_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len) +{ + return vfio_dma_mem_map(vaddr, iova, len, 0); +} + int rte_vfio_noiommu_is_enabled(void) { @@ -911,4 +1003,20 @@ rte_vfio_noiommu_is_enabled(void) return ret; } +#else + +int +rte_iommu_dma_map(uint64_t __rte_unused vaddr, __rte_unused uint64_t iova, + __rte_unused uint64_t len) +{ + return 0; +} + +int +rte_iommu_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova, + __rte_unused uint64_t len) +{ + return 0; +} + #endif diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h index ba7892b..bb669f0 100644 --- a/lib/librte_eal/linuxapp/eal/eal_vfio.h +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h @@ -48,6 +48,7 @@ #ifdef VFIO_PRESENT +#include #include #define RTE_VFIO_TYPE1 VFIO_TYPE1_IOMMU @@ -139,6 +140,7 @@ struct vfio_config { int vfio_enabled; int vfio_container_fd; int vfio_active_groups; + const struct vfio_iommu_type *vfio_iommu_type; struct vfio_group vfio_groups[VFIO_MAX_GROUPS]; }; @@ -148,9 +150,18 @@ struct vfio_config { * */ typedef int (*vfio_dma_func_t)(int); +/* Custom memory region DMA mapping function prototype. + * Takes VFIO container fd, virtual address, phisical address, length and + * operation type (0 to unmap 1 for map) as a parameters. + * Returns 0 on success, -1 on error. 
+ **/ +typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova, + uint64_t len, int do_map); + struct vfio_iommu_type { int type_id; const char *name; + vfio_dma_user_func_t dma_user_map_func; vfio_dma_func_t dma_map_func; }; From patchwork Tue Dec 19 11:14:50 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 32474 Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 4D9721B228; Tue, 19 Dec 2017 12:15:34 +0100 (CET) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 77E221D7 for ; Tue, 19 Dec 2017 12:14:57 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 19 Dec 2017 03:14:56 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.45,426,1508828400"; d="scan'208";a="4054829" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga006.jf.intel.com with ESMTP; 19 Dec 2017 03:14:54 -0800 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id vBJBEr7c003157; Tue, 19 Dec 2017 11:14:53 GMT Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id vBJBErJY010348; Tue, 19 Dec 2017 11:14:53 GMT Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id vBJBErCp010344; Tue, 19 Dec 2017 11:14:53 GMT From: Anatoly Burakov To: dev@dpdk.org Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com, thomas@monjalon.net Date: Tue, 19 Dec 2017 11:14:50 +0000 Message-Id: <4d6dbd867adc051a3cf444d365c2aae604821f77.1513681966.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [RFC v2 23/23] eal: map/unmap memory with VFIO when alloc/free pages X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Signed-off-by: Anatoly Burakov --- lib/librte_eal/linuxapp/eal/eal_memalloc.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 13172a0..8b3f219 100755 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -61,6 +61,7 @@ #include #include #include +#include #include "eal_filesystem.h" #include "eal_internal_cfg.h" @@ -259,6 +260,11 @@ alloc_page(struct rte_memseg *ms, void *addr, uint64_t size, int socket_id, ms->iova = iova; ms->socket_id = socket_id; + /* map the segment so that VFIO has access to it */ + if (rte_iommu_dma_map(ms->addr_64, iova, size)) { + RTE_LOG(DEBUG, EAL, "Cannot register segment with VFIO\n"); + } + goto out; mapped: @@ -295,6 +301,11 @@ free_page(struct rte_memseg *ms, struct hugepage_info *hi, unsigned list_idx, list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); } + /* unmap the segment from VFIO */ + if (rte_iommu_dma_unmap(ms->addr_64, ms->iova, ms->len)) { + RTE_LOG(DEBUG, EAL, "Cannot unregister segment with 
VFIO\n"); + } + munmap(ms->addr, ms->hugepage_sz); // TODO: race condition?