On 13-Dec-18 9:55 PM, Jim Harris wrote:
> SPDK uses the rte_mem_event_callback_register API to
> create RDMA memory regions (MRs) for newly allocated regions
> of memory. This is used in both the SPDK NVMe-oF target
> and the NVMe-oF host driver.
Hi Jim,
Here and in other places it appears that you're adding double spaces
after the end of a sentence. We generally don't do that in DPDK.
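For context, here is a minimal sketch of the kind of mem event callback the
commit message describes, assuming libibverbs (ibv_reg_mr()/ibv_dereg_mr())
for the MR calls; g_pd and the mr_table_*() helpers are hypothetical
placeholders for the application's own bookkeeping:

#include <rte_common.h>
#include <rte_memory.h>
#include <infiniband/verbs.h>

extern struct ibv_pd *g_pd; /* hypothetical protection domain */
extern void mr_table_insert(const void *addr, size_t len, struct ibv_mr *mr);
extern struct ibv_mr *mr_table_remove(const void *addr, size_t len);

static void
rdma_mem_event_cb(enum rte_mem_event event_type, const void *addr,
		size_t len, void *arg __rte_unused)
{
	if (event_type == RTE_MEM_EVENT_ALLOC) {
		/* register one MR covering the newly allocated range */
		struct ibv_mr *mr = ibv_reg_mr(g_pd,
				(void *)(uintptr_t)addr, len,
				IBV_ACCESS_LOCAL_WRITE);

		if (mr != NULL)
			mr_table_insert(addr, len, mr);
	} else {
		/* RTE_MEM_EVENT_FREE: ibv_dereg_mr() is all-or-nothing,
		 * so this only works when DPDK frees exactly the range
		 * it reported in the original RTE_MEM_EVENT_ALLOC.
		 */
		struct ibv_mr *mr = mr_table_remove(addr, len);

		if (mr != NULL)
			ibv_dereg_mr(mr);
	}
}

static int
rdma_register_mem_event_cb(void)
{
	return rte_mem_event_callback_register("rdma-mr",
			rdma_mem_event_cb, NULL);
}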
>
> DPDK creates internal malloc_elem structures for these
> allocated regions. As users malloc and free memory, DPDK
> will sometimes merge malloc_elems that originated from
> different allocations that were notified through the
> registered mem_event callback routine. This results
> in subsequent allocations that can span across multiple
> RDMA MRs. This requires SPDK to check each DPDK buffer to
> see if it crosses an MR boundary, and if so, would have to
> add considerable logic and complexity to describe that
> buffer before it can be accessed by the RNIC. It is somewhat
> analogous to rte_malloc returning a buffer that is not
> IOVA-contiguous.
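Without this patch, a hedged sketch of the per-buffer check an application
would otherwise need (mr_table_lookup() is a hypothetical helper returning
the MR containing a given address):

#include <stdbool.h>
#include <stdint.h>
#include <infiniband/verbs.h>

extern struct ibv_mr *mr_table_lookup(const void *addr); /* hypothetical */

/* does [buf, buf + len) fit entirely inside a single registered MR? */
static bool
buf_within_one_mr(const void *buf, size_t len)
{
	struct ibv_mr *mr = mr_table_lookup(buf);

	if (mr == NULL)
		return false;
	return (const uint8_t *)buf + len <=
			(const uint8_t *)mr->addr + mr->length;
}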
>
> As a malloc_elem gets split and some of these elements
> get freed, it can also result in DPDK sending an
> RTE_MEM_EVENT_FREE notification for a subset of the
> original RTE_MEM_EVENT_ALLOC notification. This is also
> problematic for RDMA memory regions, since unregistering
> the memory region is all-or-nothing. It is not possible
> to unregister part of a memory region.
>
> To support these types of applications, this patch adds
> a new --match-allocations EAL init flag. When this
> flag is specified, malloc elements from different
> hugepage allocations will never be merged. Memory will
> also only be freed back to the system (with the requisite
> memory event callback) exactly as it was originally
> allocated.
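As a hedged illustration only, an application could inject the new flag into
the arguments it passes to rte_eal_init(); the argv and core list below are
placeholders, not a prescribed usage:

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>

int
main(int argc, char **argv)
{
	/* build an EAL argv for illustration; the core list is arbitrary */
	char *eal_argv[] = { argv[0], "--match-allocations", "-l", "0-1" };

	(void)argc;
	if (rte_eal_init(4, eal_argv) < 0)
		rte_exit(EXIT_FAILURE, "cannot init EAL\n");
	/* ... register mem event callbacks, allocate, run ... */
	return 0;
}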
>
> Since part of this patch is extending the size of struct
> malloc_elem, we also fix up the malloc autotests so they
> do not assume its size exactly fits in one cacheline.
>
> Signed-off-by: Jim Harris <james.r.harris@intel.com>
> ---
> doc/guides/prog_guide/env_abstraction_layer.rst | 13 +++++++++++++
> lib/librte_eal/common/eal_common_options.c | 6 ++++++
> lib/librte_eal/common/eal_internal_cfg.h | 1 +
> lib/librte_eal/common/eal_options.h | 2 ++
> lib/librte_eal/common/malloc_elem.c | 16 ++++++++++++----
> lib/librte_eal/common/malloc_elem.h | 6 +++++-
> lib/librte_eal/common/malloc_heap.c | 13 ++++++++++++-
> lib/librte_eal/linuxapp/eal/eal.c | 5 +++++
> test/test/test_malloc.c | 25 +++++++++++++++++++------
> 9 files changed, 75 insertions(+), 12 deletions(-)
>
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 8b5d050c7..c4032161f 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -169,6 +169,19 @@ not allow acquiring or releasing hugepages from the system at runtime.
> If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
> hugepage memory will be preallocated.
>
> ++ Hugepage allocation matching
> +
> +This behavior is enabled by specifying the ``--match-allocations`` command-line
> +switch to the EAL. This switch is mutually exclusive with ``--legacy-mem``.
I wouldn't say it's "mutually exclusive" so much as simply unsupported,
because there's no dynamic memory allocation in legacy mode (it's
unsupported with --no-huge as well). It is also a Linux-only flag.
Please also update doc/guides/linux_gsg/linux_eal_parameters.rst to have
this parameter. It would also be a good idea to add a note to
doc/guides/rel_notes/release_19.02.rst.
> +
> +Some applications using memory event callbacks may require that hugepages be
> +freed exactly as they were allocated. These applications may also require
> +that any allocation from the malloc heap not span across allocations
> +associated with two different memory event callbacks. Hugepage allocation
> +matching can be used by these types of applications to satisfy both of these
> +requirements. This can result in some increased memory usage which is
> +very dependent on the memory allocation patterns of the application.
> +
> + 32-bit support
>
> Additional restrictions are present when running in 32-bit mode. In dynamic
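As a usage note, the documented switch is passed like any other EAL flag; a
hypothetical invocation on a two-socket machine ("app" and the memory amounts
are placeholders):

./app --match-allocations -l 0-3 --socket-mem 1024,1024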
> diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
> index e31eca5c0..034447929 100644
> --- a/lib/librte_eal/common/eal_common_options.c
> +++ b/lib/librte_eal/common/eal_common_options.c
> @@ -79,6 +79,7 @@ eal_long_options[] = {
> {OPT_VMWARE_TSC_MAP, 0, NULL, OPT_VMWARE_TSC_MAP_NUM },
> {OPT_LEGACY_MEM, 0, NULL, OPT_LEGACY_MEM_NUM },
> {OPT_SINGLE_FILE_SEGMENTS, 0, NULL, OPT_SINGLE_FILE_SEGMENTS_NUM},
> + {OPT_MATCH_ALLOCATIONS, 0, NULL, OPT_MATCH_ALLOCATIONS_NUM},
> {0, 0, NULL, 0 }
> };
>
> @@ -1424,6 +1425,11 @@ eal_check_common_options(struct internal_config *internal_cfg)
> "with --"OPT_IN_MEMORY"\n");
> return -1;
> }
> + if (internal_cfg->legacy_mem && internal_cfg->match_allocations) {
> + RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
> + "with --"OPT_MATCH_ALLOCATIONS"\n");
> + return -1;
> + }
Probably should add a check for --no-huge as well.
Everything else looks OK to me, so, provided the above comments are
addressed,
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -54,6 +54,7 @@ struct internal_config {
volatile uint64_t socket_limit[RTE_MAX_NUMA_NODES]; /**< limit amount of memory per socket */
uintptr_t base_virtaddr; /**< base address to try and reserve memory from */
volatile unsigned legacy_mem;
+ volatile unsigned match_allocations; /**< true to free hugepages exactly as allocated */
/**< true to enable legacy memory behavior (no dynamic allocation,
* IOVA-contiguous segments).
*/
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -65,6 +65,8 @@ enum {
OPT_SINGLE_FILE_SEGMENTS_NUM,
#define OPT_IOVA_MODE "iova-mode"
OPT_IOVA_MODE_NUM,
+#define OPT_MATCH_ALLOCATIONS "match-allocations"
+ OPT_MATCH_ALLOCATIONS_NUM,
OPT_LONG_MAX_NUM
};
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
*/
void
malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
- struct rte_memseg_list *msl, size_t size)
+ struct rte_memseg_list *msl, size_t size,
+ struct malloc_elem *orig_elem, size_t orig_size)
{
elem->heap = heap;
elem->msl = msl;
@@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
elem->state = ELEM_FREE;
elem->size = size;
elem->pad = 0;
+ elem->orig_elem = orig_elem;
+ elem->orig_size = orig_size;
set_header(elem);
set_trailer(elem);
}
@@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
const size_t new_elem_size = elem->size - old_elem_size;
- malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
+ malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
+ elem->orig_elem, elem->orig_size);
split_pt->prev = elem;
split_pt->next = next_elem;
if (next_elem)
@@ -317,14 +321,18 @@ static int
next_elem_is_adjacent(struct malloc_elem *elem)
{
return elem->next == RTE_PTR_ADD(elem, elem->size) &&
- elem->next->msl == elem->msl;
+ elem->next->msl == elem->msl &&
+ (!internal_config.match_allocations ||
+ elem->orig_elem == elem->next->orig_elem);
}
static int
prev_elem_is_adjacent(struct malloc_elem *elem)
{
return elem == RTE_PTR_ADD(elem->prev, elem->prev->size) &&
- elem->prev->msl == elem->msl;
+ elem->prev->msl == elem->msl &&
+ (!internal_config.match_allocations ||
+ elem->orig_elem == elem->prev->orig_elem);
}
/*
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -32,6 +32,8 @@ struct malloc_elem {
volatile enum elem_state state;
uint32_t pad;
size_t size;
+ struct malloc_elem *orig_elem;
+ size_t orig_size;
#ifdef RTE_MALLOC_DEBUG
uint64_t header_cookie; /* Cookie marking start of data */
/* trailer cookie at start + size */
@@ -116,7 +118,9 @@ void
malloc_elem_init(struct malloc_elem *elem,
struct malloc_heap *heap,
struct rte_memseg_list *msl,
- size_t size);
+ size_t size,
+ struct malloc_elem *orig_elem,
+ size_t orig_size);
void
malloc_elem_insert(struct malloc_elem *elem);
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
{
struct malloc_elem *elem = start;
- malloc_elem_init(elem, heap, msl, len);
+ malloc_elem_init(elem, heap, msl, len, elem, len);
malloc_elem_insert(elem);
@@ -857,6 +857,13 @@ malloc_heap_free(struct malloc_elem *elem)
if (elem->size < page_sz)
goto free_unlock;
+ /* if user requested to match allocations, the sizes must match - if not,
+ * we will defer freeing these hugepages until the entire original allocation
+ * can be freed
+ */
+ if (internal_config.match_allocations && elem->size != elem->orig_size)
+ goto free_unlock;
+
/* probably, but let's make sure, as we may not be using up full page */
start = elem;
len = elem->size;
@@ -1259,6 +1266,10 @@ rte_eal_malloc_heap_init(void)
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
unsigned int i;
+ if (internal_config.match_allocations) {
+ RTE_LOG(DEBUG, EAL, "Hugepages will be freed exactly as allocated.\n");
+ }
+
if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
/* assign min socket ID to external heaps */
mcfg->next_socket_id = EXTERNAL_HEAP_MIN_SOCKET_ID;
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -431,6 +431,7 @@ eal_usage(const char *prgname)
" --"OPT_VFIO_INTR" Interrupt mode for VFIO (legacy|msi|msix)\n"
" --"OPT_LEGACY_MEM" Legacy memory mode (no dynamic allocation, contiguous segments)\n"
" --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
+ " --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
"\n");
/* Allow the application to print its usage message too if hook is set */
if ( rte_application_usage_hook ) {
@@ -699,6 +700,10 @@ eal_parse_args(int argc, char **argv)
strdup(optarg);
break;
+ case OPT_MATCH_ALLOCATIONS_NUM:
+ internal_config.match_allocations = 1;
+ break;
+
default:
if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
RTE_LOG(ERR, EAL, "Option %c is not supported "
--- a/test/test/test_malloc.c
+++ b/test/test/test_malloc.c
@@ -262,13 +262,26 @@ test_multi_alloc_statistics(void)
struct rte_malloc_socket_stats pre_stats, post_stats ,first_stats, second_stats;
size_t size = 2048;
int align = 1024;
-#ifndef RTE_MALLOC_DEBUG
- int trailer_size = 0;
-#else
- int trailer_size = RTE_CACHE_LINE_SIZE;
-#endif
- int overhead = RTE_CACHE_LINE_SIZE + trailer_size;
+ int overhead = 0;
+
+ /* Dynamically calculate the overhead by allocating one cacheline and
+ * then comparing what was allocated from the heap.
+ */
+ rte_malloc_get_socket_stats(socket, &pre_stats);
+
+ void *dummy = rte_malloc_socket(NULL, RTE_CACHE_LINE_SIZE, 0, socket);
+ if (dummy == NULL)
+ return -1;
+
+ rte_malloc_get_socket_stats(socket, &post_stats);
+
+ /* after subtracting cache line, remainder is overhead */
+ overhead = post_stats.heap_allocsz_bytes - pre_stats.heap_allocsz_bytes;
+ overhead -= RTE_CACHE_LINE_SIZE;
+
+ rte_free(dummy);
+ /* Now start the real tests */
rte_malloc_get_socket_stats(socket, &pre_stats);
void *p1 = rte_malloc_socket("stats", size , align, socket);