[RFC,v2] eal/linux: add support for fast virt/iova translation
Commit Message
This patch maps hugepage memory such that address translation from
virtual to iova or vice versa can be done by simple addition/
subtraction of a constant value without any page table walks.
A new '--const-translate' EAL option is added to enable this mode.
The following example describes how this works:
Say you have a system with 4 huge pages that are 1G each and the
physical addresses are 10, 11, 17 and 22G. If we map 13G of virtual
address space, that will be enough to cover all of the huge page
physical addresses.
If the VA region starts at 1G, all of the hugepage PAs can
be mapped into that region as shown below under Proposed
heading. For comparison, existing mapping that would be
done in legacy mode is shown under the Current heading.
Proposed Current (Legacy)
VA | PA VA | PA
----+---- ----+----
1G | 10G 1G | 10G
2G | 11G 2G | 11G
3G | - 3G | -
4G | - 4G | 17G
5G | - 5G | -
6G | - 6G | 22G
7G | -
8G | 17G
9G | -
10G | -
11G | -
12G | -
13G | 22G
So in this example, we have a fixed offset of 9G to translate
between VA and PA or vice versa. This works whether the huge
pages happen to be allocated statically (legacy mode) or
dynamically.
The unused VA address space from 3G-7G and 9G-12G can be
unmapped in just two unmap calls.
This patch applies to legacy-mem mode only.
---
doc/guides/linux_gsg/eal_args.include.rst | 5 +
.../prog_guide/env_abstraction_layer.rst | 10 ++
lib/eal/common/eal_common_options.c | 6 +
lib/eal/common/eal_internal_cfg.h | 2 +
lib/eal/common/eal_options.h | 2 +
lib/eal/include/rte_memory.h | 28 +++
lib/eal/linux/eal.c | 6 +
lib/eal/linux/eal_memory.c | 165 +++++++++++++++++-
8 files changed, 218 insertions(+), 6 deletions(-)
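To make the arithmetic concrete: with the layout above, translation in either direction is a single addition or subtraction. A minimal standalone sketch of the example (pure arithmetic, no DPDK APIs; the names and the sign convention `delta = PA - VA` are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* In the example above the VA region starts at 1G and the lowest
 * hugepage PA is 10G, so PA - VA == 9G for every mapped hugepage byte. */
static const uint64_t va_pa_delta = 9 * GIB;

/* VA -> IOVA: add the constant delta. */
static uint64_t virt2iova(uint64_t va) { return va + va_pa_delta; }

/* IOVA -> VA: subtract the constant delta. */
static uint64_t iova2virt(uint64_t pa) { return pa - va_pa_delta; }
```

For instance, VA 8G maps to PA 17G and VA 13G to PA 22G, matching the "Proposed" column above.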
Comments
> From: Don Wallwork [mailto:donw@xsightlabs.com]
> Sent: Wednesday, 14 September 2022 23.12
>
> This patch maps hugepage memory such that address translation from
> virtual to iova or vice versa can be done by simple addition/
> subtraction of a constant value without any page table walks.
>
> A new '--const-translate' EAL option is added to enable this mode.
>
> The following example describes how this works:
>
> Say you have a system with 4 huge pages that are 1G each and the
> physical addresses are 10, 11, 17 and 22G. If we map 13G of virtual
> address space, that will be enough to cover all of the huge page
> physical addresses.
>
> If the VA region starts at 1G, all of the hugepage PAs can
> be mapped into that region as shown below under Proposed
> heading. For comparison, existing mapping that would be
> done in legacy mode is shown under the Current heading.
>
> Proposed Current (Legacy)
>
> VA | PA VA | PA
> ----+---- ----+----
> 1G | 10G 1G | 10G
> 2G | 11G 2G | 11G
> 3G | - 3G | -
> 4G | - 4G | 17G
> 5G | - 5G | -
> 6G | - 6G | 22G
> 7G | -
> 8G | 17G
> 9G | -
> 10G | -
> 11G | -
> 12G | -
> 13G | 22G
>
> So in this example, we have a fixed offset of 9G to translate
> between VA and PA or vice versa. This works whether the huge
> pages happen to be allocated statically (legacy mode) or
> dynamically.
>
> The unused VA address space from 3G-7G and 9G-12G can be
> unmapped in just two unmap calls.
>
> This patch applies to legacy-mem mode only.
> ---
[...]
> diff --git a/lib/eal/include/rte_memory.h
> b/lib/eal/include/rte_memory.h
> index 68b069fd04..c87777ca01 100644
> --- a/lib/eal/include/rte_memory.h
> +++ b/lib/eal/include/rte_memory.h
> @@ -134,6 +134,34 @@ rte_iova_t rte_mem_virt2iova(const void *virt);
> void *
> rte_mem_iova2virt(rte_iova_t iova);
>
> +/**
> + * Get IO virtual address of any mapped virtual address in the current
> + * process.
> + *
> + * @note This function provides a fast implementation of virtual to
> + *   physical address translation that does not walk any page tables.
> + *   Suitable for use in data plane threads.
> + *
> + * @param virt
> + *   The virtual address.
> + * @return
> + *   The IO address or RTE_BAD_IOVA on error.
> + */
> +rte_iova_t rte_mem_fast_virt2iova(const void *virt);
> +
> +/**
> + * Get virtual memory address corresponding to iova address.
> + *
> + * @note This function provides a fast implementation of physical to
> + *   virtual address translation. Suitable for use in data plane threads.
> + *
> + * @param iova
> + *   The iova address.
> + * @return
> + *   Virtual address corresponding to iova address (or NULL if address
> + *   does not exist within DPDK memory map).
> + */
> +void *rte_mem_fast_iova2virt(rte_iova_t iova);
> +
> /**
> * Get memseg to which a particular virtual address belongs.
> *
rte_mem_fast_virt2iova() and rte_mem_fast_iova2virt() serve the same purpose as rte_mem_virt2iova() and rte_mem_iova2virt(). Can't this alternative algorithm be implemented in the existing functions, or is there a reason for adding separate functions?
Minor detail about names: The DPDK community has a general aversion against using "fast" in function names - because everything in DPDK is supposed to be fast. If they need to be separate, perhaps use "_fp" (for fast path) at the end of their names, like the Trace library's RTE_TRACE_POINT_FP() variant of the RTE_TRACE_POINT() macro.
If these two functions need to be really fast, they could be inline instead of proper functions. In my opinion, they are short enough to allow inlining. And, if merged with the original functions, the fast code path could be inline and fall back to calling the original, slower functions (which would then need to be renamed). Others might disagree about inlining. Reusing the existing names might also probably break the ABI.
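One possible shape for the inline-with-fallback idea described above might look like the following. All names and values here are illustrative stand-ins, not the actual patch or EAL internals; the stub takes the place of the real page-table-walking slow path:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t rte_iova_t;

/* Illustrative: delta and enable flag would be set at EAL init time
 * (the values below are made up for this sketch). */
static const uintptr_t const_va_pa_delta = 0x1000; /* VA - PA */
static const int const_translate_enabled = 1;

/* Stand-in for the existing, slower page-table-walking function
 * (which would need to be renamed if the two were merged). */
static rte_iova_t slow_virt2iova(const void *virt)
{
	(void)virt;
	return 0; /* placeholder for the real page-table walk */
}

/* Inline fast path: a constant subtraction when enabled, falling
 * back to the out-of-line slow path otherwise. */
static inline rte_iova_t virt2iova_fp(const void *virt)
{
	if (const_translate_enabled)
		return (rte_iova_t)((uintptr_t)virt - const_va_pa_delta);
	return slow_virt2iova(virt);
}
```

The inline wrapper stays tiny enough to disappear into the caller, while the fallback keeps a single out-of-line copy of the slow code.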
From a high level perspective - and referring to my previous argument about optimizations not being features [1] - I wonder if this patch should be a build time option instead of a runtime feature?
@Don: My preference for build time options is somewhat controversial, so don't waste time converting to a build time option before the community has provided feedback on the subject.
[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D870EB@smartserver.smartshare.dk/
[...]
> diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
> index c890c42106..3fefb3dc9d 100644
> --- a/lib/eal/linux/eal_memory.c
> +++ b/lib/eal/linux/eal_memory.c
> @@ -148,6 +148,47 @@ rte_mem_virt2iova(const void *virtaddr)
> return rte_mem_virt2phy(virtaddr);
> }
>
> +static void *const_va_pa_delta;
This variable is primarily used as an address offset, not a pointer, so consider making it "uintptr_t" instead of "void *".
> +
> +#ifdef RTE_MEM_SANITY_CHECK
> +#define __rte_mem_validate(v) rte_mem_validate(v)
> +
> +static int rte_mem_validate(const void *virtaddr)
> +{
> + if (!rte_mem_virt2memseg(virtaddr, NULL)) {
> + RTE_LOG(ERR, EAL, "Invalid virtual address %p\n", virtaddr);
> + return -1;
> + }
> + return 0;
> +}
> +#else
> +#define __rte_mem_validate(v) 0
> +#endif
> +
> +rte_iova_t rte_mem_fast_virt2iova(const void *virtaddr)
> +{
> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
> + return (uintptr_t)virtaddr;
> +
> + if (__rte_mem_validate(virtaddr) != 0)
> + return RTE_BAD_IOVA;
> +
> + return (rte_iova_t)((uintptr_t)virtaddr - (uintptr_t)const_va_pa_delta);
You can probably use RTE_PTR_SUB() instead.
> +}
> +
> +void *rte_mem_fast_iova2virt(rte_iova_t iova)
> +{
> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
> + return (void *)(uintptr_t)iova;
> +
> + void *virtaddr = (void *)((uintptr_t)const_va_pa_delta + iova);
You can probably use RTE_PTR_ADD() instead.
> +
> + if (__rte_mem_validate(virtaddr) != 0)
> + return NULL;
> +
> + return virtaddr;
> +}
> +
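Applying both review suggestions (a `uintptr_t` delta plus the `RTE_PTR_ADD()`/`RTE_PTR_SUB()` helpers), the translation bodies might reduce to something like the sketch below. The two macros are reproduced locally from rte_common.h so the example is self-contained, and the delta value is made up:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t rte_iova_t;

/* Local copies of the rte_common.h pointer-arithmetic helpers. */
#define RTE_PTR_ADD(ptr, x) ((void *)((uintptr_t)(ptr) + (x)))
#define RTE_PTR_SUB(ptr, x) ((void *)((uintptr_t)(ptr) - (x)))

/* Offset as uintptr_t rather than void *, per the review comment.
 * The value is illustrative only. */
static const uintptr_t const_va_pa_delta = 0x100000; /* VA - PA */

static rte_iova_t virt2iova(const void *virtaddr)
{
	return (rte_iova_t)(uintptr_t)RTE_PTR_SUB(virtaddr, const_va_pa_delta);
}

static void *iova2virt(rte_iova_t iova)
{
	return RTE_PTR_ADD((void *)const_va_pa_delta, iova);
}
```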
On 9/15/2022 3:40 AM, Morten Brørup wrote:
>> From: Don Wallwork [mailto:donw@xsightlabs.com]
>> Sent: Wednesday, 14 September 2022 23.12
>>
>> This patch maps hugepage memory such that address translation from
>> virtual to iova or vice versa can be done by simple addition/
>> subtraction of a constant value without any page table walks.
>>
>> A new '--const-translate' EAL option is added to enable this mode.
[...]
>> diff --git a/lib/eal/include/rte_memory.h
>> b/lib/eal/include/rte_memory.h
>> index 68b069fd04..c87777ca01 100644
>> --- a/lib/eal/include/rte_memory.h
>> +++ b/lib/eal/include/rte_memory.h
>> @@ -134,6 +134,34 @@ rte_iova_t rte_mem_virt2iova(const void *virt);
>> void *
>> rte_mem_iova2virt(rte_iova_t iova);
>>
>> +/**
>> + * Get IO virtual address of any mapped virtual address in the current
>> + * process.
>> + *
>> + * @note This function provides a fast implementation of virtual to
>> + *   physical address translation that does not walk any page tables.
>> + *   Suitable for use in data plane threads.
>> + *
>> + * @param virt
>> + *   The virtual address.
>> + * @return
>> + *   The IO address or RTE_BAD_IOVA on error.
>> + */
>> +rte_iova_t rte_mem_fast_virt2iova(const void *virt);
>> +
>> +/**
>> + * Get virtual memory address corresponding to iova address.
>> + *
>> + * @note This function provides a fast implementation of physical to
>> + *   virtual address translation. Suitable for use in data plane threads.
>> + *
>> + * @param iova
>> + *   The iova address.
>> + * @return
>> + *   Virtual address corresponding to iova address (or NULL if address
>> + *   does not exist within DPDK memory map).
>> + */
>> +void *rte_mem_fast_iova2virt(rte_iova_t iova);
>> +
>> /**
>> * Get memseg to which a particular virtual address belongs.
>> *
> rte_mem_fast_virt2iova() and rte_mem_fast_iova2virt() serve the same purpose as rte_mem_virt2iova() and rte_mem_iova2virt(). Can't this alternative algorithm be implemented in the existing functions, or is there a reason for adding separate functions?
It is necessary to keep a version of the translation functions that does
the page table walks since that method must be used at init time. My
intention was to add new functions that are safe for use in fast path code.
>
> Minor detail about names: The DPDK community has a general aversion against using "fast" in function names - because everything in DPDK is supposed to be fast. If they need to be separate, perhaps use "_fp" (for fast path) at the end of their names, like the Trace library's RTE_TRACE_POINT_FP() variant of the RTE_TRACE_POINT() macro.
Thought that might be the case.. I can change to _fp.
>
> If these two functions need to be really fast, they could be inline instead of proper functions. In my opinion, they are short enough to allow inlining. And, if merged with the original functions, the fast code path could be inline and fall back to calling the original, slower functions (which would then need to be renamed). Others might disagree about inlining. Reusing the existing names might also probably break the ABI.
Agree that this could be inlined, but probably best not to merge with
the page walk version since it would not be desirable for fast path code
to fall back to that mode. Also since this patch implements support for
Linux EAL only, the inline functions may need to be added to a Linux
specific header.
>
> From a high level perspective - and referring to my previous argument about optimizations not being features [1] - I wonder if this patch should be a build time option instead of a runtime feature?
> @Don: My preference for build time options is somewhat controversial, so don't waste time converting to a build time option before the community has provided feedback on the subject.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D870EB@smartserver.smartshare.dk/
>
> [...]
>
>> diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
>> index c890c42106..3fefb3dc9d 100644
>> --- a/lib/eal/linux/eal_memory.c
>> +++ b/lib/eal/linux/eal_memory.c
>> @@ -148,6 +148,47 @@ rte_mem_virt2iova(const void *virtaddr)
>> return rte_mem_virt2phy(virtaddr);
>> }
>>
>> +static void *const_va_pa_delta;
> This variable is primarily used as an address offset, not a pointer, so consider making it "uintptr_t" instead of "void *".
Good point.
>> +
>> +#ifdef RTE_MEM_SANITY_CHECK
>> +#define __rte_mem_validate(v) rte_mem_validate(v)
>> +
>> +static int rte_mem_validate(const void *virtaddr)
>> +{
>> + if (!rte_mem_virt2memseg(virtaddr, NULL)) {
>> + RTE_LOG(ERR, EAL, "Invalid virtual address %p\n", virtaddr);
>> + return -1;
>> + }
>> + return 0;
>> +}
>> +#else
>> +#define __rte_mem_validate(v) 0
>> +#endif
>> +
>> +rte_iova_t rte_mem_fast_virt2iova(const void *virtaddr)
>> +{
>> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
>> + return (uintptr_t)virtaddr;
>> +
>> + if (__rte_mem_validate(virtaddr) != 0)
>> + return RTE_BAD_IOVA;
>> +
>> + return (rte_iova_t)((uintptr_t)virtaddr - (uintptr_t)const_va_pa_delta);
> You can probably use RTE_PTR_SUB() instead.
Will do.
>
>> +}
>> +
>> +void *rte_mem_fast_iova2virt(rte_iova_t iova)
>> +{
>> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
>> + return (void *)(uintptr_t)iova;
>> +
>> + void *virtaddr = (void *)((uintptr_t)const_va_pa_delta + iova);
> You can probably use RTE_PTR_ADD() instead.
Agreed.
-Don
>
>> +
>> + if (__rte_mem_validate(virtaddr) != 0)
>> + return NULL;
>> +
>> + return virtaddr;
>> +}
>> +
> From: Don Wallwork [mailto:donw@xsightlabs.com]
> Sent: Thursday, 15 September 2022 14.14
>
> On 9/15/2022 3:40 AM, Morten Brørup wrote:
> >> From: Don Wallwork [mailto:donw@xsightlabs.com]
> >> Sent: Wednesday, 14 September 2022 23.12
> >>
> >> This patch maps hugepage memory such that address translation from
> >> virtual to iova or vice versa can be done by simple addition/
> >> subtraction of a constant value without any page table walks.
> >>
> >> A new '--const-translate' EAL option is added to enable this mode.
[...]
> > rte_mem_fast_virt2iova() and rte_mem_fast_iova2virt() serve the same
> purpose as rte_mem_virt2iova() and rte_mem_iova2virt(). Can't this
> alternative algorithm be implemented in the existing functions, or is
> there a reason for adding separate functions?
> It is necessary to keep a version of the translation functions that
> does the page table walks since that method must be used at init time.
> My intention was to add new functions that are safe for use in fast
> path code.
> >
> > Minor detail about names: The DPDK community has a general aversion
> against using "fast" in function names - because everything in DPDK is
> supposed to be fast. If they need to be separate, perhaps use "_fp"
> (for fast path) at the end of their names, like the Trace library's
> RTE_TRACE_POINT_FP() variant of the RTE_TRACE_POINT() macro.
> Thought that might be the case.. I can change to _fp.
That should fly - at least there's a reference to point at. :-)
> >
> > If these two functions need to be really fast, they could be inline
> instead of proper functions. In my opinion, they are short enough to
> allow inlining. And, if merged with the original functions, the fast
> code path could be inline and fall back to calling the original, slower
> functions (which would then need to be renamed). Others might disagree
> about inlining. Reusing the existing names might also probably break
> the ABI.
> Agree that this could be inlined, but probably best not to merge with
> the page walk version since it would not be desirable for fast path
> code to fall back to that mode.
I have no strong opinion regarding merging.
And since I don't know how much these functions are going to be used, I have no strong opinion about inlining.
> Also since this patch implements support for
> Linux EAL only, the inline functions may need to be added to a Linux
> specific header.
These fast path functions (whether merged with their slow cousins or not) will be part of the public API, so fallback implementations for other operating systems need to be implemented. Making them inline or proper functions shouldn't have a big impact on the development effort.
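A minimal fallback for an operating system without the optimization could simply delegate to the existing slow path, keeping the public API uniform across platforms. This is a sketch of the shape only; the stub below stands in for the real rte_mem_virt2iova():

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t rte_iova_t;

/* Stand-in for the existing page-table-walking implementation
 * (here it pretends IOVA == VA, purely for the sketch). */
static rte_iova_t rte_mem_virt2iova_stub(const void *virt)
{
	return (rte_iova_t)(uintptr_t)virt;
}

/* On an OS without const-translate support, the fast-path symbol can
 * still be provided; it just falls through to the slow path. */
rte_iova_t rte_mem_fast_virt2iova(const void *virt)
{
	return rte_mem_virt2iova_stub(virt);
}
```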
I must admit that the way the DPDK EAL implements inline functions for various architectures seems a bit weird to me; being based largely on copy-paste, rather than a clear separation of declaration and implementation. I guess that the DPDK EAL suffers from the same weirdness for various operating systems.
> >
> > From a high level perspective - and referring to my previous
> argument about optimizations not being features [1] - I wonder if this
> patch should be a build time option instead of a runtime feature?
Making it a build time option would probably also implicitly answer some of the questions about how to implement the fast path functions.
> > @Don: My preference for build time options is somewhat controversial,
> so don't waste time converting to a build time option before the
> community has provided feedback on the subject.
> >
> > [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D870EB@smartserver.smartshare.dk/
Dmitry/Anatoly-
Any comments on this RFC?
FWIW, I did some investigation of adding support for this in non-legacy
mode and it seemed to me that this may be at odds with the dynmem
approach. My understanding is that with dynmem, the VA space is entirely
contiguous, but IOVA addresses are not. Would it create problems if
non-legacy VA space were non-contiguous?
Obviously the memory management code is at the heart of DPDK, so I don't
want to make a lot of changes if the approach is not acceptable to the
community.
My goal in providing a legacy-only RFC is to give some context for a
discussion of whether support for this change seems feasible for dynamic
mode.
Thanks,
Don
On 9/14/2022 5:12 PM, Don Wallwork wrote:
> This patch maps hugepage memory such that address translation from
> virtual to iova or vice versa can be done by simple addition/
> subtraction of a constant value without any page table walks.
>
> A new '--const-translate' EAL option is added to enable this mode.
>
> The following example describes how this works:
>
> Say you have a system with 4 huge pages that are 1G each and the
> physical addresses are 10, 11, 17 and 22G. If we map 13G of virtual
> address space, that will be enough to cover all of the huge page
> physical addresses.
>
> If the VA region starts at 1G, all of the hugepage PAs can
> be mapped into that region as shown below under Proposed
> heading. For comparison, existing mapping that would be
> done in legacy mode is shown under the Current heading.
>
> Proposed Current (Legacy)
>
> VA | PA VA | PA
> ----+---- ----+----
> 1G | 10G 1G | 10G
> 2G | 11G 2G | 11G
> 3G | - 3G | -
> 4G | - 4G | 17G
> 5G | - 5G | -
> 6G | - 6G | 22G
> 7G | -
> 8G | 17G
> 9G | -
> 10G | -
> 11G | -
> 12G | -
> 13G | 22G
>
> So in this example, we have a fixed offset of 9G to translate
> between VA and PA or vice versa. This works whether the huge
> pages happen to be allocated statically (legacy mode) or
> dynamically.
>
> The unused VA address space from 3G-7G and 9G-12G can be
> unmapped in just two unmap calls.
>
> This patch applies to legacy-mem mode only.
> ---
> doc/guides/linux_gsg/eal_args.include.rst | 5 +
> .../prog_guide/env_abstraction_layer.rst | 10 ++
> lib/eal/common/eal_common_options.c | 6 +
> lib/eal/common/eal_internal_cfg.h | 2 +
> lib/eal/common/eal_options.h | 2 +
> lib/eal/include/rte_memory.h | 28 +++
> lib/eal/linux/eal.c | 6 +
> lib/eal/linux/eal_memory.c | 165 +++++++++++++++++-
> 8 files changed, 218 insertions(+), 6 deletions(-)
>
> diff --git a/doc/guides/linux_gsg/eal_args.include.rst b/doc/guides/linux_gsg/eal_args.include.rst
> index 9cfbf7de84..cb3d554f5a 100644
> --- a/doc/guides/linux_gsg/eal_args.include.rst
> +++ b/doc/guides/linux_gsg/eal_args.include.rst
> @@ -122,6 +122,11 @@ Memory-related options
> to system pthread stack size unless the optional size (in kbytes) is
> specified.
>
> +* ``--const-translate``
> +
> + Prepare hugepage memory such that any hugepage virtual address is at a
> + constant offset from its physical address and vice versa.
> +
> Debugging options
> ~~~~~~~~~~~~~~~~~
>
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 67842ae272..c7ffada7ef 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -350,6 +350,16 @@ if the optional size parameter is not specified.
> hugepage worker thread stacks given the same thread stack size and
> loading conditions.
>
> +Constant Address Translation
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +When the ``--const-translate`` EAL option is specified, hugepage
> +memory is initialized to provide a constant offset between hugepage
> +virtual and physical addresses.
> +
> +This allows device drivers to quickly translate from both virtual to
> +physical and physical to virtual addresses for any hugepage address.
> +
> Support for Externally Allocated Memory
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
> index 4c2def0155..ba4151a65c 100644
> --- a/lib/eal/common/eal_common_options.c
> +++ b/lib/eal/common/eal_common_options.c
> @@ -104,6 +104,7 @@ eal_long_options[] = {
> {OPT_NO_TELEMETRY, 0, NULL, OPT_NO_TELEMETRY_NUM },
> {OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
> {OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM },
> + {OPT_CONST_TRANSLATE, 0, NULL, OPT_CONST_TRANSLATE_NUM },
>
> {0, 0, NULL, 0 }
> };
> @@ -2086,6 +2087,11 @@ eal_check_common_options(struct internal_config *internal_cfg)
> "be specified together with --"OPT_NO_HUGE"\n");
> return -1;
> }
> + if (internal_cfg->no_hugetlbfs && internal_cfg->const_translate) {
> + RTE_LOG(ERR, EAL, "Option --"OPT_CONST_TRANSLATE" cannot "
> + "be specified together with --"OPT_NO_HUGE"\n");
> + return -1;
> + }
> if (internal_conf->force_socket_limits && internal_conf->legacy_mem) {
> RTE_LOG(ERR, EAL, "Option --"OPT_SOCKET_LIMIT
> " is only supported in non-legacy memory mode\n");
> diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
> index 167ec501fa..d4a034d823 100644
> --- a/lib/eal/common/eal_internal_cfg.h
> +++ b/lib/eal/common/eal_internal_cfg.h
> @@ -103,6 +103,8 @@ struct internal_config {
> struct simd_bitwidth max_simd_bitwidth;
> /**< max simd bitwidth path to use */
> size_t huge_worker_stack_size; /**< worker thread stack size */
> + volatile unsigned const_translate;
> + /**< true to enable constant VA->PA, PA->VA address translation */
> };
>
> void eal_reset_internal_config(struct internal_config *internal_cfg);
> diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
> index 3cc9cb6412..98cd22fd32 100644
> --- a/lib/eal/common/eal_options.h
> +++ b/lib/eal/common/eal_options.h
> @@ -89,6 +89,8 @@ enum {
> OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
> #define OPT_HUGE_WORKER_STACK "huge-worker-stack"
> OPT_HUGE_WORKER_STACK_NUM,
> +#define OPT_CONST_TRANSLATE "const-translate"
> + OPT_CONST_TRANSLATE_NUM,
>
> OPT_LONG_MAX_NUM
> };
> diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
> index 68b069fd04..c87777ca01 100644
> --- a/lib/eal/include/rte_memory.h
> +++ b/lib/eal/include/rte_memory.h
> @@ -134,6 +134,34 @@ rte_iova_t rte_mem_virt2iova(const void *virt);
> void *
> rte_mem_iova2virt(rte_iova_t iova);
>
> +/**
> + * Get IO virtual address of any mapped virtual address in the current process.
> + *
> + * @note This function provides a fast implementation of virtual to physical
> + * address translation that does not walk any page tables. Suitable for use in
> + * data plane threads.
> + *
> + * @param virt
> + * The virtual address.
> + * @return
> + * The IO address or RTE_BAD_IOVA on error.
> + */
> +rte_iova_t rte_mem_fast_virt2iova(const void *virt);
> +
> +/**
> + * Get virtual memory address corresponding to iova address.
> + *
> + * @note This function provides a fast implementation of physical to virtual
> + * address translation. Suitable for use in data plane threads.
> + *
> + * @param iova
> + * The iova address.
> + * @return
> + * Virtual address corresponding to iova address (or NULL if address does not
> + * exist within DPDK memory map).
> + */
> +void *rte_mem_fast_iova2virt(rte_iova_t iova);
> +
> /**
> * Get memseg to which a particular virtual address belongs.
> *
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 37d29643a5..6e4e60d8c1 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -455,6 +455,8 @@ eal_usage(const char *prgname)
> " Allocate worker thread stacks from hugepage memory.\n"
> " Size is in units of kbytes and defaults to system\n"
> " thread stack size if not specified.\n"
> + " --"OPT_CONST_TRANSLATE" Constant delta between hugepage "
> + "physical and virtual addresses\n"
> "\n");
> /* Allow the application to print its usage message too if hook is set */
> if (hook) {
> @@ -767,6 +769,10 @@ eal_parse_args(int argc, char **argv)
> }
> break;
>
> + case OPT_CONST_TRANSLATE_NUM:
> + internal_conf->const_translate = 1;
> + break;
> +
> default:
> if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
> RTE_LOG(ERR, EAL, "Option %c is not supported "
> diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
> index c890c42106..3fefb3dc9d 100644
> --- a/lib/eal/linux/eal_memory.c
> +++ b/lib/eal/linux/eal_memory.c
> @@ -148,6 +148,47 @@ rte_mem_virt2iova(const void *virtaddr)
> return rte_mem_virt2phy(virtaddr);
> }
>
> +static void *const_va_pa_delta;
> +
> +#ifdef RTE_MEM_SANITY_CHECK
> +#define __rte_mem_validate(v) rte_mem_validate(v)
> +
> +static int rte_mem_validate(const void *virtaddr)
> +{
> + if (!rte_mem_virt2memseg(virtaddr, NULL)) {
> + RTE_LOG(ERR, EAL, "Invalid virtual address %p\n", virtaddr);
> + return -1;
> + }
> + return 0;
> +}
> +#else
> +#define __rte_mem_validate(v) 0
> +#endif
> +
> +rte_iova_t rte_mem_fast_virt2iova(const void *virtaddr)
> +{
> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
> + return (uintptr_t)virtaddr;
> +
> + if (__rte_mem_validate(virtaddr) != 0)
> + return RTE_BAD_IOVA;
> +
> + return (rte_iova_t)((uintptr_t)virtaddr - (uintptr_t)const_va_pa_delta);
> +}
> +
> +void *rte_mem_fast_iova2virt(rte_iova_t iova)
> +{
> + if (rte_eal_iova_mode() == RTE_IOVA_VA)
> + return (void *)(uintptr_t)iova;
> +
> + void *virtaddr = (void *)((uintptr_t)const_va_pa_delta + iova);
> +
> + if (__rte_mem_validate(virtaddr) != 0)
> + return NULL;
> +
> + return virtaddr;
> +}
> +
> /*
> * For each hugepage in hugepg_tbl, fill the physaddr value. We find
> * it by browsing the /proc/self/pagemap special file.
> @@ -664,10 +705,9 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
> uint64_t page_sz;
> size_t memseg_len;
> int socket_id;
> -#ifndef RTE_ARCH_64
> const struct internal_config *internal_conf =
> eal_get_internal_configuration();
> -#endif
> +
> page_sz = hugepages[seg_start].size;
> socket_id = hugepages[seg_start].socket_id;
> seg_len = seg_end - seg_start;
> @@ -691,6 +731,12 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
> ms_idx = rte_fbarray_find_next_n_free(arr, 0,
> seg_len + (empty ? 0 : 1));
>
> + if (internal_conf->const_translate &&
> + internal_conf->legacy_mem &&
> + rte_eal_iova_mode() == RTE_IOVA_PA &&
> + ms_idx != 0)
> + continue;
> +
> /* memseg list is full? */
> if (ms_idx < 0)
> continue;
> @@ -735,7 +781,12 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
> return -1;
> }
> memseg_len = (size_t)page_sz;
> - addr = RTE_PTR_ADD(msl->base_va, ms_idx * memseg_len);
> + if (internal_conf->const_translate &&
> + internal_conf->legacy_mem &&
> + rte_eal_iova_mode() == RTE_IOVA_PA)
> + addr = RTE_PTR_ADD(const_va_pa_delta, hfile->physaddr);
> + else
> + addr = RTE_PTR_ADD(msl->base_va, ms_idx * memseg_len);
>
> /* we know this address is already mmapped by memseg list, so
> * using MAP_FIXED here is safe
> @@ -1085,6 +1136,98 @@ huge_recover_sigbus(void)
> }
> }
>
> +static int
> +remap_hugepages_const_xlate(struct hugepage_file *hugepage, int n_pages,
> + int nr_hugepages)
> +{
> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> + int i, remap_failed = 0;
> + void *addr;
> +
> + /* Adjust VA bases in memory segment lists to enable constant
> + * va->pa and pa->va address translation
> + */
> + if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> + RTE_LOG(INFO, EAL,
> + "Enabling constant address translation support...\n");
> +
> + /* Allocate virtual address space to cover the full
> + * range of huge page physical addresses
> + */
> + size_t va_mem_sz =
> + hugepage[nr_hugepages - 1].physaddr +
> + hugepage[nr_hugepages - 1].size -
> + hugepage[0].physaddr;
> + size_t page_sz = 0;
> +
> + for (i = 0; i < nr_hugepages; i++)
> + if (hugepage[i].size > page_sz)
> + page_sz = hugepage[i].size;
> +
> + void *va_base =
> + eal_get_virtual_area(NULL, &va_mem_sz, page_sz, 0, 0);
> +
> + if (va_base == NULL) {
> + RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
> + return -ENOMEM;
> + }
> + const_va_pa_delta = RTE_PTR_ADD(va_base, -hugepage[0].physaddr);
> +
> + /* Unmap gaps in virtual address space when there are gaps
> + * between huge page physical addresses
> + */
> + for (i = 1; i < nr_hugepages; i++) {
> + size_t gap_sz = hugepage[i].physaddr -
> + (hugepage[i-1].physaddr + hugepage[i-1].size);
> +
> + if (gap_sz) {
> + addr = RTE_PTR_ADD(const_va_pa_delta,
> + hugepage[i-1].physaddr +
> + hugepage[i-1].size);
> +
@@ -122,6 +122,11 @@ Memory-related options
to system pthread stack size unless the optional size (in kbytes) is
specified.
+* ``--const-translate``
+
+  Prepare hugepage memory such that every hugepage virtual address is at a
+  constant offset from its physical address, and vice versa.
+
Debugging options
~~~~~~~~~~~~~~~~~
@@ -350,6 +350,16 @@ if the optional size parameter is not specified.
hugepage worker thread stacks given the same thread stack size and
loading conditions.
+Constant Address Translation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When the ``--const-translate`` EAL option is specified, hugepage
+memory is initialized to provide a constant offset between hugepage
+virtual and physical addresses.
+
+This allows device drivers to translate quickly between virtual and
+physical addresses, in either direction, for any hugepage address.
+
Support for Externally Allocated Memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -104,6 +104,7 @@ eal_long_options[] = {
{OPT_NO_TELEMETRY, 0, NULL, OPT_NO_TELEMETRY_NUM },
{OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
{OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM },
+ {OPT_CONST_TRANSLATE, 0, NULL, OPT_CONST_TRANSLATE_NUM },
{0, 0, NULL, 0 }
};
@@ -2086,6 +2087,11 @@ eal_check_common_options(struct internal_config *internal_cfg)
"be specified together with --"OPT_NO_HUGE"\n");
return -1;
}
+ if (internal_cfg->no_hugetlbfs && internal_cfg->const_translate) {
+ RTE_LOG(ERR, EAL, "Option --"OPT_CONST_TRANSLATE" cannot "
+ "be specified together with --"OPT_NO_HUGE"\n");
+ return -1;
+ }
if (internal_conf->force_socket_limits && internal_conf->legacy_mem) {
RTE_LOG(ERR, EAL, "Option --"OPT_SOCKET_LIMIT
" is only supported in non-legacy memory mode\n");
@@ -103,6 +103,8 @@ struct internal_config {
struct simd_bitwidth max_simd_bitwidth;
/**< max simd bitwidth path to use */
size_t huge_worker_stack_size; /**< worker thread stack size */
+ volatile unsigned const_translate;
+ /**< true to enable constant VA->PA, PA->VA address translation */
};
void eal_reset_internal_config(struct internal_config *internal_cfg);
@@ -89,6 +89,8 @@ enum {
OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
#define OPT_HUGE_WORKER_STACK "huge-worker-stack"
OPT_HUGE_WORKER_STACK_NUM,
+#define OPT_CONST_TRANSLATE "const-translate"
+ OPT_CONST_TRANSLATE_NUM,
OPT_LONG_MAX_NUM
};
@@ -134,6 +134,34 @@ rte_iova_t rte_mem_virt2iova(const void *virt);
void *
rte_mem_iova2virt(rte_iova_t iova);
+/**
+ * Get IO virtual address of any mapped virtual address in the current process.
+ *
+ * @note This function provides a fast translation from virtual to physical
+ *   addresses that does not walk any page tables. Suitable for use in
+ *   data plane threads.
+ *
+ * @param virt
+ * The virtual address.
+ * @return
+ * The IO address or RTE_BAD_IOVA on error.
+ */
+rte_iova_t rte_mem_fast_virt2iova(const void *virt);
+
+/**
+ * Get virtual memory address corresponding to iova address.
+ *
+ * @note This function provides a fast translation from physical to virtual
+ *   addresses. Suitable for use in data plane threads.
+ *
+ * @param iova
+ * The iova address.
+ * @return
+ * Virtual address corresponding to iova address (or NULL if address does not
+ * exist within DPDK memory map).
+ */
+void *rte_mem_fast_iova2virt(rte_iova_t iova);
+
/**
* Get memseg to which a particular virtual address belongs.
*
@@ -455,6 +455,8 @@ eal_usage(const char *prgname)
" Allocate worker thread stacks from hugepage memory.\n"
" Size is in units of kbytes and defaults to system\n"
" thread stack size if not specified.\n"
+ " --"OPT_CONST_TRANSLATE" Constant delta between hugepage "
+ "physical and virtual addresses\n"
"\n");
/* Allow the application to print its usage message too if hook is set */
if (hook) {
@@ -767,6 +769,10 @@ eal_parse_args(int argc, char **argv)
}
break;
+ case OPT_CONST_TRANSLATE_NUM:
+ internal_conf->const_translate = 1;
+ break;
+
default:
if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
RTE_LOG(ERR, EAL, "Option %c is not supported "
@@ -148,6 +148,47 @@ rte_mem_virt2iova(const void *virtaddr)
return rte_mem_virt2phy(virtaddr);
}
+static void *const_va_pa_delta;
+
+#ifdef RTE_MEM_SANITY_CHECK
+#define __rte_mem_validate(v) rte_mem_validate(v)
+
+static int rte_mem_validate(const void *virtaddr)
+{
+	if (!rte_mem_virt2memseg(virtaddr, NULL)) {
+ RTE_LOG(ERR, EAL, "Invalid virtual address %p\n", virtaddr);
+ return -1;
+ }
+ return 0;
+}
+#else
+#define __rte_mem_validate(v) 0
+#endif
+
+rte_iova_t rte_mem_fast_virt2iova(const void *virtaddr)
+{
+ if (rte_eal_iova_mode() == RTE_IOVA_VA)
+ return (uintptr_t)virtaddr;
+
+ if (__rte_mem_validate(virtaddr) != 0)
+ return RTE_BAD_IOVA;
+
+ return (rte_iova_t)((uintptr_t)virtaddr - (uintptr_t)const_va_pa_delta);
+}
+
+void *rte_mem_fast_iova2virt(rte_iova_t iova)
+{
+ if (rte_eal_iova_mode() == RTE_IOVA_VA)
+ return (void *)(uintptr_t)iova;
+
+ void *virtaddr = (void *)((uintptr_t)const_va_pa_delta + iova);
+
+ if (__rte_mem_validate(virtaddr) != 0)
+ return NULL;
+
+ return virtaddr;
+}
+
/*
* For each hugepage in hugepg_tbl, fill the physaddr value. We find
* it by browsing the /proc/self/pagemap special file.
@@ -664,10 +705,9 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
uint64_t page_sz;
size_t memseg_len;
int socket_id;
-#ifndef RTE_ARCH_64
const struct internal_config *internal_conf =
eal_get_internal_configuration();
-#endif
+
page_sz = hugepages[seg_start].size;
socket_id = hugepages[seg_start].socket_id;
seg_len = seg_end - seg_start;
@@ -691,6 +731,12 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
ms_idx = rte_fbarray_find_next_n_free(arr, 0,
seg_len + (empty ? 0 : 1));
+ if (internal_conf->const_translate &&
+ internal_conf->legacy_mem &&
+ rte_eal_iova_mode() == RTE_IOVA_PA &&
+ ms_idx != 0)
+ continue;
+
/* memseg list is full? */
if (ms_idx < 0)
continue;
@@ -735,7 +781,12 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
return -1;
}
memseg_len = (size_t)page_sz;
- addr = RTE_PTR_ADD(msl->base_va, ms_idx * memseg_len);
+ if (internal_conf->const_translate &&
+ internal_conf->legacy_mem &&
+ rte_eal_iova_mode() == RTE_IOVA_PA)
+ addr = RTE_PTR_ADD(const_va_pa_delta, hfile->physaddr);
+ else
+ addr = RTE_PTR_ADD(msl->base_va, ms_idx * memseg_len);
/* we know this address is already mmapped by memseg list, so
* using MAP_FIXED here is safe
@@ -1085,6 +1136,98 @@ huge_recover_sigbus(void)
}
}
+static int
+remap_hugepages_const_xlate(struct hugepage_file *hugepage, int n_pages,
+ int nr_hugepages)
+{
+ struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+ int i, remap_failed = 0;
+ void *addr;
+
+ /* Adjust VA bases in memory segment lists to enable constant
+ * va->pa and pa->va address translation
+ */
+ if (rte_eal_iova_mode() == RTE_IOVA_PA) {
+ RTE_LOG(INFO, EAL,
+ "Enabling constant address translation support...\n");
+
+ /* Allocate virtual address space to cover the full
+ * range of huge page physical addresses
+ */
+ size_t va_mem_sz =
+ hugepage[nr_hugepages - 1].physaddr +
+ hugepage[nr_hugepages - 1].size -
+ hugepage[0].physaddr;
+ size_t page_sz = 0;
+
+ for (i = 0; i < nr_hugepages; i++)
+ if (hugepage[i].size > page_sz)
+ page_sz = hugepage[i].size;
+
+ void *va_base =
+ eal_get_virtual_area(NULL, &va_mem_sz, page_sz, 0, 0);
+
+ if (va_base == NULL) {
+ RTE_LOG(ERR, EAL, "Cannot reserve memory\n");
+ return -ENOMEM;
+ }
+ const_va_pa_delta = RTE_PTR_ADD(va_base, -hugepage[0].physaddr);
+
+ /* Unmap gaps in virtual address space when there are gaps
+ * between huge page physical addresses
+ */
+ for (i = 1; i < nr_hugepages; i++) {
+ size_t gap_sz = hugepage[i].physaddr -
+ (hugepage[i-1].physaddr + hugepage[i-1].size);
+
+ if (gap_sz) {
+ addr = RTE_PTR_ADD(const_va_pa_delta,
+ hugepage[i-1].physaddr +
+ hugepage[i-1].size);
+
+ if (munmap(addr, gap_sz) != 0)
+ RTE_LOG(ERR, EAL, "Gap unmap failed\n");
+ }
+ }
+ }
+
+ /* remap all pages we do need into memseg list VA space, so that those
+ * pages become first-class citizens in DPDK memory subsystem
+ */
+ if (remap_needed_hugepages(hugepage, n_pages)) {
+ RTE_LOG(ERR, EAL,
+ "Couldn't remap hugepage files into memseg lists\n");
+ remap_failed = 1;
+ }
+
+ /* Unmap the existing virtual address space in each MSL with
+ * allocated pages. Modify MSL base_va to be the VA of the
+ * first page of the segment list. Adjust the msl->len to the
+ * length of the address space consumed by the msl.
+ */
+ for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+ struct rte_memseg_list *msl = &mcfg->memsegs[i];
+ struct rte_memseg *ms;
+
+ /* skip inactive lists */
+ if (msl->base_va == NULL)
+ continue;
+
+ /* skip lists where there are no pages allocated */
+ if (!msl->memseg_arr.count)
+ continue;
+
+ /* release current VA space */
+ munmap(msl->base_va, msl->len);
+
+ /* assign new VA base and len */
+ ms = rte_fbarray_get(&msl->memseg_arr, 0);
+ msl->base_va = ms->addr;
+ msl->len = (msl->page_sz * msl->memseg_arr.count);
+ }
+ return remap_failed;
+}
+
/*
* Prepare physical memory mapping: fill configuration structure with
* these infos, return 0 on success.
@@ -1413,9 +1556,19 @@ eal_legacy_hugepage_init(void)
/* remap all pages we do need into memseg list VA space, so that those
* pages become first-class citizens in DPDK memory subsystem
*/
- if (remap_needed_hugepages(hugepage, nr_hugefiles)) {
- RTE_LOG(ERR, EAL, "Couldn't remap hugepage files into memseg lists\n");
- goto fail;
+ if (internal_conf->const_translate) {
+ if (remap_hugepages_const_xlate(hugepage, nr_hugefiles,
+ nr_hugepages)) {
+ RTE_LOG(ERR, EAL,
+ "Couldn't remap hugepage files into memseg lists\n");
+ goto fail;
+ }
+ } else {
+ if (remap_needed_hugepages(hugepage, nr_hugefiles)) {
+ RTE_LOG(ERR, EAL,
+ "Couldn't remap hugepage files into memseg lists\n");
+ goto fail;
+ }
}
/* free the hugepage backing files */