[v2,2/5] mem: use address hint for mapping hugepages

Message ID 1535719857-19092-3-git-send-email-alejandro.lucero@netronome.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series use IOVAs check based on DMA mask

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Alejandro Lucero Aug. 31, 2018, 12:50 p.m. UTC
The Linux kernel uses a really high address as the starting address for
serving mmap calls. If there are addressing limitations and
IOVA mode is VA, this starting address is likely too high for
those devices. However, it is possible to use a lower address in
the process virtual address space, since with 64 bits there is a lot
of available space.

This patch adds an address hint as the starting address for 64-bit
systems.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
---
 lib/librte_eal/common/eal_common_memory.c | 35 ++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)
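
To illustrate the problem the commit message describes, the sketch below reserves an anonymous mapping with an address hint and checks whether the result fits under a device addressing limit. It is not the DPDK code: `fits_in_mask` and `reserve_at_hint` are hypothetical helpers, and a 64-bit Linux system is assumed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: does [addr, addr + len) fit below 2^bits? */
static int fits_in_mask(const void *addr, size_t len, unsigned bits)
{
	uint64_t end = (uint64_t)(uintptr_t)addr + len;

	/* a 64-bit (or wider) mask covers everything; avoids an invalid shift */
	return bits >= 64 || end <= (1ULL << bits);
}

/*
 * Hypothetical helper: anonymous read-only reservation with an address hint.
 * Without MAP_FIXED the kernel is free to return a different address, so the
 * caller has to compare the result against the hint.
 */
static void *reserve_at_hint(void *hint, size_t len)
{
	return mmap(hint, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

On a typical x86-64 Linux system, a NULL hint tends to return an address near the top of the user address space, which would fail a 40-bit check, while a hint of 0x100000000 usually passes when that region is free.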
  

Comments

Anatoly Burakov Oct. 3, 2018, 12:50 p.m. UTC | #1
On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> Linux kernel uses a really high address as starting address for
> serving mmaps calls. If there exist addressing limitations and
> IOVA mode is VA, this starting address is likely too high for
> those devices. However, it is possible to use a lower address in
> the process virtual address space as with 64 bits there is a lot
> of available space.
> 
> This patch adds an address hint as starting address for 64 bits
> systems.
> 
> Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> ---

<snip>

>   
>   		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
>   				mmap_flags, -1, 0);
> +
>   		if (mapped_addr == MAP_FAILED && allow_shrink)

Unintended whitespace change?

>   			*size -= page_sz;
> -	} while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> +
> +		if (mapped_addr != MAP_FAILED && addr_is_hint &&
> +		    mapped_addr != requested_addr) {
> +			/* hint was not used. Try with another offset */
> +			munmap(mapped_addr, map_sz);
> +			mapped_addr = MAP_FAILED;
> +			next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);

Why not increment by page size? Sure, it could take some more time to 
allocate, but will result in less wasted memory.
  
Alejandro Lucero Oct. 4, 2018, 11:43 a.m. UTC | #2
On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> > Linux kernel uses a really high address as starting address for
> > serving mmaps calls. If there exist addressing limitations and
> > IOVA mode is VA, this starting address is likely too high for
> > those devices. However, it is possible to use a lower address in
> > the process virtual address space as with 64 bits there is a lot
> > of available space.
> >
> > This patch adds an address hint as starting address for 64 bits
> > systems.
> >
> > Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
> > ---
>
> <snip>
>
> >
> >               mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
> >                               mmap_flags, -1, 0);
> > +
> >               if (mapped_addr == MAP_FAILED && allow_shrink)
>
> Unintended whitespace change?
>
>
Yes. I'll fix it.


> >                       *size -= page_sz;
> > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> > +
> > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
> > +                 mapped_addr != requested_addr) {
> > +                     /* hint was not used. Try with another offset */
> > +                     munmap(mapped_addr, map_sz);
> > +                     mapped_addr = MAP_FAILED;
> > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
>
> Why not increment by page size? Sure, it could take some more time to
> allocate, but will result in less wasted memory.
>
>
I thought the same, or even using increments smaller than the hugepage size.
Incrementing the address by such an amount does not mean we are wasting memory,
just leaving space if some mmap fails. I think it is better to leave as
much space as possible, just in case the data allocated in the conflicting
area needs to grow in the future.


> --
> Thanks,
> Anatoly
>
  
Anatoly Burakov Oct. 4, 2018, 12:08 p.m. UTC | #3
On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> 
> 
> On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> 
>     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>
>     <snip>
> 
>      >
>      >               mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
>      >                               mmap_flags, -1, 0);
>      > +
>      >               if (mapped_addr == MAP_FAILED && allow_shrink)
> 
>     Unintended whitespace change?
> 
> 
> Yes. I'll fix it.
> 
>      >                       *size -= page_sz;
>      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
>      > +
>      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
>      > +                 mapped_addr != requested_addr) {
>      > +                     /* hint was not used. Try with another offset */
>      > +                     munmap(mapped_addr, map_sz);
>      > +                     mapped_addr = MAP_FAILED;
>      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
> 
>     Why not increment by page size? Sure, it could take some more time to
>     allocate, but will result in less wasted memory.
> 
> 
> I thought the same, or even using increments smaller than the hugepage size.
> Incrementing the address by such an amount does not mean we are wasting memory,
> just leaving space if some mmap fails. I think it is better to leave as
> much space as possible, just in case the data allocated in the conflicting
> area needs to grow in the future.

Not sure I follow. Could you give an example of a scenario where leaving
huge chunks of memory free would be preferable to just adding page size
and starting from a page-size-aligned address next time we allocate?

> 
>     -- 
>     Thanks,
>     Anatoly
>
  
Alejandro Lucero Oct. 4, 2018, 1:15 p.m. UTC | #4
On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> >
> >
> > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >
> >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >
> >     <snip>
> >
> >      >
> >      >               mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
> >      >                               mmap_flags, -1, 0);
> >      > +
> >      >               if (mapped_addr == MAP_FAILED && allow_shrink)
> >
> >     Unintended whitespace change?
> >
> >
> > Yes. I'll fix it.
> >
> >      >                       *size -= page_sz;
> >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> >      > +
> >      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
> >      > +                 mapped_addr != requested_addr) {
> >      > +                     /* hint was not used. Try with another offset */
> >      > +                     munmap(mapped_addr, map_sz);
> >      > +                     mapped_addr = MAP_FAILED;
> >      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
> >
> >     Why not increment by page size? Sure, it could take some more time to
> >     allocate, but will result in less wasted memory.
> >
> >
> > I thought the same, or even using increments smaller than the hugepage size.
> > Incrementing the address by such an amount does not mean we are wasting memory,
> > just leaving space if some mmap fails. I think it is better to leave as
> > much space as possible, just in case the data allocated in the conflicting
> > area needs to grow in the future.
>
> Not sure I follow. Could you give an example of a scenario where leaving
> huge chunks of memory free would be preferable to just adding page size
> and starting from a page-size-aligned address next time we allocate?
>
>
Usually there is nothing at the 4GB address in 64-bit processes: the text
section is usually the first process region mapped, and it currently sits far
higher than 4GB. If something is mapped there before the EAL hugepage/memory
initialization code runs, I am not sure what it would be for, but maybe it
needs to grow using contiguous virtual addresses. As I say, I have no idea
what this could be used for, but the smaller the gap left when trying again
in this code, the less likely it is that such flexibility will be there.

Maybe making the increment smaller only makes sense for virtual address
space randomization, for security reasons.

Anyway, there is a lot of space with 64 bits and, IMHO, this should not be
a problem while the increment is negligible compared to the 64-bit address
space size; 4GB is as negligible in this case as 4 bytes are to 4GB.
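
The proportions mentioned above can be checked with a few lines of C. This is only a sketch: `gib_above_base` and `increments_log2` are hypothetical helpers, and the 39-bit and 40-bit limits are the ones named in the code comment of the patch.

```c
#include <stdint.h>

#define GIB (1ULL << 30)

/*
 * Hypothetical helper: GiB of virtual address space between `base` and the
 * top of a range addressable with `mask_bits` address bits (mask_bits < 64).
 */
static uint64_t gib_above_base(unsigned mask_bits, uint64_t base)
{
	return ((1ULL << mask_bits) - base) / GIB;
}

/*
 * Hypothetical helper: log2 of the number of 2^step_bits-sized increments
 * that fit in a 2^space_bits-sized space.
 */
static unsigned increments_log2(unsigned space_bits, unsigned step_bits)
{
	return space_bits - step_bits;
}
```

`gib_above_base(39, 4 * GIB)` gives 508 and `gib_above_base(40, 4 * GIB)` gives 1020, matching the figures in the patch's code comment. `increments_log2(64, 32)` is 32 while `increments_log2(32, 2)` is 30, so a 4GB step is an even smaller fraction of a full 64-bit space than 4 bytes are of 4GB.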


> >
> >     --
> >     Thanks,
> >     Anatoly
> >
>
>
> --
> Thanks,
> Anatoly
>
  
Anatoly Burakov Oct. 4, 2018, 3:43 p.m. UTC | #5
On 04-Oct-18 2:15 PM, Alejandro Lucero wrote:
> 
> 
> On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> 
>     On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
>      >
>      >
>      > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
>      > <anatoly.burakov@intel.com> wrote:
>      >
>      >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
>      >
>      >     <snip>
>      >
>      >      >
>      >      >               mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
>      >      >                               mmap_flags, -1, 0);
>      >      > +
>      >      >               if (mapped_addr == MAP_FAILED && allow_shrink)
>      >
>      >     Unintended whitespace change?
>      >
>      >
>      > Yes. I'll fix it.
>      >
>      >      >                       *size -= page_sz;
>      >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
>      >      > +
>      >      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
>      >      > +                 mapped_addr != requested_addr) {
>      >      > +                     /* hint was not used. Try with another offset */
>      >      > +                     munmap(mapped_addr, map_sz);
>      >      > +                     mapped_addr = MAP_FAILED;
>      >      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
>      >
>      >     Why not increment by page size? Sure, it could take some more time to
>      >     allocate, but will result in less wasted memory.
>      >
>      >
>      > I thought the same, or even using increments smaller than the hugepage size.
>      > Incrementing the address by such an amount does not mean we are wasting memory,
>      > just leaving space if some mmap fails. I think it is better to leave as
>      > much space as possible, just in case the data allocated in the conflicting
>      > area needs to grow in the future.
>
>     Not sure I follow. Could you give an example of a scenario where leaving
>     huge chunks of memory free would be preferable to just adding page size
>     and starting from a page-size-aligned address next time we allocate?
> 
> 
> Usually there is nothing at 4GB address in 64 bit processes, usually the 
> text section being the first process region mapped and currently at far 
> higher than 4GB. If there is something mapped there before executing the 
> EAL hugepage/memory initialization code, not sure what it will be for, 
> but maybe it needs to grow using contiguous virtual addresses. As I say, 
> no idea what this could be used for, but the shorter the space when 
> trying again in this code, the less likely that flexibility could be there.

But you're already leaving holes there, what difference does it make? I
mean, it's not important, I'm just not sure why the arbitrary
0x100000000 increment instead of page size. Most of the calls into this 
function are from init code, and with init code we're usually calling 
this function quite a few times in succession (especially during memseg 
list allocations), so you are skipping space that could've been used for 
that.

(btw if you are to use this constant, it should be a macro, not a raw 
constant)
  
Alejandro Lucero Oct. 4, 2018, 5:58 p.m. UTC | #6
On Thu, Oct 4, 2018 at 4:43 PM Burakov, Anatoly <anatoly.burakov@intel.com>
wrote:

> On 04-Oct-18 2:15 PM, Alejandro Lucero wrote:
> >
> >
> > On Thu, Oct 4, 2018 at 1:08 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >
> >     On 04-Oct-18 12:43 PM, Alejandro Lucero wrote:
> >      >
> >      >
> >      > On Wed, Oct 3, 2018 at 1:50 PM Burakov, Anatoly
> >      > <anatoly.burakov@intel.com> wrote:
> >      >
> >      >     On 31-Aug-18 1:50 PM, Alejandro Lucero wrote:
> >      >
> >      >     <snip>
> >      >
> >      >      >
> >      >      >               mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
> >      >      >                               mmap_flags, -1, 0);
> >      >      > +
> >      >      >               if (mapped_addr == MAP_FAILED && allow_shrink)
> >      >
> >      >     Unintended whitespace change?
> >      >
> >      >
> >      > Yes. I'll fix it.
> >      >
> >      >      >                       *size -= page_sz;
> >      >      > -     } while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
> >      >      > +
> >      >      > +             if (mapped_addr != MAP_FAILED && addr_is_hint &&
> >      >      > +                 mapped_addr != requested_addr) {
> >      >      > +                     /* hint was not used. Try with another offset */
> >      >      > +                     munmap(mapped_addr, map_sz);
> >      >      > +                     mapped_addr = MAP_FAILED;
> >      >      > +                     next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
> >      >
> >      >     Why not increment by page size? Sure, it could take some more time to
> >      >     allocate, but will result in less wasted memory.
> >      >
> >      >
> >      > I thought the same, or even using increments smaller than the hugepage size.
> >      > Incrementing the address by such an amount does not mean we are wasting memory,
> >      > just leaving space if some mmap fails. I think it is better to leave as
> >      > much space as possible, just in case the data allocated in the conflicting
> >      > area needs to grow in the future.
> >
> >     Not sure I follow. Could you give an example of a scenario where leaving
> >     huge chunks of memory free would be preferable to just adding page size
> >     and starting from a page-size-aligned address next time we allocate?
> >
> >
> > Usually there is nothing at 4GB address in 64 bit processes, usually the
> > text section being the first process region mapped and currently at far
> > higher than 4GB. If there is something mapped there before executing the
> > EAL hugepage/memory initialization code, not sure what it will be for,
> > but maybe it needs to grow using contiguous virtual addresses. As I say,
> > no idea what this could be used for, but the shorter the space when
> > trying again in this code, the less likely that flexibility could be there.
>
> But you're already leaving holes there, what difference does it make? I
> mean, it's not important, i'm just not sure why the arbitrary
> 0x100000000 increment instead of page size. Most of the calls into this
> function are from init code, and with init code we're usually calling
> this function quite a few times in succession (especially during memseg
> list allocations), so you are skipping space that could've been used for
> that.
>
>
Note that the increment is the page size if there is no problem; the 4GB
increment is only used if that specific address fails.
I'm not against changing this to always use the hugepage size instead, and it
seems my previous comment did not convince you. So I'll change that, because
I cannot sustain my case without any real data. :-)
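
The retry logic under discussion, with a caller-chosen increment such as the page size, could look roughly like this. It is a sketch only: `reserve_near_hint` is a hypothetical helper, not the function in eal_common_memory.c, and it leaves out the size shrinking and alignment handling that eal_get_virtual_area performs. A 64-bit Linux system is assumed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try to reserve `len` bytes at `hint`. If the kernel
 * places the mapping elsewhere, unmap it and retry `step` bytes higher, for
 * at most `max_tries` attempts. Returns the mapped address or MAP_FAILED.
 */
static void *reserve_near_hint(void *hint, size_t len, size_t step,
			       int max_tries)
{
	uintptr_t want = (uintptr_t)hint;

	for (int i = 0; i < max_tries; i++, want += step) {
		void *got = mmap((void *)want, len, PROT_READ,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (got == MAP_FAILED)
			return MAP_FAILED;	/* real failure, e.g. ENOMEM */
		if (got == (void *)want)
			return got;		/* hint was honored */
		munmap(got, len);		/* hint ignored: release, move on */
	}
	return MAP_FAILED;
}
```

Passing the page size as `step` gives the behaviour the reviewer asked for, while passing 0x100000000 reproduces the increment used in this version of the patch.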



> (btw if you are to use this constant, it should be a macro, not a raw
> constant)
>
> --
> Thanks,
> Anatoly
>
  

Patch

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index bdd8f44..97378b1 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -37,6 +37,23 @@ 
 static void *next_baseaddr;
 static uint64_t system_page_sz;
 
+#ifdef RTE_ARCH_64
+/*
+ * Linux kernel uses a really high address as starting address for serving
+ * mmaps calls. If there exists addressing limitations and IOVA mode is VA,
+ * this starting address is likely too high for those devices. However, it
+ * is possible to use a lower address in the process virtual address space
+ * as with 64 bits there is a lot of available space.
+ *
+ * Current known limitations are 39 or 40 bits. Setting the starting address
+ * at 4GB implies there are 508GB or 1020GB for mapping the available
+ * hugepages. This is likely enough for most systems, although a device with
+ * addressing limitations should call rte_eal_check_dma_mask for ensuring all
+ * memory is within supported range.
+ */
+static uint64_t baseaddr = 0x100000000;
+#endif
+
 void *
 eal_get_virtual_area(void *requested_addr, size_t *size,
 		size_t page_sz, int flags, int mmap_flags)
@@ -60,6 +77,11 @@ 
 			rte_eal_process_type() == RTE_PROC_PRIMARY)
 		next_baseaddr = (void *) internal_config.base_virtaddr;
 
+#ifdef RTE_ARCH_64
+	if (next_baseaddr == NULL && internal_config.base_virtaddr == 0 &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY)
+		next_baseaddr = (void *) baseaddr;
+#endif
 	if (requested_addr == NULL && next_baseaddr != NULL) {
 		requested_addr = next_baseaddr;
 		requested_addr = RTE_PTR_ALIGN(requested_addr, page_sz);
@@ -89,9 +111,20 @@ 
 
 		mapped_addr = mmap(requested_addr, (size_t)map_sz, PROT_READ,
 				mmap_flags, -1, 0);
+
 		if (mapped_addr == MAP_FAILED && allow_shrink)
 			*size -= page_sz;
-	} while (allow_shrink && mapped_addr == MAP_FAILED && *size > 0);
+
+		if (mapped_addr != MAP_FAILED && addr_is_hint &&
+		    mapped_addr != requested_addr) {
+			/* hint was not used. Try with another offset */
+			munmap(mapped_addr, map_sz);
+			mapped_addr = MAP_FAILED;
+			next_baseaddr = RTE_PTR_ADD(next_baseaddr, 0x100000000);
+			requested_addr = next_baseaddr;
+		}
+	} while ((allow_shrink || addr_is_hint) &&
+		 mapped_addr == MAP_FAILED && *size > 0);
 
 	/* align resulting address - if map failed, we will ignore the value
 	 * anyway, so no need to add additional checks.