[v4] mempool: remove memory wastage on non x86

Message ID 20200114210603.1836160-1-jerinj@marvell.com
State Accepted, archived
Delegated to: Thomas Monjalon
Series
  • [v4] mempool: remove memory wastage on non x86

Checks

Context Check Description
ci/travis-robot success Travis build: passed
ci/Intel-compilation success Compilation OK
ci/iol-nxp-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-testing fail Testing issues
ci/iol-intel-Performance success Performance Testing PASS
ci/checkpatch success coding style OK

Commit Message

Jerin Jacob Kollanukkaran Jan. 14, 2020, 9:06 p.m. UTC
From: Jerin Jacob <jerinj@marvell.com>

The existing optimize_object_size() function addresses the memory object
alignment constraint on x86 for better performance.

Different (micro) architectures may have different memory alignment
constraints for better performance, and these are not the same as the
existing optimize_object_size().

Some use an XOR (CRC-like) scheme to enable DRAM channel distribution
based on the address, and some may use a different formula.
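
To illustrate, a hypothetical XOR-folded channel selector might look
like the sketch below; this is illustrative only and not taken from any
real memory controller or from this patch:

  #include <stdint.h>

  /* Illustrative only: XOR-fold the cache line index bits into the
   * channel bits, so consecutive lines already land on different
   * channels and no padding between objects is needed. */
  static unsigned int
  xor_channel_of(uintptr_t addr, unsigned int nchan_log2)
  {
          uintptr_t line = addr >> 6; /* 64-byte cache line index */
          unsigned int mask = (1u << nchan_log2) - 1;

          return (unsigned int)(line ^ (line >> nchan_log2) ^
                                (line >> (2 * nchan_log2))) & mask;
  }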

Introduce the arch_mem_object_align() function to abstract
these differences between (micro) architectures and to avoid
wasting memory on mempool object alignment for architectures
that do not require it.

Details on the amount of memory saving:

Currently, arm64-based architectures use the defaults (nchan=4,
nrank=1). The worst case is an object whose size (including the mempool
header) is 2 cache lines, which gets padded to 3 cache lines (+50%).

Examples for cache line size = 64:
  orig     optimized
  64    -> 64           +0%
  128   -> 192          +50%
  192   -> 192          +0%
  256   -> 320          +25%
  320   -> 320          +0%
  384   -> 448          +16%
  ...
  2304  -> 2368         +2.7%  (~mbuf size)
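
The optimized sizes above can be reproduced with a small standalone
sketch of the existing logic; nchan=4 and nrank=1 are hard-coded in
place of rte_memory_get_nchannel()/rte_memory_get_nrank(), and ALIGN
stands in for RTE_MEMPOOL_ALIGN:

  #include <stdio.h>

  #define ALIGN 64 /* stand-in for RTE_MEMPOOL_ALIGN */

  static unsigned int get_gcd(unsigned int a, unsigned int b)
  {
          while (b != 0) {
                  unsigned int t = a % b;
                  a = b;
                  b = t;
          }
          return a;
  }

  static unsigned int spread(unsigned int obj_size)
  {
          unsigned int n = (obj_size + ALIGN - 1) / ALIGN;

          /* grow until the line count is coprime with nchan * nrank */
          while (get_gcd(n, 4 * 1) != 1)
                  n++;
          return n * ALIGN;
  }

  int main(void)
  {
          unsigned int s[] = { 64, 128, 192, 256, 320, 384, 2304 };
          unsigned int i;

          for (i = 0; i < sizeof(s) / sizeof(s[0]); i++)
                  printf("%u -> %u\n", s[i], spread(s[i]));
          return 0;
  }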

Additional details:
https://www.mail-archive.com/dev@dpdk.org/msg149157.html

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v4:

- s/X86/x86 in the doc/guides/prog_guide/mempool_lib.rst file (David)
- Remove Fixes: and Cc: stable@dpdk.org as well, as this is just an
  improvement. (David)
- Update the patch heading to remove a checkpatch warning due to the
  above change.

v3:
- Changed the comment for the MEMPOOL_F_NO_SPREAD flag to
  "Spreading among memory channels not required." (Stephen Hemminger)

v2:
- Changed the return type of arch_mem_object_align() to "unsigned int" from
  "unsigned" to fix the checkpatch issues (Olivier Matz)
- Updated the comments for MEMPOOL_F_NO_SPREAD (Olivier Matz)
- Updated the commit log to include the memory saving details.

 doc/guides/prog_guide/mempool_lib.rst |  6 +++---
 lib/librte_mempool/rte_mempool.c      | 17 +++++++++++++----
 lib/librte_mempool/rte_mempool.h      |  3 ++-
 3 files changed, 18 insertions(+), 8 deletions(-)

Comments

Jerin Jacob Jan. 16, 2020, 1:10 p.m. UTC | #1
On Wed, Jan 15, 2020 at 2:35 AM <jerinj@marvell.com> wrote:
>
> [...]
>
> Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

Ping for merge.


Olivier Matz Jan. 20, 2020, 12:24 p.m. UTC | #2
On Thu, Jan 16, 2020 at 06:40:23PM +0530, Jerin Jacob wrote:
> On Wed, Jan 15, 2020 at 2:35 AM <jerinj@marvell.com> wrote:
> >
> > [...]
> >
> > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> Ping for merge.

Acked-by: Olivier Matz <olivier.matz@6wind.com>
Thomas Monjalon Jan. 20, 2020, 12:29 p.m. UTC | #3
20/01/2020 13:24, Olivier Matz:
> On Thu, Jan 16, 2020 at 06:40:23PM +0530, Jerin Jacob wrote:
> > On Wed, Jan 15, 2020 at 2:35 AM <jerinj@marvell.com> wrote:
> > >
> > > [...]
> > >
> > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > 
> > Ping for merge.
> 
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Applied, thanks

Patch

diff --git a/doc/guides/prog_guide/mempool_lib.rst b/doc/guides/prog_guide/mempool_lib.rst
index 3bb84b0a6..f8b430d65 100644
--- a/doc/guides/prog_guide/mempool_lib.rst
+++ b/doc/guides/prog_guide/mempool_lib.rst
@@ -27,10 +27,10 @@  In debug mode (CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG is enabled),
 statistics about get from/put in the pool are stored in the mempool structure.
 Statistics are per-lcore to avoid concurrent access to statistics counters.
 
-Memory Alignment Constraints
-----------------------------
+Memory Alignment Constraints on x86 architecture
+------------------------------------------------
 
-Depending on hardware memory configuration, performance can be greatly improved by adding a specific padding between objects.
+Depending on hardware memory configuration on X86 architecture, performance can be greatly improved by adding a specific padding between objects.
 The objective is to ensure that the beginning of each object starts on a different channel and rank in memory so that all channels are equally loaded.
 
 This is particularly true for packet buffers when doing L3 forwarding or flow classification.
diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 78d8eb941..1909998e8 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -45,6 +45,7 @@  EAL_REGISTER_TAILQ(rte_mempool_tailq)
 #define CALC_CACHE_FLUSHTHRESH(c)	\
 	((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))
 
+#if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
  *
@@ -74,12 +75,13 @@  static unsigned get_gcd(unsigned a, unsigned b)
 }
 
 /*
- * Depending on memory configuration, objects addresses are spread
+ * Depending on memory configuration on x86 arch, objects addresses are spread
  * between channels and ranks in RAM: the pool allocator will add
  * padding between objects. This function return the new size of the
  * object.
  */
-static unsigned optimize_object_size(unsigned obj_size)
+static unsigned int
+arch_mem_object_align(unsigned int obj_size)
 {
 	unsigned nrank, nchan;
 	unsigned new_obj_size;
@@ -99,6 +101,13 @@  static unsigned optimize_object_size(unsigned obj_size)
 		new_obj_size++;
 	return new_obj_size * RTE_MEMPOOL_ALIGN;
 }
+#else
+static unsigned int
+arch_mem_object_align(unsigned int obj_size)
+{
+	return obj_size;
+}
+#endif
 
 struct pagesz_walk_arg {
 	int socket_id;
@@ -234,8 +243,8 @@  rte_mempool_calc_obj_size(uint32_t elt_size, uint32_t flags,
 	 */
 	if ((flags & MEMPOOL_F_NO_SPREAD) == 0) {
 		unsigned new_size;
-		new_size = optimize_object_size(sz->header_size + sz->elt_size +
-			sz->trailer_size);
+		new_size = arch_mem_object_align
+			    (sz->header_size + sz->elt_size + sz->trailer_size);
 		sz->trailer_size = new_size - sz->header_size - sz->elt_size;
 	}
 
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index f81152af9..f48c94def 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -260,7 +260,8 @@  struct rte_mempool {
 #endif
 }  __rte_cache_aligned;
 
-#define MEMPOOL_F_NO_SPREAD      0x0001 /**< Do not spread among memory channels. */
+#define MEMPOOL_F_NO_SPREAD      0x0001
+/**< Spreading among memory channels not required. */
 #define MEMPOOL_F_NO_CACHE_ALIGN 0x0002 /**< Do not align objs on cache lines.*/
 #define MEMPOOL_F_SP_PUT         0x0004 /**< Default put is "single-producer".*/
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
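
For completeness, a minimal sketch of how an application opts out of
the spreading via the flag above; the pool name and sizes here are
arbitrary:

  #include <rte_mempool.h>

  /* In application init code, after rte_eal_init(): pack objects
   * without channel-spreading padding. With this patch, non-x86
   * builds behave this way even without the flag. */
  struct rte_mempool *mp = rte_mempool_create("example_pool",
          8192,           /* element count */
          2048,           /* element size */
          256,            /* per-lcore cache size */
          0,              /* private data size */
          NULL, NULL,     /* pool constructor + arg */
          NULL, NULL,     /* per-object constructor + arg */
          SOCKET_ID_ANY,  /* any NUMA socket */
          MEMPOOL_F_NO_SPREAD);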