[RFC] mempool: CPU cache aligning mempool driver accesses

Message ID 98CBD80474FA8B44BF855DF32C47DC35E9EFD4@smartserver.smartshare.dk (mailing list archive)
State New
Delegated to: Thomas Monjalon
Series [RFC] mempool: CPU cache aligning mempool driver accesses

Checks

Context Check Description
ci/loongarch-compilation warning apply patch failure
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS

Commit Message

Morten Brørup Nov. 4, 2023, 5:29 p.m. UTC
  I tried a little experiment, which gave a 25 % improvement in mempool
perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
constant_n=0) on a Xeon E5-2620 v4 based system.

This is the concept:

If all accesses to the mempool driver go through the mempool cache,
we can ensure that these bulk loads/stores are always CPU cache aligned,
by using cache->size when loading/storing to the mempool driver.

Furthermore, it is rumored that most applications use the default
mempool cache size, so if the driver tests for that specific value,
it can use rte_memcpy(dst, src, N) with N known at build time, allowing
optimal performance for copying the array of objects.

Unfortunately, I need to change the flush threshold multiplier from 1.5
to 2 to be able to always use cache->size when loading/storing to the
mempool driver.

What do you think?

PS: If we can't get rid of the mempool cache size threshold factor,
we really need to expose it through public APIs. A job for another day.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
  

Comments

Bruce Richardson Nov. 6, 2023, 9:45 a.m. UTC | #1
On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> I tried a little experiment, which gave a 25 % improvement in mempool
> perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> constant_n=0) on a Xeon E5-2620 v4 based system.
> 
> This is the concept:
> 
> If all accesses to the mempool driver go through the mempool cache,
> we can ensure that these bulk load/stores are always CPU cache aligned,
> by using cache->size when loading/storing to the mempool driver.
> 
> Furthermore, it is rumored that most applications use the default
> mempool cache size, so if the driver tests for that specific value,
> it can use rte_memcpy(dst, src, N) with N known at build time, allowing
> optimal performance for copying the array of objects.
> 
> Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> be able to always use cache->size when loading/storing to the mempool
> driver.
> 
> What do you think?
> 
> PS: If we can't get rid of the mempool cache size threshold factor,
> we really need to expose it through public APIs. A job for another day.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
Interesting, thanks.

Out of interest, is there any difference in performance you observe if using
regular libc memcpy vs rte_memcpy for the ring copies? Since the copy
amount is constant, a regular memcpy call should be expanded by the
compiler itself, and so should be pretty efficient.

/Bruce
  
Morten Brørup Nov. 6, 2023, 10:29 a.m. UTC | #2
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 6 November 2023 10.45
> 
> On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > I tried a little experiment, which gave a 25 % improvement in mempool
> > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > constant_n=0) on a Xeon E5-2620 v4 based system.
> >
> > This is the concept:
> >
> > If all accesses to the mempool driver go through the mempool cache,
> > we can ensure that these bulk load/stores are always CPU cache
> aligned,
> > by using cache->size when loading/storing to the mempool driver.
> >
> > Furthermore, it is rumored that most applications use the default
> > mempool cache size, so if the driver tests for that specific value,
> > it can use rte_memcpy(dst, src, N) with N known at build time, allowing
> > optimal performance for copying the array of objects.
> >
> > Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> > be able to always use cache->size when loading/storing to the mempool
> > driver.
> >
> > What do you think?
> >
> > PS: If we can't get rid of the mempool cache size threshold factor,
> > we really need to expose it through public APIs. A job for another
> day.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> Interesting, thanks.
> 
> Out of interest, is there any difference in performance you observe if
> using
> regular libc memcpy vs rte_memcpy for the ring copies? Since the copy
> amount is constant, a regular memcpy call should be expanded by the
> compiler itself, and so should be pretty efficient.

I ran some tests without patching rte_ring_elem_pvt.h, i.e. without introducing the constant-size copy loop. I got the majority of the performance gain at this point.

At this point, both pointers are CPU cache aligned when refilling the mempool cache, and the destination pointer is CPU cache aligned when draining the mempool cache.

In other words: When refilling the mempool cache, it is both loading and storing entire CPU cache lines. And when draining, it is storing entire CPU cache lines.


Adding the fixed-size copy loop provided an additional performance gain. I didn't test other constant-size copy methods than rte_memcpy.

rte_memcpy should have optimal conditions in this patch, because N is known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers are CPU cache aligned when refilling the mempool cache, and the destination pointer is CPU cache aligned when draining the mempool cache. I don't recall if pointer alignment matters for rte_memcpy, though.

The memcpy in libc (or more correctly: the compiler's built-in memcpy) will do non-temporal copying for large sizes, and I don't know what that threshold is, so I think rte_memcpy is the safe bet here. Especially if someone builds DPDK with a larger mempool cache size than 512 objects.

On the other hand, non-temporal access to the objects in the ring might be beneficial if the ring is so large that they go cold before the application loads them from the ring again.
  
Morten Brørup Nov. 9, 2023, 10:45 a.m. UTC | #3
+TO: Andrew, mempool maintainer

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Monday, 6 November 2023 11.29
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 6 November 2023 10.45
> >
> > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > > I tried a little experiment, which gave a 25 % improvement in
> mempool
> > > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > > constant_n=0) on a Xeon E5-2620 v4 based system.
> > >
> > > This is the concept:
> > >
> > > If all accesses to the mempool driver go through the mempool
> cache,
> > > we can ensure that these bulk load/stores are always CPU cache
> > aligned,
> > > by using cache->size when loading/storing to the mempool driver.
> > >
> > > Furthermore, it is rumored that most applications use the default
> > > mempool cache size, so if the driver tests for that specific value,
> > > it can use rte_memcpy(dst, src, N) with N known at build time,
> allowing
> > > optimal performance for copying the array of objects.
> > >
> > > Unfortunately, I need to change the flush threshold from 1.5 to 2
> to
> > > be able to always use cache->size when loading/storing to the
> mempool
> > > driver.
> > >
> > > What do you think?

It is the concept of accessing the underlying mempool in entire cache lines that I am seeking feedback on.

The provided code is just an example, mainly for testing performance of the concept.

> > >
> > > PS: If we can't get rid of the mempool cache size threshold factor,
> > > we really need to expose it through public APIs. A job for another
> > day.

The concept that a mempool per-lcore cache can hold more objects than its size is extremely weird, and certainly unexpected by any normal developer, and thus likely to cause runtime errors for applications that size their mempools tightly.

So, if we move forward with this RFC, I propose eliminating the threshold factor, so the mempool per-lcore caches cannot hold more objects than their size.
When doing this, we might also choose to double RTE_MEMPOOL_CACHE_MAX_SIZE, to prevent any performance degradation.

> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > Interesting, thanks.
> >
> > Out of interest, is there any difference in performance you observe if
> > using
> > regular libc memcpy vs rte_memcpy for the ring copies? Since the copy
> > amount is constant, a regular memcpy call should be expanded by the
> > compiler itself, and so should be pretty efficient.
> 
> I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> introducing the constant-size copy loop. I got the majority of the
> performance gain at this point.
> 
> At this point, both pointers are CPU cache aligned when refilling the
> mempool cache, and the destination pointer is CPU cache aligned when
> draining the mempool cache.
> 
> In other words: When refilling the mempool cache, it is both loading
> and storing entire CPU cache lines. And when draining, it is storing
> entire CPU cache lines.
> 
> 
> Adding the fixed-size copy loop provided an additional performance
> gain. I didn't test other constant-size copy methods than rte_memcpy.
> 
> rte_memcpy should have optimal conditions in this patch, because N is
> known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> are CPU cache aligned when refilling the mempool cache, and the
> destination pointer is CPU cache aligned when draining the mempool
> cache. I don't recall if pointer alignment matters for rte_memcpy,
> though.
> 
> The memcpy in libc (or more correctly: intrinsic to the compiler) will
> do non-temporal copying for large sizes, and I don't know what that
> threshold is, so I think rte_memcpy is the safe bet here. Especially if
> someone builds DPDK with a larger mempool cache size than 512 objects.
> 
> On the other hand, non-temporal access to the objects in the ring might
> be beneficial if the ring is so large that they go cold before the
> application loads them from the ring again.
  

Patch

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 7a7a9bf6db..b21033209b 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -48,7 +48,7 @@  static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
                              struct rte_mempool *mp);

-#define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
+#define CACHE_FLUSHTHRESH_MULTIPLIER 2
 #define CALC_CACHE_FLUSHTHRESH(c)      \
        ((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index df87cd231e..76efeff59e 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1014,7 +1014,7 @@  typedef void (rte_mempool_ctor_t)(struct rte_mempool *, void *);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 2. It is advised to choose
  *   cache_size to have "n modulo cache_size == 0": if this is
  *   not the case, some elements will always stay in the pool and will
  *   never be used. The access to the per-lcore table is of course
@@ -1373,24 +1373,24 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
        RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);

        /* The request itself is too big for the cache */
-       if (unlikely(n > cache->flushthresh))
+       if (unlikely(n > cache->size))
                goto driver_enqueue_stats_incremented;

        /*
         * The cache follows the following algorithm:
         *   1. If the objects cannot be added to the cache without crossing
-        *      the flush threshold, flush the cache to the backend.
+        *      the flush threshold, flush a fixed amount of the cache to the backend.
         *   2. Add the objects to the cache.
         */

        if (cache->len + n <= cache->flushthresh) {
                cache_objs = &cache->objs[cache->len];
-               cache->len += n;
        } else {
-               cache_objs = &cache->objs[0];
-               rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-               cache->len = n;
+               cache->len -= cache->size;
+               cache_objs = &cache->objs[cache->len];
+               rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size);
        }
+       cache->len += n;

        /* Add the objects to the cache. */
        rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
@@ -1547,13 +1547,13 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
                return 0;
        }

-       /* if dequeue below would overflow mem allocated for cache */
-       if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+       /* More remaining than the cache size */
+       if (unlikely(remaining > cache->size))
                goto driver_dequeue;

-       /* Fill the cache from the backend; fetch size + remaining objects. */
+       /* Fill the cache from the backend; fetch size objects. */
        ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-                       cache->size + remaining);
+                       cache->size);
        if (unlikely(ret < 0)) {
                /*
                 * We are buffer constrained, and not able to allocate
@@ -1565,11 +1565,11 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
        }

        /* Satisfy the remaining part of the request from the filled cache. */
-       cache_objs = &cache->objs[cache->size + remaining];
+       cache_objs = &cache->objs[cache->size];
        for (index = 0; index < remaining; index++)
                *obj_table++ = *--cache_objs;

-       cache->len = cache->size;
+       cache->len = cache->size - remaining;

        RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
        RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 4b80f58980..2b10b76fc1 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -10,6 +10,9 @@ 
 #ifndef _RTE_RING_ELEM_PVT_H_
 #define _RTE_RING_ELEM_PVT_H_

+#include <rte_config.h>
+#include <rte_memcpy.h>
+
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 120000)
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wstringop-overflow"
@@ -24,6 +27,12 @@  __rte_ring_enqueue_elems_32(struct rte_ring *r, const uint32_t size,
        uint32_t *ring = (uint32_t *)&r[1];
        const uint32_t *obj = (const uint32_t *)obj_table;
        if (likely(idx + n <= size)) {
+#ifdef RTE_ARCH_32
+               if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
+                       rte_memcpy(&ring[idx], obj_table, RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(uint32_t));
+                       return;
+               }
+#endif
                for (i = 0; i < (n & ~0x7); i += 8, idx += 8) {
                        ring[idx] = obj[i];
                        ring[idx + 1] = obj[i + 1];
@@ -69,6 +78,12 @@  __rte_ring_enqueue_elems_64(struct rte_ring *r, uint32_t prod_head,
        uint64_t *ring = (uint64_t *)&r[1];
        const unaligned_uint64_t *obj = (const unaligned_uint64_t *)obj_table;
        if (likely(idx + n <= size)) {
+#ifdef RTE_ARCH_64
+               if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
+                       rte_memcpy(&ring[idx], obj_table, RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(uint64_t));
+                       return;
+               }
+#endif
                for (i = 0; i < (n & ~0x3); i += 4, idx += 4) {
                        ring[idx] = obj[i];
                        ring[idx + 1] = obj[i + 1];
@@ -158,6 +173,12 @@  __rte_ring_dequeue_elems_32(struct rte_ring *r, const uint32_t size,
        uint32_t *ring = (uint32_t *)&r[1];
        uint32_t *obj = (uint32_t *)obj_table;
        if (likely(idx + n <= size)) {
+#ifdef RTE_ARCH_32
+               if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
+                       rte_memcpy(obj_table, &ring[idx], RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(uint32_t));
+                       return;
+               }
+#endif
                for (i = 0; i < (n & ~0x7); i += 8, idx += 8) {
                        obj[i] = ring[idx];
                        obj[i + 1] = ring[idx + 1];
@@ -203,6 +224,12 @@  __rte_ring_dequeue_elems_64(struct rte_ring *r, uint32_t cons_head,
        uint64_t *ring = (uint64_t *)&r[1];
        unaligned_uint64_t *obj = (unaligned_uint64_t *)obj_table;
        if (likely(idx + n <= size)) {
+#ifdef RTE_ARCH_64
+               if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
+                       rte_memcpy(obj_table, &ring[idx], RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(uint64_t));
+                       return;
+               }
+#endif
                for (i = 0; i < (n & ~0x3); i += 4, idx += 4) {
                        obj[i] = ring[idx];
                        obj[i + 1] = ring[idx + 1];