mempool: optimize put objects to mempool with cache

Message ID 20220117115231.8060-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series mempool: optimize put objects to mempool with cache

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/github-robot: build fail github build: failed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing warning Testing issues

Commit Message

Morten Brørup Jan. 17, 2022, 11:52 a.m. UTC
  This patch optimizes the rte_mempool_do_generic_put() caching algorithm.

The algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

(Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.)

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
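
In code, the new flow looks roughly like the following sketch (field
names as in the patch below; the debug variant and the optimized copy
loop are omitted, and the caller is assumed to have already diverted
requests that bypass the cache directly to the ring):

	#include <string.h>
	#include <rte_mempool.h>

	/* Simplified sketch of the new put path; not the actual patch. */
	static inline void
	mempool_put_sketch(struct rte_mempool *mp, void * const *obj_table,
			unsigned int n, struct rte_mempool_cache *cache)
	{
		void **cache_objs;

		if (cache->len + n <= cache->flushthresh) {
			/* The objects fit below the flush threshold:
			 * append them after those already in the cache.
			 */
			cache_objs = &cache->objs[cache->len];
			cache->len += n;
		} else {
			/* They do not fit: flush the whole cache (the old,
			 * cold objects) to the ring first, then keep only
			 * the new (hot) objects in the cache.
			 */
			rte_mempool_ops_enqueue_bulk(mp, cache->objs,
					cache->len);
			cache_objs = cache->objs;
			cache->len = n;
		}

		/* Add the new objects to the cache. */
		memcpy(cache_objs, obj_table, sizeof(void *) * n);
	}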

This patch fixes the following two inefficiencies of the old algorithm:

1. The most recent (hot) objects are flushed, leaving the oldest (cold)
objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

2. The cache is still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The current flushing behaviour is suboptimal for real-life applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.
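
For illustration, assume a cache size of 256 objects, i.e. a flush
threshold of 384. With the old algorithm, a put that brings the cache
to 400 objects flushes 400 - 256 = 144 objects (the most recently
added ones, cf. point 1) and leaves the cache full with 256 cold
objects. With the new algorithm, the same put first flushes everything
already in the cache to the ring and then stores only the newly put
objects, so the cache afterwards holds nothing but hot objects.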

Not adding the new objects to the mempool cache before flushing it also
allows the memory allocated for the mempool cache to be reduced from 3 x
to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
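
(With the old algorithm, the cache could momentarily hold up to the
flush threshold plus a full burst before being trimmed, i.e. roughly
2.5 x RTE_MEMPOOL_CACHE_MAX_SIZE objects; with the new algorithm it
never holds more than the flush threshold of 1.5 x the cache size, so
2 x RTE_MEMPOOL_CACHE_MAX_SIZE is sufficient.)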

Furthermore, a minor bug in the flush threshold comparison has been
corrected; it must be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

And finally, using the x86 variant of rte_memcpy() is inefficient here,
where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
The mempool cache is cache line aligned for the benefit of this copying
method, which on some CPU architectures performs worse on data crossing
a cache line boundary.
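
In isolation, the copy pattern is as sketched below (assuming a DPDK
build environment; rte_mov32() and rte_mov16() from rte_memcpy.h copy
32 and 16 bytes, i.e. four pointers on 64-bit and 32-bit targets,
respectively):

	#include <rte_memcpy.h>	/* rte_mov16(), rte_mov32() */

	/* Copy n object pointers from obj_table to cache_objs. */
	static inline void
	copy_objs_sketch(void **cache_objs, void * const *obj_table,
			unsigned int n)
	{
		/* Bulk-copy four pointers per iteration: 32 bytes on
		 * 64-bit targets, 16 bytes on 32-bit targets.
		 */
		for (; n >= 4; n -= 4) {
	#ifdef RTE_ARCH_64
			rte_mov32((unsigned char *)cache_objs,
					(const unsigned char *)obj_table);
	#else
			rte_mov16((unsigned char *)cache_objs,
					(const unsigned char *)obj_table);
	#endif
			cache_objs += 4;
			obj_table += 4;
		}
		/* Copy the remaining 0..3 pointers one by one. */
		for (; n > 0; --n)
			*cache_objs++ = *obj_table++;
	}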

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 47 ++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 13 deletions(-)
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..1ce850bedd 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@  struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1344,31 +1345,51 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (; n >= 4; n -= 4) {
+#ifdef RTE_ARCH_64
+		rte_mov32((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#else
+		rte_mov16((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#endif
+		cache_objs += 4;
+		obj_table += 4;
 	}
+	for (; n > 0; --n)
+		*cache_objs++ = *obj_table++;
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");