[v4] mempool: fix get objects from mempool with cache

Message ID 20221007104450.2567961-1-andrew.rybchenko@oktetlabs.ru (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Headers
Series [v4] mempool: fix get objects from mempool with cache

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Andrew Rybchenko Oct. 7, 2022, 10:44 a.m. UTC
  From: Morten Brørup <mb@smartsharesystems.com>

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

Fix the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could have been.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.
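
A minimal sketch of the change in screening logic (simplified from the
diff below; the real code also tracks a remaining count):

	/* Before: any request of cache->size objects or more bypassed the
	 * cache, even though the cache can hold up to flushthresh objects. */
	if (unlikely(cache == NULL || n >= cache->size))
		goto ring_dequeue;

	/* After: only a missing cache, or a request too large to ever fit in
	 * the memory allocated for the cache, bypasses the cache. */
	if (unlikely(cache == NULL))
		goto driver_dequeue;
	...
	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto driver_dequeue;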

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, even though they
could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.
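
As an illustration only (the pool name, sizes and function below are
hypothetical, not taken from the patch), such a configuration looks
roughly like this:

	#include <rte_lcore.h>
	#include <rte_mbuf.h>
	#include <rte_mempool.h>

	#define BURST_SIZE 32 /* application burst size == mempool cache size */

	static int
	alloc_one_burst(void)
	{
		struct rte_mbuf *bufs[BURST_SIZE];
		struct rte_mempool *mp;

		mp = rte_pktmbuf_pool_create("pkt_pool", 8191,
				BURST_SIZE /* cache_size */, 0,
				RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
		if (mp == NULL)
			return -1;

		/* Before the fix, a burst of exactly BURST_SIZE objects never
		 * hit the mempool cache; with the fix it is served from it. */
		if (rte_pktmbuf_alloc_bulk(mp, bufs, BURST_SIZE) != 0)
			return -1;

		rte_pktmbuf_free_bulk(bufs, BURST_SIZE);
		return 0;
	}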

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the backend.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the backend were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the backend.
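
In the patched code (see the diff at the end of this message), this
cache-first behavior is the first copy loop, roughly:

	/* Use the cache as much as possible, returning the hot objects from
	 * the top of the cache first. */
	len = RTE_MIN(remaining, cache->len);
	cache_objs = &cache->objs[cache->len];
	cache->len -= len;
	remaining -= len;
	for (index = 0; index < len; index++)
		*obj_table++ = *--cache_objs;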

The function was not behaving as described (by the function using it)
and as expected by applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the backend (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the backend fails,
only the remaining requested objects are retrieved from the backend.

The function would fail even though there were enough objects in the
cache plus the common pool.
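
Roughly, as in the diff below: if backfilling cache->size + remaining
objects from the backend fails, the code falls through to the
driver_dequeue label and only asks the backend for what is still missing:

	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
			cache->size + remaining);
	if (unlikely(ret < 0)) {
		/* Cannot backfill the cache; leave the cache untouched and
		 * get only the remaining objects from the backend. */
		goto driver_dequeue;
	}
	...
driver_dequeue:
	/* Get remaining objects directly from the backend. */
	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);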

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
---
v4 changes (Andrew Rybchenko)
 - Avoid usage of the misleading term ring, since other mempool drivers
   exist; use the term backend
 - Avoid the term ring in the goto label; use driver_dequeue as the label name
 - Add likely() to cache != NULL in driver dequeue, just for symmetry
 - Highlight that remaining objects are dequeued from the driver

v3 changes (Andrew Rybchenko)
 - Always get first objects from the cache even if request is bigger
   than cache size. Remove one corresponding condition from the path
   when request is fully served from cache.
 - Simplify code to avoid duplication:
    - Get objects directly from backend in single place only.
    - Share code which gets from the cache first regardless if
      everything is obtained from the cache or just the first part.
 - Rollback cache length in unlikely failure branch to avoid cache
   vs NULL check in success branch.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

 lib/mempool/rte_mempool.h | 80 +++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 29 deletions(-)
  

Comments

Thomas Monjalon Oct. 8, 2022, 8:56 p.m. UTC | #1
07/10/2022 12:44, Andrew Rybchenko:
> From: Morten Brørup <mb@smartsharesystems.com>
> 
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> Fix the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> This bug caused a major performance degradation in scenarios where the
> application burst length is the same as the cache size. In such cases,
> the objects were not ever fetched from the mempool cache, regardless if
> they could have been.
> This scenario occurs e.g. if an application has configured a mempool
> with a size matching the application's burst size.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the backend.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the backend were returned ahead of the objects from the cache.
> This bug degraded application performance on CPUs with a small L1 cache,
> which benefit from having the hot objects first in the returned array.
> (This is probably also the reason why the function returns the objects
> in reverse order, which it still does.)
> Now, all code paths first return objects from the cache, subsequently
> from the backend.
> 
> The function was not behaving as described (by the function using it)
> and expected by applications using it. This in itself is also a bug.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the backend (instead of only the
> number of requested objects minus the objects available in the backend),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the backend fails,
> only the remaining requested objects are retrieved from the backend.
> 
> The function would fail despite there are enough objects in the cache
> plus the common pool.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

Applied, thanks.
  
Morten Brørup Oct. 11, 2022, 8:30 p.m. UTC | #2
Dear Intel PMD maintainers (CC: techboard),

I strongly recommend that you update the code you copy-pasted from the mempool library to your PMDs, so they reflect the new and improved mempool cache behavior [1]. When choosing to copy-paste code from a core library, you should feel obliged to keep your copied code matching the source code you copied it from!

Also, as reported in bug #1052, you forgot to copy-paste the instrumentation, thereby 1. making the mempool debug statistics invalid and 2. omitting the mempool accesses from the trace when using your PMDs. :-(

Alternatively, just remove the copy-pasted code and use the mempool library's API instead. ;-)
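
For example, a PMD's mbuf allocation path could be as simple as this
(just a sketch; the variable names are made up):

	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());

	/* The library handles the per-lcore cache, statistics and tracing. */
	ret = rte_mempool_generic_get(mp, (void **)rxq->free_mbufs, n, cache);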

The direct re-arm code also contains copy-pasted mempool cache handling code - which was accepted with the argument that the same code was already copy-pasted elsewhere. I don't know if the direct re-arm code also needs updating... Authors of that patch (CC to this email), please coordinate with the PMD maintainers.

PS:  As noted in the 22.11-rc1 release notes, more changes to the mempool library [2] may be coming.

[1]: https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-andrew.rybchenko@oktetlabs.ru/

[2]: https://patches.dpdk.org/project/dpdk/list/?series=25063

-Morten
  
Honnappa Nagarahalli Oct. 11, 2022, 9:47 p.m. UTC | #3
<snip>

> 
> Dear Intel PMD maintainers (CC: techboard),
> 
> I strongly recommend that you update the code you copy-pasted from the
> mempool library to your PMDs, so they reflect the new and improved
> mempool cache behavior [1]. When choosing to copy-paste code from a core
> library, you should feel obliged to keep your copied code matching the source
> code you copied it from!
> 
> Also, as reported in bug #1052, you forgot to copy-paste the instrumentation,
> thereby 1. making the mempool debug statistics invalid and 2. omitting the
> mempool accesses from the trace when using your PMDs. :-(
We are working on mempool APIs to expose the per-core cache memory to PMDs so that the buffers can be copied directly. We are planning to fix this duplication as part of that.

> 
> Alternatively, just remove the copy-pasted code and use the mempool
> library's API instead. ;-)
> 
> The direct re-arm code also contains copy-pasted mempool cache handling
> code - which was accepted with the argument that the same code was
> already copy-pasted elsewhere. I don't know if the direct re-arm code also
> needs updating... Authors of that patch (CC to this email), please coordinate
> with the PMD maintainers.
Direct-rearm patch is not accepted yet.

> 
> PS:  As noted in the 22.11-rc1 release notes, more changes to the mempool
> library [2] may be coming.
> 
> [1]:
> https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-
> andrew.rybchenko@oktetlabs.ru/
> 
> [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> 
> -Morten
  
Olivier Matz Oct. 14, 2022, 2:01 p.m. UTC | #4
On Sat, Oct 08, 2022 at 10:56:06PM +0200, Thomas Monjalon wrote:
> 07/10/2022 12:44, Andrew Rybchenko:
> > From: Morten Brørup <mb@smartsharesystems.com>
> > 
> > A flush threshold for the mempool cache was introduced in DPDK version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> > 
> > Fix the following in rte_mempool_do_generic_get():
> > 
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction of
> > the flush threshold. E.g. the cache can hold up to flushthresh objects,
> > which is more than its size, so some requests were not served from the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> > 
> > This bug caused a major performance degradation in scenarios where the
> > application burst length is the same as the cache size. In such cases,
> > the objects were not ever fetched from the mempool cache, regardless if
> > they could have been.
> > This scenario occurs e.g. if an application has configured a mempool
> > with a size matching the application's burst size.
> > 
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the backend.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the objects
> > from the backend were returned ahead of the objects from the cache.
> > This bug degraded application performance on CPUs with a small L1 cache,
> > which benefit from having the hot objects first in the returned array.
> > (This is probably also the reason why the function returns the objects
> > in reverse order, which it still does.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the backend.
> > 
> > The function was not behaving as described (by the function using it)
> > and expected by applications using it. This in itself is also a bug.
> > 
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the backend (instead of only the
> > number of requested objects minus the objects available in the backend),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the cache,
> > and if the subsequent backfilling of the cache from the backend fails,
> > only the remaining requested objects are retrieved from the backend.
> > 
> > The function would fail despite there are enough objects in the cache
> > plus the common pool.
> > 
> > 4. The code flow for satisfying the request from the cache was slightly
> > inefficient:
> > The likely code path where the objects are simply served from the cache
> > was treated as unlikely. Now it is treated as likely.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Applied, thanks.

Better late than never: I reviewed this patch after it has been pushed,
and it looks good to me.

Thanks,
Olivier
  
Morten Brørup Oct. 30, 2022, 8:44 a.m. UTC | #5
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Tuesday, 11 October 2022 23.48
> 
> <snip>
> 
> >
> > Dear Intel PMD maintainers (CC: techboard),
> >
> > I strongly recommend that you update the code you copy-pasted from
> the
> > mempool library to your PMDs, so they reflect the new and improved
> > mempool cache behavior [1]. When choosing to copy-paste code from a
> core
> > library, you should feel obliged to keep your copied code matching
> the source
> > code you copied it from!
> >
> > Also, as reported in bug #1052, you forgot to copy-paste the
> instrumentation,
> > thereby 1. making the mempool debug statistics invalid and 2.
> omitting the
> > mempool accesses from the trace when using your PMDs. :-(
> We are working on mempool APIs to expose the per core cache memory to
> PMD so that the buffers can be copied directly. We are planning to fix
> this duplication as part of that.

Is the copy-paste bug fix going to make it for 22.11?

Otherwise, these PMDs are managing the mempool cache differently than the mempool library does. (And the mempool library instrumentation will remain partially bypassed for these PMDs.) This should be mentioned as a known bug in the release notes.

> 
> >
> > Alternatively, just remove the copy-pasted code and use the mempool
> > library's API instead. ;-)
> >
> > The direct re-arm code also contains copy-pasted mempool cache
> handling
> > code - which was accepted with the argument that the same code was
> > already copy-pasted elsewhere. I don't know if the direct re-arm code
> also
> > needs updating... Authors of that patch (CC to this email), please
> coordinate
> > with the PMD maintainers.
> Direct-rearm patch is not accepted yet.
> 
> >
> > PS:  As noted in the 22.11-rc1 release notes, more changes to the
> mempool
> > library [2] may be coming.
> >
> > [1]:
> > https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-
> > andrew.rybchenko@oktetlabs.ru/
> >
> > [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> >
> > -Morten
>
  
Honnappa Nagarahalli Oct. 30, 2022, 10:50 p.m. UTC | #6
<snip>
> >
> > >
> > > Dear Intel PMD maintainers (CC: techboard),
> > >
> > > I strongly recommend that you update the code you copy-pasted from
> > the
> > > mempool library to your PMDs, so they reflect the new and improved
> > > mempool cache behavior [1]. When choosing to copy-paste code from a
> > core
> > > library, you should feel obliged to keep your copied code matching
> > the source
> > > code you copied it from!
> > >
> > > Also, as reported in bug #1052, you forgot to copy-paste the
> > instrumentation,
> > > thereby 1. making the mempool debug statistics invalid and 2.
> > omitting the
> > > mempool accesses from the trace when using your PMDs. :-(
> > We are working on mempool APIs to expose the per core cache memory to
> > PMD so that the buffers can be copied directly. We are planning to fix
> > this duplication as part of that.
> 
> Is the copy-paste bug fix going to make it for 22.11?
It will not make it to 22.11. It is targeted for 23.02.

> 
> Otherwise, these PMDs are managing the mempool cache differently than
> the mempool library does. (And the mempool library instrumentation will
> remain partially bypassed for these PMDs.) This should be mentioned as a
> know bug in the release notes.
Agree

> 
> >
> > >
> > > Alternatively, just remove the copy-pasted code and use the mempool
> > > library's API instead. ;-)
> > >
> > > The direct re-arm code also contains copy-pasted mempool cache
> > handling
> > > code - which was accepted with the argument that the same code was
> > > already copy-pasted elsewhere. I don't know if the direct re-arm
> > > code
> > also
> > > needs updating... Authors of that patch (CC to this email), please
> > coordinate
> > > with the PMD maintainers.
> > Direct-rearm patch is not accepted yet.
> >
> > >
> > > PS:  As noted in the 22.11-rc1 release notes, more changes to the
> > mempool
> > > library [2] may be coming.
> > >
> > > [1]:
> > >
> https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1
> > > -
> > > andrew.rybchenko@oktetlabs.ru/
> > >
> > > [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> > >
> > > -Morten
> >
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 4c4af2a8ed..2401c4ac80 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1435,60 +1435,82 @@  rte_mempool_put(struct rte_mempool *mp, void *obj)
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  * @return
  *   - >=0: Success; number of objects supplied.
- *   - <0: Error; code of ring dequeue function.
+ *   - <0: Error; code of driver dequeue function.
  */
 static __rte_always_inline int
 rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
+	unsigned int remaining = n;
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
-		goto ring_dequeue;
+	/* No cache provided */
+	if (unlikely(cache == NULL))
+		goto driver_dequeue;
 
-	cache_objs = cache->objs;
+	/* Use the cache as much as we have to return hot objects first */
+	len = RTE_MIN(remaining, cache->len);
+	cache_objs = &cache->objs[cache->len];
+	cache->len -= len;
+	remaining -= len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+	if (remaining == 0) {
+		/* The entire request is satisfied from the cache. */
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
-		if (unlikely(ret < 0)) {
-			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
-			 */
-			goto ring_dequeue;
-		}
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		cache->len += req;
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	/* if dequeue below would overflow mem allocated for cache */
+	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+		goto driver_dequeue;
+
+	/* Fill the cache from the backend; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + remaining);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the backend.
+		 */
+		goto driver_dequeue;
+	}
+
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache_objs = &cache->objs[cache->size + remaining];
+	for (index = 0; index < remaining; index++)
+		*obj_table++ = *--cache_objs;
 
-	cache->len -= n;
+	cache->len = cache->size;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
 	return 0;
 
-ring_dequeue:
+driver_dequeue:
 
-	/* get remaining objects from ring */
-	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
+	/* Get remaining objects directly from the backend. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);
 
 	if (ret < 0) {
+		if (likely(cache != NULL)) {
+			cache->len = n - remaining;
+			/*
+			 * No further action is required to roll the first part
+			 * of the request back into the cache, as objects in
+			 * the cache are intact.
+			 */
+		}
+
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
 	} else {