[v2] mempool: fix get objects from mempool with cache

Message ID 20220202081426.77975-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: [v2] mempool: fix get objects from mempool with cache

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Morten Brørup Feb. 2, 2022, 8:14 a.m. UTC
  A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.
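For context, the flush threshold works on the put side roughly as in the
following simplified model. This is not the DPDK source; the struct and
helper names are made up for illustration, the array size is arbitrary,
and the flushthresh value of roughly 1.5 * size is an assumption based on
the DPDK default.

struct model_cache {
	unsigned int size;        /* nominal cache size */
	unsigned int flushthresh; /* assumed to be roughly 1.5 * size */
	unsigned int len;         /* number of objects currently cached */
	void *objs[1024];         /* generously sized for this model */
};

/*
 * Model of the put side: objects always go into the cache first, and only
 * when the flush threshold is crossed is the excess pushed back to the
 * ring, leaving 'size' objects cached. Between flushes the cache can
 * therefore legitimately hold more than 'size' objects - which is what
 * the old get-side screening failed to take into account.
 * 'enqueue_ring' stands in for the ring enqueue operation.
 */
static void
model_put(struct model_cache *c, void * const *obj_table, unsigned int n,
	  void (*enqueue_ring)(void * const *obj_table, unsigned int n))
{
	unsigned int i;

	for (i = 0; i < n; i++)
		c->objs[c->len + i] = obj_table[i];
	c->len += n;

	if (c->len >= c->flushthresh) {
		enqueue_ring(&c->objs[c->size], c->len - c->size);
		c->len = c->size;
	}
}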

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. For example, the cache can hold up to flushthresh
objects, which is more than its size, so some requests were not served
from the cache, even though they could have been.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, even when they
could have been.
This scenario occurs e.g. if an application has configured a mempool
with a cache size matching the application's burst size, as sketched below.
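As an illustration, here is a minimal usage sketch of that scenario. The
mempool setup is omitted, and the burst size of 32 - assumed to equal the
cache size given at mempool creation - is made up for this example.

#include <rte_mempool.h>
#include <rte_lcore.h>

#define BURST 32 /* assumed to equal the mempool's cache size */

static int
burst_cycle(struct rte_mempool *mp)
{
	void *objs[BURST];
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());

	/*
	 * Before this fix, n == cache->size failed the old 'n >= cache->size'
	 * screening, so this get always bypassed the cache and hit the ring.
	 */
	if (rte_mempool_generic_get(mp, objs, BURST, cache) < 0)
		return -1;

	/* ... use the objects ... */

	rte_mempool_generic_put(mp, objs, BURST, cache);
	return 0;
}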

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

The function was not behaving as described by the function using it, nor
as expected by the applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects already taken from the
cache), and the function would fail if that dequeue failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

The function could thus fail even though there were enough objects in
the cache plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.
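For reference, the corrected control flow described in points 2-4 can be
summarised by the simplified model below. This is not the patched DPDK
code (see the diff); the initial screening, statistics and likely/unlikely
hints are omitted, the names are made up for illustration, and 'dequeue'
stands in for rte_mempool_ops_dequeue_bulk, assumed to be all-or-nothing
like a ring bulk dequeue (0 on success, negative on failure).

struct model_cache {
	unsigned int size;        /* nominal cache size */
	unsigned int flushthresh; /* as in the put-side model; not used here */
	unsigned int len;         /* number of objects currently cached */
	void *objs[1024];         /* generously sized for this model */
};

typedef int (*dequeue_fn)(void **obj_table, unsigned int n);

static int
model_get(struct model_cache *c, void **obj_table, unsigned int n,
	  dequeue_fn dequeue)
{
	void **cache_objs = &c->objs[c->len];
	unsigned int i, remaining;

	if (n <= c->len) {
		/* Likely path: the whole request is served from the cache. */
		c->len -= n;
		for (i = 0; i < n; i++)
			*obj_table++ = *--cache_objs;
		return 0;
	}

	/* Return the hot, cached objects first (point 2). */
	for (i = 0; i < c->len; i++)
		*obj_table++ = *--cache_objs;
	remaining = n - c->len;

	/* Backfill the cache and the remainder in a single dequeue (point 4). */
	if (dequeue(c->objs, c->size + remaining) == 0) {
		cache_objs = &c->objs[c->size + remaining];
		c->len = c->size;
		for (i = 0; i < remaining; i++)
			*obj_table++ = *--cache_objs;
		return 0;
	}

	/* Backfill failed: fetch only the remainder from the ring (point 3). */
	if (dequeue(obj_table, remaining) == 0) {
		c->len = 0; /* commit that the cache was emptied above */
		return 0;
	}

	/*
	 * Both dequeues failed. c->len and the cached objects are untouched,
	 * so the first part of the request is implicitly rolled back.
	 */
	return -1;
}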

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)
  

Comments

Morten Brørup June 15, 2022, 9:18 p.m. UTC | #1
+CC: Beilei Xing <beilei.xing@intel.com>, i40e maintainer, who may be interested in the performance improvements achieved by this patch.

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 09.14
> 
> [Original v2 commit message and diff quoted in full; identical to the
> patch shown above.]

PING.

According to Patchwork [1], this patch provides up to a 10.9 % single-thread throughput improvement on XL710 with x86, and a 0.7 % improvement with ARM.

Still no interest?

PS: Bruce reviewed V1 of this patch [2], but I don't think it is appropriate to copy a Reviewed-by tag from one version of a patch to another, regardless of how small the changes are.

[1] http://mails.dpdk.org/archives/test-report/2022-February/256462.html
[2] http://inbox.dpdk.org/dev/YeaDSxj%2FuZ0vPMl+@bricha3-MOBL.ger.corp.intel.com/
  
Morten Brørup Sept. 29, 2022, 10:52 a.m. UTC | #2
PING again.

If the explanation and/or diff is too long-winded, just look at the resulting code instead - it is clean and easily readable.

This patch should not be controversial, so I would like to see it merged into the coming LTS release. (Unlike my other mempool patch [3], which changes the behavior of the mempool cache.)

[3]: https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/

Med venlig hilsen / Kind regards,
-Morten Brørup

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 15 June 2022 23.18
> 
> [Comment #1 quoted in full, including the original patch; see above.]
  
Andrew Rybchenko Oct. 4, 2022, 12:57 p.m. UTC | #3
Hi Morten,

In general I agree that the fix is required.
In sent v3 I'm trying to make it a bit better from my point of
view. See few notes below.

On 2/2/22 11:14, Morten Brørup wrote:
> [Commit message quoted in full; see above.]

I've just sent v3 with suggested changes to the patch.

> 
> v2 changes
> - Do not modify description of return value. This belongs in a separate
> doc fix.
> - Elaborate even more on which bugs the modifications fix.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
>   1 file changed, 54 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..2898c690b0 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>   	uint32_t index, len;
>   	void **cache_objs;
>   
> -	/* No cache provided or cannot be satisfied from cache */
> -	if (unlikely(cache == NULL || n >= cache->size))
> +	/* No cache provided or if get would overflow mem allocated for cache */
> +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))

The second condition is unnecessary until we try to fill in
cache from backend.

>   		goto ring_dequeue;
>   
> -	cache_objs = cache->objs;
> +	cache_objs = &cache->objs[cache->len];
> +
> +	if (n <= cache->len) {
> +		/* The entire request can be satisfied from the cache. */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
>   
> -	/* Can this be satisfied from the cache? */
> -	if (cache->len < n) {
> -		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = n + (cache->size - cache->len);
> +		return 0;
> +	}
>   
> -		/* How many do we require i.e. number to fill the cache + the request */
> -		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +	/* Satisfy the first part of the request by depleting the cache. */
> +	len = cache->len;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;

I dislike duplication of these lines here and above. See v3.

> +
> +	/* Number of objects remaining to satisfy the request. */
> +	len = n - len;
> +
> +	/* Fill the cache from the ring; fetch size + remaining objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> +			cache->size + len);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * We are buffer constrained, and not able to allocate
> +		 * cache + remaining.
> +		 * Do not fill the cache, just satisfy the remaining part of
> +		 * the request directly from the ring.
> +		 */
> +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);

I dislike the duplication as well. We can goto ring_dequeue
instead. See v3.

>   		if (unlikely(ret < 0)) {
>   			/*
> -			 * In the off chance that we are buffer constrained,
> -			 * where we are not able to allocate cache + n, go to
> -			 * the ring directly. If that fails, we are truly out of
> -			 * buffers.
> +			 * That also failed.
> +			 * No further action is required to roll the first
> +			 * part of the request back into the cache, as both
> +			 * cache->len and the objects in the cache are intact.
>   			 */
> -			goto ring_dequeue;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +			return ret;
>   		}
>   
> -		cache->len += req;
> +		/* Commit that the cache was emptied. */
> +		cache->len = 0;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +		return 0;
>   	}
>   
> -	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +	cache_objs = &cache->objs[cache->size + len];
>   
> -	cache->len -= n;
> +	/* Satisfy the remaining part of the request from the filled cache. */
> +	cache->len = cache->size;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
>   
>   	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
>   	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> @@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>   
>   ring_dequeue:
>   
> -	/* get remaining objects from ring */
> +	/* Get the objects from the ring. */
>   	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
>   
>   	if (ret < 0) {
  
Morten Brørup Oct. 4, 2022, 3:13 p.m. UTC | #4
@Aaron, do you have any insights or comments to my curiosity below?

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 4 October 2022 14.58
> 
> Hi Morten,
> 
> In general I agree that the fix is required.
> In sent v3 I'm trying to make it a bit better from my point of
> view. See few notes below.

I stand by my review and accept of v3 - this message is not intended to change that! I'm just curious...

I wonder how accurate the automated performance tests ([v2], [v3]) are, and if they are comparable between February and October?

[v2]: http://mails.dpdk.org/archives/test-report/2022-February/256462.html
[v3]: http://mails.dpdk.org/archives/test-report/2022-October/311526.html


Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/4

Detail performance results:
** V2 **:
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | 0.5%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -1.5%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | 4.3%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 10.9%                        |
+----------+-------------+---------+------------+------------------------------+

** V3 **:
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | -0.7%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -2.3%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | 0.5%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 7.9%                         |
+----------+-------------+---------+------------+------------------------------+
  
Andrew Rybchenko Oct. 4, 2022, 3:58 p.m. UTC | #5
On 10/4/22 18:13, Morten Brørup wrote:
> @Aaron, do you have any insights or comments to my curiosity below?
> 
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Tuesday, 4 October 2022 14.58
>>
>> Hi Morten,
>>
>> In general I agree that the fix is required.
>> In sent v3 I'm trying to make it a bit better from my point of
>> view. See few notes below.
> 
> I stand by my review and accept of v3 - this message is not intended to change that! I'm just curious...
> 
> I wonder how accurate the automated performance tests ([v2], [v3]) are, and if they are comparable between February and October?
> 
> [Links, test environment, and v2/v3 performance tables quoted; see the
> previous message.]

Very interesting, maybe it makes sense to send your patch and
mine once again to check the current figures and the stability of the results.
  
Morten Brørup Oct. 4, 2022, 4:36 p.m. UTC | #6
RESENT for test purposes.

[The v2 patch was resent unchanged to trigger a fresh test run; the commit
message and diff are identical to those shown above.]
  
Morten Brørup Oct. 4, 2022, 6:09 p.m. UTC | #7
> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 4 October 2022 17.59
> 
> On 10/4/22 18:13, Morten Brørup wrote:
> > [Earlier discussion and the v2/v3 performance tables quoted; see above.]
> 
> Very interesting, maybe it makes sense to send your patch and
> mine once again to check the current figures and the stability of the results.
> 

V2 retest:

http://mails.dpdk.org/archives/test-report/2022-October/311609.html

Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/4

Detail performance results: 
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | 0.1%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -3.0%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | -3.7%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 8.1%                         |
+----------+-------------+---------+------------+------------------------------+

Probably not very accurate.

And 8.1% is very close to the V3 7.9% - nothing to worry about, compared to the previous v2 test result of 10.9%.

-Morten
  
Aaron Conole Oct. 6, 2022, 1:43 p.m. UTC | #8
Morten Brørup <mb@smartsharesystems.com> writes:

> @Aaron, do you have any insights or comments to my curiosity below?

Sorry, the perf tests from Feb to Oct should *generally* be comparable
but keep in mind that they are based on different baseline versions of
DPDK.  Also, the perf tests are done as thresholds rather than hard
limits (because there can be some minor variations run to run, iirc).

>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Tuesday, 4 October 2022 14.58
>> 
>> Hi Morten,
>> 
>> In general I agree that the fix is required.
>> In sent v3 I'm trying to make it a bit better from my point of
>> view. See few notes below.
>
> I stand by my review and accept of v3 - this message is not intended to change that! I'm just curious...
>
> I wonder how accurate the automated performance tests ([v2], [v3])
> are, and if they are comparable between February and October?
>
> [Links, test environment, and v2/v3 performance tables quoted; see above.]
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..2898c690b0 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1463,38 +1463,71 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		return 0;
+	}
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1536,7 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {