mempool: optimize get objects with constant n

Message ID 20230411064845.37713-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series mempool: optimize get objects with constant n

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-abi-testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS

Commit Message

Morten Brørup April 11, 2023, 6:48 a.m. UTC
  When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
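For illustration, a minimal standalone sketch of the mechanism (function names are invented, not taken from the patch): when n reaches the copy loop as a build-time constant, the compiler can fully unroll it; when n is only known at run time, the loop remains a loop.

    #include <stddef.h>

    /* Hypothetical helper mirroring the cache copy: reverse copy from
     * the top of a stack of pointers; not the rte_mempool code. */
    static inline void
    stack_copy(void **dst, void **top, size_t n)
    {
            size_t i;

            for (i = 0; i < n; i++)
                    *dst++ = *--top;
    }

    void
    get_constant_n(void **dst, void **top)
    {
            stack_copy(dst, top, 8);        /* n is constant: loop unrolls */
    }

    void
    get_runtime_n(void **dst, void **top, size_t n)
    {
            stack_copy(dst, top, n);        /* n is dynamic: loop remains */
    }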
  

Comments

Bruce Richardson April 18, 2023, 11:06 a.m. UTC | #1
On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> When getting objects from the mempool, the number of objects to get is
> often constant at build time.
> 
> This patch adds another code path for this case, so the compiler can
> optimize more, e.g. unroll the copy loop when the entire request is
> satisfied from the cache.
> 
> On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> mempool_perf_test with constant n shows an increase in rate_persec by an
> average of 17 %, minimum 9.5 %, maximum 24 %.
> 
> The code path where the number of objects to get is unknown at build time
> remains essentially unchanged.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

Change looks a good idea. Some suggestions inline below, which you may want to
take on board for any future version. I'd strongly suggest adding some
extra clarifying code comments, as I suggest below.
With those extra code comments:

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

> ---
>  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 9f530db24b..ade0100ec7 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	if (unlikely(cache == NULL))
>  		goto driver_dequeue;
>  
> -	/* Use the cache as much as we have to return hot objects first */
> -	len = RTE_MIN(remaining, cache->len);
>  	cache_objs = &cache->objs[cache->len];
> +
> +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> +		/*
> +		 * The request size is known at build time, and
> +		 * the entire request can be satisfied from the cache,
> +		 * so let the compiler unroll the fixed length copy loop.
> +		 */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +

This loop looks a little awkward to me. Would it be clearer (and perhaps
easier for compilers to unroll efficiently) if it was rewritten as:

	cache->len -= n;
	cache_objs = &cache->objs[cache->len];
	for (index = 0; index < n; index++)
		obj_table[index] = cache_objs[index];

alternatively those last two lines can be replaced by a memcpy, which the
compiler should nicely optimize itself, for constant size copy:

	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> +
> +		return 0;
> +	}
> +
> +	/* Use the cache as much as we have to return hot objects first */
> +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> +			RTE_MIN(remaining, cache->len);

This line confused me a lot, until I applied the patch and stared at it a
lot :-). Why would the length value depend upon whether or not n is a
compile-time constant?
I think it would really help here to add in a comment stating that when n
is a compile-time constant, at this point it must be "> cache->len" since
the previous block was untaken, therefore we don't need to call RTE_MIN.

>  	cache->len -= len;
>  	remaining -= len;
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
>  
> -	if (remaining == 0) {
> +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
>  		/* The entire request is satisfied from the cache. */
>  
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);

Once I'd worked out the logic for the above conditional check, then this
conditional adjustment was clear. I just wonder if a further comment might
help here.

I am also wondering if having larger blocks for the constant and
non-constant cases may help. It would lead to some duplication but may
clarify the code.

> -- 
> 2.17.1
>
  
Morten Brørup April 18, 2023, 11:29 a.m. UTC | #2
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
> 
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Change looks a good idea. Some suggestions inline below, which you may want to
> take on board for any future version. I'd strongly suggest adding some
> extra clarifying code comments, as I suggest below.
> With those extra code comments:
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> 
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
> 
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.

Regardless, your point here is still valid! I expected that any unrolling capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better.

A future version could be hand coded with vector instructions here, like in the Ethdev drivers.
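For reference, the two formulations under comparison, written out as a self-contained sketch (illustrative only, not code from the patch); both write the same reverse-order result:

    #include <stddef.h>

    /* Pointer-walking form, as in the patch. */
    void
    copy_ptr_form(void **dst, void **top, ptrdiff_t n)
    {
            ptrdiff_t i;

            for (i = 0; i < n; i++)
                    *dst++ = *--top;
    }

    /* Indexed form; dst[0] still receives the top-of-stack object. */
    void
    copy_idx_form(void **dst, void **top, ptrdiff_t n)
    {
            ptrdiff_t i;

            for (i = 0; i < n; i++)
                    dst[i] = top[-1 - i];
    }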

> 
> alternatively those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself, for constant size copy:
> 
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first elements in the returned array.
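To make the ordering concrete, a toy sketch of the LIFO behavior (invented names, not the rte_mempool code): the most recently put object, presumably the hottest, comes out first.

    #include <assert.h>

    #define TOY_CAP 8

    struct toy_cache {
            void *objs[TOY_CAP];
            unsigned int len;
    };

    static void
    toy_put(struct toy_cache *c, void *obj)
    {
            c->objs[c->len++] = obj;        /* push onto the top */
    }

    static void
    toy_get(struct toy_cache *c, void **dst, unsigned int n)
    {
            void **top = &c->objs[c->len];

            c->len -= n;
            while (n-- > 0)
                    *dst++ = *--top;        /* pop: most recent first */
    }

    int
    main(void)
    {
            struct toy_cache c = { .len = 0 };
            int a, b;
            void *out[2];

            toy_put(&c, &a);
            toy_put(&c, &b);                /* &b is now the hottest */
            toy_get(&c, out, 2);
            assert(out[0] == &b && out[1] == &a);
            return 0;
    }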

> 
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
> 
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> the previous block was untaken, therefore we don't need to call RTE_MIN.

I agree. This is almost like perl... write-only source code.

I will add a few explanatory comments.

> 
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> 
> Once I'd worked out the logic for the above conditional check, then this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
> 
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (after comments have been added). So I prefer to avoid code duplication.

> 
> > --
> > 2.17.1
> >
  
Bruce Richardson April 18, 2023, 12:54 p.m. UTC | #3
On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> > 
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > 
> > Change looks a good idea. Some suggestions inline below, which you may want to
> > take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> > 
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > 
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> > void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> > 
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> > 
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> 
> Regardless, your point here is still valid! I expected that any unrolling capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better.
> 
> A future version could be hand coded with vector instructions here, like in the Ethdev drivers.
> 
> > 
> > alternatively those last two lines can be replaced by a memcpy, which the
> > compiler should nicely optimize itself, for constant size copy:
> > 
> > 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
> 
> Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first elements in the returned array.
> 

Ah, yes, I had forgotten that we want the order reversed, so most recent
objects first. It would be interesting to see if ignoring order within a
burst can lead to some performance improvements, as I expect that in most
cases, the buffers to be used in a burst are to be used pretty much
simultaneously. However, based on your feedback, feel free to ignore this
comment and keep code as-is.

> > 
> > > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > > +
> > > +		return 0;
> > > +	}
> > > +
> > > +	/* Use the cache as much as we have to return hot objects first */
> > > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > > +			RTE_MIN(remaining, cache->len);
> > 
> > This line confused me a lot, until I applied the patch and stared at it a
> > lot :-). Why would the length value depend upon whether or not n is a
> > compile-time constant?
> > I think it would really help here to add in a comment stating that when n
> > is a compile-time constant, at this point it must be "> cache->len" since
> > the previous block was untaken, therefore we don't need to call RTE_MIN.
> 
> I agree. This is almost like perl... write-only source code.
> 
> I will add a few explanatory comments.
> 

Thanks.

> > 
> > >  	cache->len -= len;
> > >  	remaining -= len;
> > >  	for (index = 0; index < len; index++)
> > >  		*obj_table++ = *--cache_objs;
> > >
> > > -	if (remaining == 0) {
> > > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> > >  		/* The entire request is satisfied from the cache. */
> > >
> > >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > 
> > Once I'd worked out the logic for the above conditional check, then this
> > conditional adjustment was clear. I just wonder if a further comment might
> > help here.
> > 
> > I am also wondering if having larger blocks for the constant and
> > non-constant cases may help. It would lead to some duplication but may
> > clarify the code.
> 
> I get your point, but the code is still short enough to grasp (after comments have been added). So I prefer to avoid code duplication.
> 

Sure.

> > 
> > > --
> > > 2.17.1
> > >
  
Bruce Richardson April 18, 2023, 12:55 p.m. UTC | #4
On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> > 
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > 
> > Change looks a good idea. Some suggestions inline below, which you may want to
> > take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> > 
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > 
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> > void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> > 
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> > 
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> 

BTW: Please add this as a comment in the code too, above the loop to avoid
future developers (or even future me) asking this question again!

/Bruce
  
Tyler Retzlaff April 18, 2023, 3:15 p.m. UTC | #5
On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> > 
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> > 
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> > 
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Change looks a good idea. Some suggestions inline below, which you may want to
> take on board for any future version. I'd strongly suggest adding some
> extra clarifying code comments, as I suggest below.
> With those extra code comments:
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >  
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {

don't take direct dependency on compiler builtins. define a macro so we
don't have to play shotgun surgery later.

also what is the purpose of using __extension__ here? are you annotating
the use of __builtin_constant_p() or is there more? because if that's
the only reason i see no need to use __extension__ when already using a
compiler specific builtin like this, that it is not standard is implied
and enforced by a compile break.
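A sketch of the kind of wrapper being asked for (the macro name is invented here; no such macro exists in rte_common.h at this point in the discussion):

    /* Hypothetical portability wrapper; non-GNU compilers get a
     * fallback that never selects the constant-size fast path. */
    #if defined(__GNUC__) || defined(__clang__)
    #define RTE_BUILD_CONSTANT_P(e) __extension__(__builtin_constant_p(e))
    #else
    #define RTE_BUILD_CONSTANT_P(e) 0
    #endif

The mempool code would then read "if (RTE_BUILD_CONSTANT_P(n) && n <= cache->len)", and the non-GNU fallback degrades gracefully to the existing runtime path.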

> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> 
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
> 
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];
> 
> alternatively those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself, for constant size copy:
> 
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
> 
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
> 
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> the previous block was untaken, therefore we don't need to call RTE_MIN.
> 
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >  
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >  
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> 
> Once I'd worked out the logic for the above conditional check, then this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
> 
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.
> 
> > -- 
> > 2.17.1
> >
  
Morten Brørup April 18, 2023, 3:30 p.m. UTC | #6
> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.15
> 
> On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get
> is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec
> by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build
> time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > Change looks a good idea. Some suggestions inline below, which you may
> want to
> > take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> >
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> rte_mempool *mp, void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> 
> don't take direct dependency on compiler builtins. define a macro so we
> don't have to play shotgun surgery later.
> 
> also what is the purpose of using __extension__ here? are you annotating
> the use of __builtin_constant_p() or is there more? because if that's
> the only reason i see no need to use __extension__ when already using a
> compiler specific builtin like this, that it is not standard is implied
> and enforced by a compile break.

ARM 32-bit memcpy() [1] does it this way, so I did the same.

[1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

While I agree that a macro for __builtin_constant_p() would be good, it belongs in a patch to fix portability, not in this patch.
  
Tyler Retzlaff April 18, 2023, 3:44 p.m. UTC | #7
On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > Sent: Tuesday, 18 April 2023 17.15
> > 
> > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > When getting objects from the mempool, the number of objects to get
> > is
> > > > often constant at build time.
> > > >
> > > > This patch adds another code path for this case, so the compiler can
> > > > optimize more, e.g. unroll the copy loop when the entire request is
> > > > satisfied from the cache.
> > > >
> > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > > mempool_perf_test with constant n shows an increase in rate_persec
> > by an
> > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > >
> > > > The code path where the number of objects to get is unknown at build
> > time
> > > > remains essentially unchanged.
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Change looks a good idea. Some suggestions inline below, which you may
> > want to
> > > take on board for any future version. I'd strongly suggest adding some
> > > extra clarifying code comments, as I suggest below.
> > > With those extra code comments:
> > >
> > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > >
> > > > ---
> > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > > index 9f530db24b..ade0100ec7 100644
> > > > --- a/lib/mempool/rte_mempool.h
> > > > +++ b/lib/mempool/rte_mempool.h
> > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > rte_mempool *mp, void **obj_table,
> > > >  	if (unlikely(cache == NULL))
> > > >  		goto driver_dequeue;
> > > >
> > > > -	/* Use the cache as much as we have to return hot objects first */
> > > > -	len = RTE_MIN(remaining, cache->len);
> > > >  	cache_objs = &cache->objs[cache->len];
> > > > +
> > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > 
> > don't take direct dependency on compiler builtins. define a macro so we
> > don't have to play shotgun surgery later.
> > 
> > also what is the purpose of using __extension__ here? are you annotating
> > the use of __builtin_constant_p() or is there more? because if that's
> > the only reason i see no need to use __extension__ when already using a
> > compiler specific builtin like this, that it is not standard is implied
> > and enforced by a compile break.
> 
> ARM 32-bit memcpy() [1] does it this way, so I did the same.
> 
> [1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

i see thanks.

> 
> While I agree that a macro for __builtin_constant_p() would be good, it belongs in a patch to fix portability, not in this patch.

i agree it isn't part of this change.

would you mind introducing it as a separate patch and depending on it or
do you feel that would delay this patch too much? i wouldn't mind doing
it myself but there is a long merge time on my patches which means i end
up having to carry the adaptations locally for weeks at a time.

thanks
  
Morten Brørup April 18, 2023, 3:50 p.m. UTC | #8
> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.45
> 
> On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > > Sent: Tuesday, 18 April 2023 17.15
> > >
> > > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > When getting objects from the mempool, the number of objects to
> get
> > > is
> > > > > often constant at build time.
> > > > >
> > > > > This patch adds another code path for this case, so the compiler
> can
> > > > > optimize more, e.g. unroll the copy loop when the entire request
> is
> > > > > satisfied from the cache.
> > > > >
> > > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc
> 9.4.0,
> > > > > mempool_perf_test with constant n shows an increase in
> rate_persec
> > > by an
> > > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > > >
> > > > > The code path where the number of objects to get is unknown at
> build
> > > time
> > > > > remains essentially unchanged.
> > > > >
> > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > >
> > > > Change looks a good idea. Some suggestions inline below, which you
> may
> > > want to
> > > > take on board for any future version. I'd strongly suggest adding
> some
> > > > extra clarifying code comments, as I suggest below.
> > > > With those extra code comments:
> > > >
> > > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > >
> > > > > ---
> > > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/lib/mempool/rte_mempool.h
> b/lib/mempool/rte_mempool.h
> > > > > index 9f530db24b..ade0100ec7 100644
> > > > > --- a/lib/mempool/rte_mempool.h
> > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > > rte_mempool *mp, void **obj_table,
> > > > >  	if (unlikely(cache == NULL))
> > > > >  		goto driver_dequeue;
> > > > >
> > > > > -	/* Use the cache as much as we have to return hot objects
> first */
> > > > > -	len = RTE_MIN(remaining, cache->len);
> > > > >  	cache_objs = &cache->objs[cache->len];
> > > > > +
> > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache-
> >len) {
> > >
> > > don't take direct dependency on compiler builtins. define a macro so
> we
> > > don't have to play shotgun surgery later.
> > >
> > > also what is the purpose of using __extension__ here? are you
> annotating
> > > the use of __builtin_constant_p() or is there more? because if
> that's
> > > the only reason i see no need to use __extension__ when already
> using a
> > > compiler specific builtin like this, that it is not standard is
> implied
> > > and enforced by a compile break.
> >
> > ARM 32-bit memcpy() [1] does it this way, so I did the same.
> >
> > [1]:
> https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_me
> mcpy_32.h#L122
> 
> i see thanks.
> 
> >
> > While I agree that a macro for __builtin_constant_p() would be good,
> it belongs in a patch to fix portability, not in this patch.
> 
> i agree it isn't part of this change.
> 
> would you mind introducing it as a separate patch and depending on it or
> do you feel that would delay this patch too much? i wouldn't mind doing
> it myself but there is a long merge time on my patches which means i end
> up having to carry the adaptations locally for weeks at a time.

I would rather not.

Introducing global macros in rte_common.h usually triggers a lot of discussion and pushback, and I don't want it to hold back this patch.
  
Tyler Retzlaff April 18, 2023, 4:01 p.m. UTC | #9
On Tue, Apr 18, 2023 at 05:50:56PM +0200, Morten Brørup wrote:
> > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > Sent: Tuesday, 18 April 2023 17.45
> > 
> > On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > > > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > > > Sent: Tuesday, 18 April 2023 17.15
> > > >
> > > > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > When getting objects from the mempool, the number of objects to
> > get
> > > > is
> > > > > > often constant at build time.
> > > > > >
> > > > > > This patch adds another code path for this case, so the compiler
> > can
> > > > > > optimize more, e.g. unroll the copy loop when the entire request
> > is
> > > > > > satisfied from the cache.
> > > > > >
> > > > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc
> > 9.4.0,
> > > > > > mempool_perf_test with constant n shows an increase in
> > rate_persec
> > > > by an
> > > > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > > > >
> > > > > > The code path where the number of objects to get is unknown at
> > build
> > > > time
> > > > > > remains essentially unchanged.
> > > > > >
> > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > >
> > > > > Change looks a good idea. Some suggestions inline below, which you
> > may
> > > > want to
> > > > > take on board for any future version. I'd strongly suggest adding
> > some
> > > > > extra clarifying code comments, as I suggest below.
> > > > > With those extra code comments:
> > > > >
> > > > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > >
> > > > > > ---
> > > > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/mempool/rte_mempool.h
> > b/lib/mempool/rte_mempool.h
> > > > > > index 9f530db24b..ade0100ec7 100644
> > > > > > --- a/lib/mempool/rte_mempool.h
> > > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > > > rte_mempool *mp, void **obj_table,
> > > > > >  	if (unlikely(cache == NULL))
> > > > > >  		goto driver_dequeue;
> > > > > >
> > > > > > -	/* Use the cache as much as we have to return hot objects
> > first */
> > > > > > -	len = RTE_MIN(remaining, cache->len);
> > > > > >  	cache_objs = &cache->objs[cache->len];
> > > > > > +
> > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache-
> > >len) {
> > > >
> > > > don't take direct dependency on compiler builtins. define a macro so
> > we
> > > > don't have to play shotgun surgery later.
> > > >
> > > > also what is the purpose of using __extension__ here? are you
> > annotating
> > > > the use of __builtin_constant_p() or is there more? because if
> > that's
> > > > the only reason i see no need to use __extension__ when already
> > using a
> > > > compiler specific builtin like this, that it is not standard is
> > implied
> > > > and enforced by a compile break.
> > >
> > > ARM 32-bit memcpy() [1] does it this way, so I did the same.
> > >
> > > [1]:
> > https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_me
> > mcpy_32.h#L122
> > 
> > i see thanks.
> > 
> > >
> > > While I agree that a macro for __builtin_constant_p() would be good,
> > it belongs in a patch to fix portability, not in this patch.
> > 
> > i agree it isn't part of this change.
> > 
> > would you mind introducing it as a separate patch and depending on it or
> > do you feel that would delay this patch too much? i wouldn't mind doing
> > it myself but there is a long merge time on my patches which means i end
> > up having to carry the adaptations locally for weeks at a time.
> 
> I would rather not.
> 
> Introducing global macros in rte_common.h usually triggers a lot of discussion and pushback, and I don't want it to hold back this patch.

yeah, no kidding. i wish the process was a bit more friendly being on
the receiving end. it's unfortunate because it is discouraging
improvements.

i'll bring a patch for it then.
  
Morten Brørup April 18, 2023, 4:05 p.m. UTC | #10
> From: Morten Brørup
> Sent: Tuesday, 18 April 2023 13.30
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> >
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:

[...]

> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> >
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> >
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in
> decrementing order. I.e. the source index decrements and the destination
> index increments.
> 
> Regardless, your point here is still valid! I expected that any
> unrolling capable compiler can unroll *dst++ = *--src; but I can
> experiment with different compilers on godbolt.org to see if dst[index]
> = src[-index] is better.

Just for the record... I have now tried experimenting with the alternative, and it makes no difference.
  
Thomas Monjalon June 7, 2023, 7:51 a.m. UTC | #11
18/04/2023 14:55, Bruce Richardson:
> On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > +		/*
> > > > +		 * The request size is known at build time, and
> > > > +		 * the entire request can be satisfied from the cache,
> > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > +		 */
> > > > +		cache->len -= n;
> > > > +		for (index = 0; index < n; index++)
> > > > +			*obj_table++ = *--cache_objs;
> > > > +
> > > 
> > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > 
> > > 	cache->len -= n;
> > > 	cache_objs = &cache->objs[cache->len];
> > > 	for (index = 0; index < n; index++)
> > > 		obj_table[index] = cache_objs[index];
> > 
> > The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> > 
> 
> BTW: Please add this as a comment in the code too, above the loop to avoid
> future developers (or even future me) asking this question again!

Looks like this request was missed.
  
Morten Brørup June 7, 2023, 8:03 a.m. UTC | #12
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 09.52
> 
> 18/04/2023 14:55, Bruce Richardson:
> > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > +		/*
> > > > > +		 * The request size is known at build time, and
> > > > > +		 * the entire request can be satisfied from the cache,
> > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > +		 */
> > > > > +		cache->len -= n;
> > > > > +		for (index = 0; index < n; index++)
> > > > > +			*obj_table++ = *--cache_objs;
> > > > > +
> > > >
> > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > >
> > > > 	cache->len -= n;
> > > > 	cache_objs = &cache->objs[cache->len];
> > > > 	for (index = 0; index < n; index++)
> > > > 		obj_table[index] = cache_objs[index];
> > >
> > > The mempool cache is a stack, so the copy loop needs to get the objects in
> > > decrementing order. I.e. the source index decrements and the destination index
> > > increments.
> > >
> >
> > BTW: Please add this as a comment in the code too, above the loop to avoid
> > future developers (or even future me) asking this question again!
> 
> Looks like this request was missed.

I intentionally omitted it, because I disagree with the suggestion.

Otherwise, reminders that the mempool cache is a stack should be plastered all over the source code, not just here. For reference, this copy loop (without such a reminder) also exists elsewhere in the same file.

Apologies for not responding to Bruce's request earlier.
  
Thomas Monjalon June 7, 2023, 8:10 a.m. UTC | #13
07/06/2023 10:03, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, 7 June 2023 09.52
> > 
> > 18/04/2023 14:55, Bruce Richardson:
> > > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > > +		/*
> > > > > > +		 * The request size is known at build time, and
> > > > > > +		 * the entire request can be satisfied from the cache,
> > > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > > +		 */
> > > > > > +		cache->len -= n;
> > > > > > +		for (index = 0; index < n; index++)
> > > > > > +			*obj_table++ = *--cache_objs;
> > > > > > +
> > > > >
> > > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > > >
> > > > > 	cache->len -= n;
> > > > > 	cache_objs = &cache->objs[cache->len];
> > > > > 	for (index = 0; index < n; index++)
> > > > > 		obj_table[index] = cache_objs[index];
> > > >
> > > > The mempool cache is a stack, so the copy loop needs to get the objects in
> > > > decrementing order. I.e. the source index decrements and the destination index
> > > > increments.
> > > >
> > >
> > > BTW: Please add this as a comment in the code too, above the loop to avoid
> > > future developers (or even future me) asking this question again!
> > 
> > Looks like this request was missed.
> 
> I intentionally omitted it, because I disagree with the suggestion.
> 
> Otherwise, reminders that the mempool cache is a stack should be plastered all over the source code, not just here. For reference, this copy loop (without such a reminder) also exists elsewhere in the same file.
> 
> Apologies for not responding to Bruce's request earlier.

What about doing a general comment at the top of the function,
with the assignment of the pointer at the end of the array:

    /* The cache is a stack, so copy will be in reverse order. */
    cache_objs = &cache->objs[cache->len];

I could do it on apply if there is an agreement.
  
Morten Brørup June 7, 2023, 8:33 a.m. UTC | #14
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 10.10
> 
> 07/06/2023 10:03, Morten Brørup:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Wednesday, 7 June 2023 09.52
> > >
> > > 18/04/2023 14:55, Bruce Richardson:
> > > > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > > > +		/*
> > > > > > > +		 * The request size is known at build time, and
> > > > > > > +		 * the entire request can be satisfied from the cache,
> > > > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > > > +		 */
> > > > > > > +		cache->len -= n;
> > > > > > > +		for (index = 0; index < n; index++)
> > > > > > > +			*obj_table++ = *--cache_objs;
> > > > > > > +
> > > > > >
> > > > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > > > >
> > > > > > 	cache->len -= n;
> > > > > > 	cache_objs = &cache->objs[cache->len];
> > > > > > 	for (index = 0; index < n; index++)
> > > > > > 		obj_table[index] = cache_objs[index];
> > > > >
> > > > > The mempool cache is a stack, so the copy loop needs to get the objects in
> > > > > decrementing order. I.e. the source index decrements and the destination
> > > > > index increments.
> > > > >
> > > >
> > > > BTW: Please add this as a comment in the code too, above the loop to avoid
> > > > future developers (or even future me) asking this question again!
> > >
> > > Looks like this request was missed.
> >
> > I intentionally omitted it, because I disagree with the suggestion.
> >
> > Otherwise, reminders that the mempool cache is a stack should be plastered
> all over the source code, not just here. For reference, this copy loop
> (without such a reminder) also exists elsewhere in the same file.
> >
> > Apologies for not responding to Bruce's request earlier.
> 
> What about doing a general comment at the top of the function,
> with the assignment of the pointer at the end of the array:
> 
>     /* The cache is a stack, so copy will be in reverse order. */
>     cache_objs = &cache->objs[cache->len];
> 
> I could do it on apply if there is an agreement.
> 
ACK from me.

For consistency, please also add the reminder to the two existing reverse order copy loops in rte_mempool_do_generic_get().
  
Morten Brørup June 7, 2023, 8:41 a.m. UTC | #15
> From: Morten Brørup
> Sent: Wednesday, 7 June 2023 10.34
> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, 7 June 2023 10.10
> >

[...]

> > What about doing a general comment at the top of the function,
> > with the assignment of the pointer at the end of the array:
> >
> >     /* The cache is a stack, so copy will be in reverse order. */
> >     cache_objs = &cache->objs[cache->len];
> >
> > I could do it on apply if there is an agreement.
> >
> ACK from me.
> 
> For consistency, please also add the reminder to the two existing reverse
> order copy loops in rte_mempool_do_generic_get().

Rubbish! I was looking at the wrong version of the code.

So, plain ACK from me. No further comments!

Sorry. :-!
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..ade0100ec7 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1500,15 +1500,33 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	if (unlikely(cache == NULL))
 		goto driver_dequeue;
 
-	/* Use the cache as much as we have to return hot objects first */
-	len = RTE_MIN(remaining, cache->len);
 	cache_objs = &cache->objs[cache->len];
+
+	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
+		/*
+		 * The request size is known at build time, and
+		 * the entire request can be satisfied from the cache,
+		 * so let the compiler unroll the fixed length copy loop.
+		 */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+		return 0;
+	}
+
+	/* Use the cache as much as we have to return hot objects first */
+	len = __extension__(__builtin_constant_p(n)) ? cache->len :
+			RTE_MIN(remaining, cache->len);
 	cache->len -= len;
 	remaining -= len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	if (remaining == 0) {
+	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
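
For context, a sketch of the kind of caller this fast path targets (illustrative only; pool setup and error handling elided, burst size chosen arbitrarily). Because BURST is a build-time constant, it propagates into the inlined rte_mempool_do_generic_get() and the new unrolled path applies:

    #include <rte_mempool.h>

    #define BURST 32        /* build-time constant request size */

    static int
    run_one_burst(struct rte_mempool *mp)
    {
            void *objs[BURST];

            /* n == BURST is constant, enabling the new code path. */
            if (rte_mempool_get_bulk(mp, objs, BURST) != 0)
                    return -1;      /* pool exhausted */

            /* ... use objs[0 .. BURST-1] ... */

            rte_mempool_put_bulk(mp, objs, BURST);
            return 0;
    }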