[v2,1/1] mempool: implement index-based per core cache

Message ID 20220113053630.886638-2-dharmik.thakkar@arm.com (mailing list archive)
State Rejected, archived
Delegated to: Thomas Monjalon
Series mempool: implement index-based per core cache

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional fail Functional Testing issues
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Dharmik Thakkar Jan. 13, 2022, 5:36 a.m. UTC
The current mempool per-core cache implementation stores pointers to mbufs.
On 64b architectures, each pointer consumes 8B.
This patch replaces it with an index-based implementation,
wherein each buffer is addressed by (pool base address + index).
This reduces the amount of memory/cache required for the per-core cache.

L3Fwd performance testing reveals minor improvements in cache
performance (L1 and L2 misses reduced by 0.60%)
with no change in throughput.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/mempool/rte_mempool.h             | 150 +++++++++++++++++++++++++-
 lib/mempool/rte_mempool_ops_default.c |   7 ++
 2 files changed, 156 insertions(+), 1 deletion(-)
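
For scale, a back-of-envelope illustration (an editorial sketch, not from the patch; it assumes DPDK's default RTE_MEMPOOL_CACHE_MAX_SIZE of 512 and the 3x slack factor used for the cache objs array):

#include <stdint.h>
#include <stdio.h>

#define RTE_MEMPOOL_CACHE_MAX_SIZE 512 /* DPDK default */

int main(void)
{
	/* The per-lcore cache array is objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]. */
	size_t ptr_cache = sizeof(void *) * RTE_MEMPOOL_CACHE_MAX_SIZE * 3;
	size_t idx_cache = sizeof(uint32_t) * RTE_MEMPOOL_CACHE_MAX_SIZE * 3;

	/* On 64-bit: 12288 B vs 6144 B, i.e. half the cache footprint. */
	printf("pointers: %zu B, indexes: %zu B\n", ptr_cache, idx_cache);
	return 0;
}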
  

Comments

Jerin Jacob Jan. 13, 2022, 10:18 a.m. UTC | #1
On Thu, Jan 13, 2022 at 11:06 AM Dharmik Thakkar
<dharmik.thakkar@arm.com> wrote:
>
> Current mempool per core cache implementation stores pointers to mbufs
> On 64b architectures, each pointer consumes 8B
> This patch replaces it with index-based implementation,
> where in each buffer is addressed by (pool base address + index)
> It reduces the amount of memory/cache required for per core cache
>
> L3Fwd performance testing reveals minor improvements in the cache
> performance (L1 and L2 misses reduced by 0.60%)
> with no change in throughput
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---

>
>         /* Now fill in the response ... */
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE

Instead of having this #ifdef clutter everywhere for the pair,
I think we can check RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE once
and have two different implementations, i.e.
#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
void x()
{

}
void y()
{

}
#else

void x()
{

}
void y()
{

}

#endif

call
x();
y();

in the main code.
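
For concreteness, a minimal sketch of that structure (hypothetical helper names, written as if inside rte_mempool.h; the flush handling is omitted for brevity):

#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
/* Index-based variant: translate pointers to 32-bit scaled offsets. */
static __rte_always_inline void
cache_put_objs(struct rte_mempool *mp, struct rte_mempool_cache *cache,
	       void * const *obj_table, unsigned int n)
{
	uint32_t *objs = (uint32_t *)cache->objs;
	unsigned int i;

	for (i = 0; i < n; i++)
		objs[cache->len + i] = (uint32_t)
			(RTE_PTR_DIFF(obj_table[i], mp->pool_base_value) >> 3);
	cache->len += n;
}
#else
/* Pointer-based variant: copy the pointers straight in. */
static __rte_always_inline void
cache_put_objs(struct rte_mempool *mp, struct rte_mempool_cache *cache,
	       void * const *obj_table, unsigned int n)
{
	RTE_SET_USED(mp);
	rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
	cache->len += n;
}
#endif

rte_mempool_do_generic_put() would then call cache_put_objs()
unconditionally, keeping the hot path free of #ifdefs.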

> diff --git a/lib/mempool/rte_mempool_ops_default.c b/lib/mempool/rte_mempool_ops_default.c
> index 22fccf9d7619..3543cad9d4ce 100644
> --- a/lib/mempool/rte_mempool_ops_default.c
> +++ b/lib/mempool/rte_mempool_ops_default.c
> @@ -127,6 +127,13 @@ rte_mempool_op_populate_helper(struct rte_mempool *mp, unsigned int flags,
>                 obj = va + off;
>                 obj_cb(mp, obj_cb_arg, obj,
>                        (iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off));
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE

This is the only place it is used in C code.
Since we are going with a compile-time approach, can we make this
unconditional? That would enable the use of this model in the
application without recompiling DPDK.
All the application needs is:

#define RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE 1
#include <rte_mempool.h>

I believe structuring it this way helps avoid recompiling DPDK.


> +               /* Store pool base value to calculate indices for index-based
> +                * lcore cache implementation
> +                */
> +               if (i == 0)
> +                       mp->pool_base_value = obj;
> +#endif
>                 rte_mempool_ops_enqueue_bulk(mp, &obj, 1);
>                 off += mp->elt_size + mp->trailer_size;
>         }
> --
> 2.17.1
>
  
Morten Brørup Jan. 20, 2022, 8:21 a.m. UTC | #2
+CC Beilei as i40e maintainer

> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> Sent: Thursday, 13 January 2022 06.37
> 
> Current mempool per core cache implementation stores pointers to mbufs
> On 64b architectures, each pointer consumes 8B
> This patch replaces it with index-based implementation,
> where in each buffer is addressed by (pool base address + index)
> It reduces the amount of memory/cache required for per core cache
> 
> L3Fwd performance testing reveals minor improvements in the cache
> performance (L1 and L2 misses reduced by 0.60%)
> with no change in throughput
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/mempool/rte_mempool.h             | 150 +++++++++++++++++++++++++-
>  lib/mempool/rte_mempool_ops_default.c |   7 ++
>  2 files changed, 156 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c15273c..f2403fbc97a7 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -50,6 +50,10 @@
>  #include <rte_memcpy.h>
>  #include <rte_common.h>
> 
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> +#include <rte_vect.h>
> +#endif
> +
>  #include "rte_mempool_trace_fp.h"
> 
>  #ifdef __cplusplus
> @@ -239,6 +243,9 @@ struct rte_mempool {
>  	int32_t ops_index;
> 
>  	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache
> */
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> +	void *pool_base_value; /**< Base value to calculate indices */
> +#endif
> 
>  	uint32_t populated_size;         /**< Number of populated
> objects. */
>  	struct rte_mempool_objhdr_list elt_list; /**< List of objects in
> pool */
> @@ -1314,7 +1321,22 @@ rte_mempool_cache_flush(struct rte_mempool_cache
> *cache,
>  	if (cache == NULL || cache->len == 0)
>  		return;
>  	rte_mempool_trace_cache_flush(cache, mp);
> +
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> +	unsigned int i;
> +	unsigned int cache_len = cache->len;
> +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
> +	void *base_value = mp->pool_base_value;
> +	uint32_t *cache_objs = (uint32_t *) cache->objs;

Hi Dharmik and Honnappa,

The essence of this patch is recasting the type of the objs field in the rte_mempool_cache structure from an array of pointers to an array of uint32_t.

However, this effectively breaks the ABI, because the rte_mempool_cache structure is public and part of the API.

Some drivers [1] even bypass the mempool API and access the rte_mempool_cache structure directly, assuming that the objs array in the cache is an array of pointers. So you cannot recast the fields in the rte_mempool_cache structure the way this patch requires.
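
To illustrate the breakage, a hedged sketch of that direct-access pattern (not the actual i40e code):

/* The PMD grabs the lcore cache and treats objs as a pointer array. */
struct rte_mempool_cache *cache =
	rte_mempool_default_cache(mp, rte_lcore_id());

/* With the index-based cache, objs holds packed 32-bit offsets
 * instead of pointers, so this read yields garbage rather than an
 * mbuf address.
 */
void *mbuf = cache->objs[cache->len - 1];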

Although I do consider bypassing an API's accessor functions "spaghetti code", this driver's behavior is formally acceptable as long as the rte_mempool_cache structure is not marked as internal.

I really liked your idea of using indexes instead of pointers, so I'm very sorry to shoot it down. :-(

[1]: E.g. the Intel i40e PMD, http://code.dpdk.org/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L25

-Morten
  
Honnappa Nagarahalli Jan. 21, 2022, 6:01 a.m. UTC | #3
> 
> +CC Beilei as i40e maintainer
> 
> > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > Sent: Thursday, 13 January 2022 06.37
> >
> > Current mempool per core cache implementation stores pointers to mbufs
> > On 64b architectures, each pointer consumes 8B This patch replaces it
> > with index-based implementation, where in each buffer is addressed by
> > (pool base address + index) It reduces the amount of memory/cache
> > required for per core cache
> >
> > L3Fwd performance testing reveals minor improvements in the cache
> > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > throughput
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  lib/mempool/rte_mempool.h             | 150 +++++++++++++++++++++++++-
> >  lib/mempool/rte_mempool_ops_default.c |   7 ++
> >  2 files changed, 156 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c15273c..f2403fbc97a7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -50,6 +50,10 @@
> >  #include <rte_memcpy.h>
> >  #include <rte_common.h>
> >
> > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > +#include <rte_vect.h>
> > +#endif
> > +
> >  #include "rte_mempool_trace_fp.h"
> >
> >  #ifdef __cplusplus
> > @@ -239,6 +243,9 @@ struct rte_mempool {
> >  	int32_t ops_index;
> >
> >  	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */
> > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > +	void *pool_base_value; /**< Base value to calculate indices */
> > +#endif
> >
> >  	uint32_t populated_size;         /**< Number of populated
> > objects. */
> >  	struct rte_mempool_objhdr_list elt_list; /**< List of objects in
> > pool */ @@ -1314,7 +1321,22 @@ rte_mempool_cache_flush(struct
> > rte_mempool_cache *cache,
> >  	if (cache == NULL || cache->len == 0)
> >  		return;
> >  	rte_mempool_trace_cache_flush(cache, mp);
> > +
> > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > +	unsigned int i;
> > +	unsigned int cache_len = cache->len;
> > +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
> > +	void *base_value = mp->pool_base_value;
> > +	uint32_t *cache_objs = (uint32_t *) cache->objs;
> 
> Hi Dharmik and Honnappa,
> 
> The essence of this patch is based on recasting the type of the objs field in the
> rte_mempool_cache structure from an array of pointers to an array of
> uint32_t.
> 
> However, this effectively breaks the ABI, because the rte_mempool_cache
> structure is public and part of the API.
The patch does not change the public structure; the new member is under a compile-time flag. I am not sure how it breaks the ABI.

> 
> Some drivers [1] even bypass the mempool API and access the
> rte_mempool_cache structure directly, assuming that the objs array in the
> cache is an array of pointers. So you cannot recast the fields in the
> rte_mempool_cache structure the way this patch requires.
IMO, those drivers are at fault. The mempool cache structure is public only because the APIs are inline. We should still maintain modularity and not use the members of structures belonging to another library directly. A similar effort involving rte_ring was not accepted sometime back [1]

[1] http://inbox.dpdk.org/dev/DBAPR08MB5814907968595EE56F5E20A798390@DBAPR08MB5814.eurprd08.prod.outlook.com/

> 
> Although I do consider bypassing an API's accessor functions "spaghetti
> code", this driver's behavior is formally acceptable as long as the
> rte_mempool_cache structure is not marked as internal.
> 
> I really liked your idea of using indexes instead of pointers, so I'm very sorry to
> shoot it down. :-(
> 
> [1]: E.g. the Intel i40e PMD,
> http://code.dpdk.org/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_avx
> 512.c#L25
It is possible to throw an error when this feature is enabled in this file. Alternatively, this PMD could implement the code for the index-based mempool.
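
A sketch of such a guard in the PMD source (an assumption of how it could look; it is not code proposed in this thread):

#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
#error "This file accesses rte_mempool_cache directly and assumes a pointer-based cache"
#endif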

> 
> -Morten
  
Morten Brørup Jan. 21, 2022, 7:36 a.m. UTC | #4
+Ray Kinsella, ABI Policy maintainer

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 21 January 2022 07.01
> 
> >
> > +CC Beilei as i40e maintainer
> >
> > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > > Sent: Thursday, 13 January 2022 06.37
> > >
> > > Current mempool per core cache implementation stores pointers to
> mbufs
> > > On 64b architectures, each pointer consumes 8B This patch replaces
> it
> > > with index-based implementation, where in each buffer is addressed
> by
> > > (pool base address + index) It reduces the amount of memory/cache
> > > required for per core cache
> > >
> > > L3Fwd performance testing reveals minor improvements in the cache
> > > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > > throughput
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  lib/mempool/rte_mempool.h             | 150
> +++++++++++++++++++++++++-
> > >  lib/mempool/rte_mempool_ops_default.c |   7 ++
> > >  2 files changed, 156 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1e7a3c15273c..f2403fbc97a7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -50,6 +50,10 @@
> > >  #include <rte_memcpy.h>
> > >  #include <rte_common.h>
> > >
> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > > +#include <rte_vect.h>
> > > +#endif
> > > +
> > >  #include "rte_mempool_trace_fp.h"
> > >
> > >  #ifdef __cplusplus
> > > @@ -239,6 +243,9 @@ struct rte_mempool {
> > >  	int32_t ops_index;
> > >
> > >  	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache
> */
> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > > +	void *pool_base_value; /**< Base value to calculate indices */
> > > +#endif
> > >
> > >  	uint32_t populated_size;         /**< Number of populated
> > > objects. */
> > >  	struct rte_mempool_objhdr_list elt_list; /**< List of objects in
> > > pool */ @@ -1314,7 +1321,22 @@ rte_mempool_cache_flush(struct
> > > rte_mempool_cache *cache,
> > >  	if (cache == NULL || cache->len == 0)
> > >  		return;
> > >  	rte_mempool_trace_cache_flush(cache, mp);
> > > +
> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> > > +	unsigned int i;
> > > +	unsigned int cache_len = cache->len;
> > > +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
> > > +	void *base_value = mp->pool_base_value;
> > > +	uint32_t *cache_objs = (uint32_t *) cache->objs;
> >
> > Hi Dharmik and Honnappa,
> >
> > The essence of this patch is based on recasting the type of the objs
> field in the
> > rte_mempool_cache structure from an array of pointers to an array of
> > uint32_t.
> >
> > However, this effectively breaks the ABI, because the
> rte_mempool_cache
> > structure is public and part of the API.
> The patch does not change the public structure, the new member is under
> compile time flag, not sure how it breaks the ABI.
> 
> >
> > Some drivers [1] even bypass the mempool API and access the
> > rte_mempool_cache structure directly, assuming that the objs array in
> the
> > cache is an array of pointers. So you cannot recast the fields in the
> > rte_mempool_cache structure the way this patch requires.
> IMO, those drivers are at fault. The mempool cache structure is public
> only because the APIs are inline. We should still maintain modularity
> and not use the members of structures belonging to another library
> directly. A similar effort involving rte_ring was not accepted sometime
> back [1]
> 
> [1]
> http://inbox.dpdk.org/dev/DBAPR08MB5814907968595EE56F5E20A798390@DBAPR0
> 8MB5814.eurprd08.prod.outlook.com/
> 
> >
> > Although I do consider bypassing an API's accessor functions
> "spaghetti
> > code", this driver's behavior is formally acceptable as long as the
> > rte_mempool_cache structure is not marked as internal.
> >
> > I really liked your idea of using indexes instead of pointers, so I'm
> very sorry to
> > shoot it down. :-(
> >
> > [1]: E.g. the Intel i40e PMD,
> >
> http://code.dpdk.org/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_
> avx
> > 512.c#L25
> It is possible to throw an error when this feature is enabled in this
> file. Alternatively, this PMD could implement the code for index based
> mempool.
> 

I agree with both your points, Honnappa.

The ABI remains intact, and only changes when this feature is enabled at compile time.

In addition to your suggestions, I propose that the patch modify the objs type in the mempool cache structure itself, instead of type-casting it through an access variable. This should throw an error when compiling an application that accesses it as a pointer array instead of a uint32_t array - like the affected Intel PMDs.

The updated objs field in the mempool cache structure should have the same size when compiled as the original objs field, so this feature doesn't change anything else in the ABI, only the type of the mempool cache objects.

Also, the description of the feature should stress that applications accessing the cache objects directly will fail miserably.
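
A rough sketch of that structural change (an assumption, not part of the patch; the array is sized so the layout stays identical on 64-bit targets):

struct rte_mempool_cache {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
	/* Twice as many 32-bit slots occupy the same bytes as the
	 * original pointer array, keeping sizeof() unchanged.
	 */
	uint32_t objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3 * 2];
#else
	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
#endif
} __rte_cache_aligned;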
  
Bruce Richardson Jan. 21, 2022, 9:12 a.m. UTC | #5
On Fri, Jan 21, 2022 at 06:01:23AM +0000, Honnappa Nagarahalli wrote:
> 
> > 
> > +CC Beilei as i40e maintainer
> > 
> > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com] Sent:
> > > Thursday, 13 January 2022 06.37
> > >
> > > Current mempool per core cache implementation stores pointers to
> > > mbufs On 64b architectures, each pointer consumes 8B This patch
> > > replaces it with index-based implementation, where in each buffer is
> > > addressed by (pool base address + index) It reduces the amount of
> > > memory/cache required for per core cache
> > >
> > > L3Fwd performance testing reveals minor improvements in the cache
> > > performance (L1 and L2 misses reduced by 0.60%) with no change in
> > > throughput
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com> Reviewed-by:
> > > Ruifeng Wang <ruifeng.wang@arm.com> --- lib/mempool/rte_mempool.h
> > > | 150 +++++++++++++++++++++++++-
> > > lib/mempool/rte_mempool_ops_default.c |   7 ++ 2 files changed, 156
> > > insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1e7a3c15273c..f2403fbc97a7 100644 ---
> > > a/lib/mempool/rte_mempool.h +++ b/lib/mempool/rte_mempool.h @@ -50,6
> > > +50,10 @@ #include <rte_memcpy.h> #include <rte_common.h>
> > >
> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE +#include <rte_vect.h>
> > > +#endif + #include "rte_mempool_trace_fp.h"
> > >
> > >  #ifdef __cplusplus @@ -239,6 +243,9 @@ struct rte_mempool { int32_t
> > >  ops_index;
> > >
> > >  	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache
> > >  	*/ +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE +	void
> > >  	*pool_base_value; /**< Base value to calculate indices */ +#endif
> > >
> > >  	uint32_t populated_size;         /**< Number of populated objects.
> > >  	*/ struct rte_mempool_objhdr_list elt_list; /**< List of objects in
> > >  	pool */ @@ -1314,7 +1321,22 @@ rte_mempool_cache_flush(struct
> > >  	rte_mempool_cache *cache, if (cache == NULL || cache->len == 0)
> > >  	return; rte_mempool_trace_cache_flush(cache, mp); + +#ifdef
> > >  	RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE +	unsigned int i; +
> > >  	unsigned int cache_len = cache->len; +	void
> > >  	*obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; +	void *base_value =
> > >  	mp->pool_base_value; +	uint32_t *cache_objs = (uint32_t *)
> > >  	cache->objs;
> > 
> > Hi Dharmik and Honnappa,
> > 
> > The essence of this patch is based on recasting the type of the objs
> > field in the rte_mempool_cache structure from an array of pointers to
> > an array of uint32_t.
> > 
> > However, this effectively breaks the ABI, because the rte_mempool_cache
> > structure is public and part of the API.
> The patch does not change the public structure, the new member is under
> compile time flag, not sure how it breaks the ABI.
> 
> > 
> > Some drivers [1] even bypass the mempool API and access the
> > rte_mempool_cache structure directly, assuming that the objs array in
> > the cache is an array of pointers. So you cannot recast the fields in
> > the rte_mempool_cache structure the way this patch requires.
> IMO, those drivers are at fault. The mempool cache structure is public
> only because the APIs are inline. We should still maintain modularity and
> not use the members of structures belonging to another library directly.
> A similar effort involving rte_ring was not accepted sometime back [1]
> 
> [1]
> http://inbox.dpdk.org/dev/DBAPR08MB5814907968595EE56F5E20A798390@DBAPR08MB5814.eurprd08.prod.outlook.com/
> 
> > 
> > Although I do consider bypassing an API's accessor functions "spaghetti
> > code", this driver's behavior is formally acceptable as long as the
> > rte_mempool_cache structure is not marked as internal.
> > 
> > I really liked your idea of using indexes instead of pointers, so I'm
> > very sorry to shoot it down. :-(
> > 
> > [1]: E.g. the Intel i40e PMD,
> > http://code.dpdk.org/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_avx
> > 512.c#L25
> It is possible to throw an error when this feature is enabled in this
> file. Alternatively, this PMD could implement the code for index based
> mempool.
>
Yes, it can implement it, and if this model gets put into mempool it probably
will [even if it's just a fallback to the mempool code in that case].

However, I would object to adding this model to the library right now if it
cannot be proved to show some benefit in a real-world case. As I understand
it, the only benefit seen has been in unit test cases? I want to ensure
that any perf improvements we put in have some real-world
applicability - the amount of applicability will depend on the scope and
impact - and, by the same token, that we don't reject simplifications or
improvements on the basis that they *might* cause issues, if all perf data
fails to show any problem.

So for this patch, can we get some perf numbers for an app where it does
show its value? L3fwd is a very trivial app, and as such is usually
fairly reliable in showing the perf benefits of optimizations if they exist.
Perhaps for this case we need something with a bigger cache footprint?

Regards,
/Bruce
  
Wang, Haiyue Jan. 23, 2022, 7:13 a.m. UTC | #6
> -----Original Message-----
> From: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Sent: Thursday, January 13, 2022 13:37
> To: Olivier Matz <olivier.matz@6wind.com>; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd@arm.com; honnappa.nagarahalli@arm.com; ruifeng.wang@arm.com; Dharmik Thakkar
> <dharmik.thakkar@arm.com>
> Subject: [PATCH v2 1/1] mempool: implement index-based per core cache
> 
> Current mempool per core cache implementation stores pointers to mbufs
> On 64b architectures, each pointer consumes 8B
> This patch replaces it with index-based implementation,
> where in each buffer is addressed by (pool base address + index)
> It reduces the amount of memory/cache required for per core cache
> 
> L3Fwd performance testing reveals minor improvements in the cache
> performance (L1 and L2 misses reduced by 0.60%)
> with no change in throughput
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/mempool/rte_mempool.h             | 150 +++++++++++++++++++++++++-
>  lib/mempool/rte_mempool_ops_default.c |   7 ++
>  2 files changed, 156 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c15273c..f2403fbc97a7 100644


> diff --git a/lib/mempool/rte_mempool_ops_default.c b/lib/mempool/rte_mempool_ops_default.c
> index 22fccf9d7619..3543cad9d4ce 100644
> --- a/lib/mempool/rte_mempool_ops_default.c
> +++ b/lib/mempool/rte_mempool_ops_default.c
> @@ -127,6 +127,13 @@ rte_mempool_op_populate_helper(struct rte_mempool *mp, unsigned int flags,
>  		obj = va + off;
>  		obj_cb(mp, obj_cb_arg, obj,
>  		       (iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off));
> +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
> +		/* Store pool base value to calculate indices for index-based
> +		 * lcore cache implementation
> +		 */
> +		if (i == 0)
> +			mp->pool_base_value = obj;

This is wrong; the populate may run many times. ;-)

I tried the below patch, running "rte_pktmbuf_pool_create(mbuf_pool_0, 1048575, 256, 0, 4096, 0)"

This is the debug message (also, your patch will make the DPDK mempool not support > 4GB):

2bfffdb40 (from last debug line 'max') - 1b3fff240 (from first line 'base addr') = 10BFFE900

****mempool mbuf_pool_0 (size = 1048575, populated_size = 46952, elt_size = 4224): base addr = 0x1b3fff240, max = 0x0, diff = 18446744066394688960 (max_objs = 1048575)
****mempool mbuf_pool_0 (size = 1048575, populated_size = 297358, elt_size = 4224): base addr = 0x1c0000040, max = 0x0, diff = 18446744066193358784 (max_objs = 1001623)
****mempool mbuf_pool_0 (size = 1048575, populated_size = 547764, elt_size = 4224): base addr = 0x200000040, max = 0x0, diff = 18446744065119616960 (max_objs = 751217)
****mempool mbuf_pool_0 (size = 1048575, populated_size = 798170, elt_size = 4224): base addr = 0x240000040, max = 0x0, diff = 18446744064045875136 (max_objs = 500811)
****mempool mbuf_pool_0 (size = 1048575, populated_size = 1048575, elt_size = 4224): base addr = 0x280000040, max = 0x2bfffdb40, diff = 1073732352 (max_objs = 250405)

diff --git a/lib/mempool/rte_mempool_ops_default.c b/lib/mempool/rte_mempool_ops_default.c
index 22fccf9d76..854067cd43 100644
--- a/lib/mempool/rte_mempool_ops_default.c
+++ b/lib/mempool/rte_mempool_ops_default.c
@@ -99,6 +99,7 @@ rte_mempool_op_populate_helper(struct rte_mempool *mp, unsigned int flags,
        unsigned int i;
        void *obj;
        int ret;
+       void *pool_base_value = NULL, *pool_max_value = NULL;

        ret = rte_mempool_get_page_size(mp, &pg_sz);
        if (ret < 0)
@@ -128,9 +129,20 @@ rte_mempool_op_populate_helper(struct rte_mempool *mp, unsigned int flags,
                obj_cb(mp, obj_cb_arg, obj,
                       (iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off));
                rte_mempool_ops_enqueue_bulk(mp, &obj, 1);
+               if (i == 0)
+                       pool_base_value = obj;
+               else if (i == (max_objs - 1))
+                       pool_max_value = obj;
                off += mp->elt_size + mp->trailer_size;
        }

+       printf("****mempool %s (size = %u, populated_size = %u, elt_size = %u): base addr = 0x%llx, max = 0x%llx, diff = %lu (max_objs = %u)\n",
+               mp->name, mp->size, mp->populated_size,
+               mp->elt_size,
+               (unsigned long long) pool_base_value,
+               (unsigned long long) pool_max_value,
+               RTE_PTR_DIFF(pool_max_value, pool_base_value), max_objs);
+
        return i;
 }


> +#endif
>  		rte_mempool_ops_enqueue_bulk(mp, &obj, 1);
>  		off += mp->elt_size + mp->trailer_size;
>  	}
> --
> 2.17.1
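
A minimal sketch of one possible repair for the overwrite (an editorial assumption, not a fix proposed in the thread; it also does not address the total-span limit shown above):

/* In rte_mempool_op_populate_helper(): keep the lowest object address
 * seen across all populate calls instead of overwriting the base on
 * every call.
 */
if (mp->pool_base_value == NULL ||
    (uintptr_t)obj < (uintptr_t)mp->pool_base_value)
	mp->pool_base_value = obj;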
  
Ray Kinsella Jan. 24, 2022, 1:05 p.m. UTC | #7
Morten Brørup <mb@smartsharesystems.com> writes:

> +Ray Kinsella, ABI Policy maintainer
>
>> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>> Sent: Friday, 21 January 2022 07.01
>> 
>> >
>> > +CC Beilei as i40e maintainer
>> >
>> > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
>> > > Sent: Thursday, 13 January 2022 06.37
>> > >
>> > > Current mempool per core cache implementation stores pointers to
>> mbufs
>> > > On 64b architectures, each pointer consumes 8B This patch replaces
>> it
>> > > with index-based implementation, where in each buffer is addressed
>> by
>> > > (pool base address + index) It reduces the amount of memory/cache
>> > > required for per core cache
>> > >
>> > > L3Fwd performance testing reveals minor improvements in the cache
>> > > performance (L1 and L2 misses reduced by 0.60%) with no change in
>> > > throughput
>> > >
>> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>> > > Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
>> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> > > ---
>> > >  lib/mempool/rte_mempool.h             | 150
>> +++++++++++++++++++++++++-
>> > >  lib/mempool/rte_mempool_ops_default.c |   7 ++
>> > >  2 files changed, 156 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
>> > > index 1e7a3c15273c..f2403fbc97a7 100644
>> > > --- a/lib/mempool/rte_mempool.h
>> > > +++ b/lib/mempool/rte_mempool.h
>> > > @@ -50,6 +50,10 @@
>> > >  #include <rte_memcpy.h>
>> > >  #include <rte_common.h>
>> > >
>> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
>> > > +#include <rte_vect.h>
>> > > +#endif
>> > > +
>> > >  #include "rte_mempool_trace_fp.h"
>> > >
>> > >  #ifdef __cplusplus
>> > > @@ -239,6 +243,9 @@ struct rte_mempool {
>> > >  	int32_t ops_index;
>> > >
>> > >  	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache
>> */
>> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
>> > > +	void *pool_base_value; /**< Base value to calculate indices */
>> > > +#endif
>> > >
>> > >  	uint32_t populated_size;         /**< Number of populated
>> > > objects. */
>> > >  	struct rte_mempool_objhdr_list elt_list; /**< List of objects in
>> > > pool */ @@ -1314,7 +1321,22 @@ rte_mempool_cache_flush(struct
>> > > rte_mempool_cache *cache,
>> > >  	if (cache == NULL || cache->len == 0)
>> > >  		return;
>> > >  	rte_mempool_trace_cache_flush(cache, mp);
>> > > +
>> > > +#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
>> > > +	unsigned int i;
>> > > +	unsigned int cache_len = cache->len;
>> > > +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
>> > > +	void *base_value = mp->pool_base_value;
>> > > +	uint32_t *cache_objs = (uint32_t *) cache->objs;
>> >
>> > Hi Dharmik and Honnappa,
>> >
>> > The essence of this patch is based on recasting the type of the objs
>> field in the
>> > rte_mempool_cache structure from an array of pointers to an array of
>> > uint32_t.
>> >
>> > However, this effectively breaks the ABI, because the
>> rte_mempool_cache
>> > structure is public and part of the API.
>> The patch does not change the public structure, the new member is under
>> compile time flag, not sure how it breaks the ABI.
>> 
>> >
>> > Some drivers [1] even bypass the mempool API and access the
>> > rte_mempool_cache structure directly, assuming that the objs array in
>> the
>> > cache is an array of pointers. So you cannot recast the fields in the
>> > rte_mempool_cache structure the way this patch requires.
>> IMO, those drivers are at fault. The mempool cache structure is public
>> only because the APIs are inline. We should still maintain modularity
>> and not use the members of structures belonging to another library
>> directly. A similar effort involving rte_ring was not accepted sometime
>> back [1]
>> 
>> [1]
>> http://inbox.dpdk.org/dev/DBAPR08MB5814907968595EE56F5E20A798390@DBAPR0
>> 8MB5814.eurprd08.prod.outlook.com/
>> 
>> >
>> > Although I do consider bypassing an API's accessor functions
>> "spaghetti
>> > code", this driver's behavior is formally acceptable as long as the
>> > rte_mempool_cache structure is not marked as internal.
>> >
>> > I really liked your idea of using indexes instead of pointers, so I'm
>> very sorry to
>> > shoot it down. :-(
>> >
>> > [1]: E.g. the Intel i40e PMD,
>> >
>> http://code.dpdk.org/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_
>> avx
>> > 512.c#L25
>> It is possible to throw an error when this feature is enabled in this
>> file. Alternatively, this PMD could implement the code for index based
>> mempool.
>> 
>
> I agree with both your points, Honnappa.
>
> The ABI remains intact, and only changes when this feature is enabled at compile time.
>
> In addition to your suggestions, I propose that the patch modifies the objs type in the mempool cache structure itself, instead of type casting it through an access variable. This should throw an error when compiling an application that accesses it as a pointer array instead of a uint32_t array - like the affected Intel PMDs.
>
> The updated objs field in the mempool cache structure should have the same size when compiled as the original objs field, so this feature doesn't change anything else in the ABI, only the type of the mempool cache objects.
>
> Also, the description of the feature should stress that applications accessing the cache objects directly will fail miserably.

Thanks for CC'ing me Morten.

My 2c is that I would be slow in supporting this patch, as it introduces
code paths that are harder (impossible?) to test regularly. So yes, it
is optional, but in that case are we just adding automatically dead code?
I would ask: would a runtime option not make more sense for this?

Also, we can't automatically assume that what the PMDs are doing is breaking
an unwritten rule (breaking abstractions) - I would guess they are
doing it for solid performance reasons. If so, that would further support
my point about making the mempool runtime-configurable and query-able
(is this mempool a bucket of indexes or pointers, etc.), and enabling the
PMDs to ask rather than assume.
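
Such a query could look roughly like this (purely hypothetical names; nothing of the sort exists in DPDK today):

enum rte_mempool_cache_mode {
	RTE_MEMPOOL_CACHE_MODE_PTR,	/* objs[] holds pointers */
	RTE_MEMPOOL_CACHE_MODE_INDEX,	/* objs[] holds 32-bit indexes */
};

/* A PMD would ask before touching cache->objs directly ... */
if (rte_mempool_cache_mode(mp) != RTE_MEMPOOL_CACHE_MODE_PTR) {
	/* ... and fall back to the generic rte_mempool_get_bulk() path. */
}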

Like Morten, I like the idea: saving memory and reducing cache misses
with indexes is all good IMHO.
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c15273c..f2403fbc97a7 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -50,6 +50,10 @@ 
 #include <rte_memcpy.h>
 #include <rte_common.h>
 
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+#include <rte_vect.h>
+#endif
+
 #include "rte_mempool_trace_fp.h"
 
 #ifdef __cplusplus
@@ -239,6 +243,9 @@  struct rte_mempool {
 	int32_t ops_index;
 
 	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	void *pool_base_value; /**< Base value to calculate indices */
+#endif
 
 	uint32_t populated_size;         /**< Number of populated objects. */
 	struct rte_mempool_objhdr_list elt_list; /**< List of objects in pool */
@@ -1314,7 +1321,22 @@  rte_mempool_cache_flush(struct rte_mempool_cache *cache,
 	if (cache == NULL || cache->len == 0)
 		return;
 	rte_mempool_trace_cache_flush(cache, mp);
+
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	unsigned int i;
+	unsigned int cache_len = cache->len;
+	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
+	void *base_value = mp->pool_base_value;
+	uint32_t *cache_objs = (uint32_t *) cache->objs;
+	for (i = 0; i < cache_len; i++) {
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		cache_objs[i] = cache_objs[i] << 3;
+		obj_table[i] = (void *) RTE_PTR_ADD(base_value, cache_objs[i]);
+	}
+	rte_mempool_ops_enqueue_bulk(mp, obj_table, cache->len);
+#else
 	rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
+#endif
 	cache->len = 0;
 }
 
@@ -1334,7 +1356,14 @@  static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	uint32_t *cache_objs;
+	void *base_value;
+	uint32_t i;
+	uint32_t temp_objs[2];
+#else
 	void **cache_objs;
+#endif
 
 	/* increment stat now, adding in mempool always success */
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
@@ -1344,7 +1373,13 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	cache_objs = (uint32_t *) cache->objs;
+	cache_objs = &cache_objs[cache->len];
+	base_value = mp->pool_base_value;
+#else
 	cache_objs = &cache->objs[cache->len];
+#endif
 
 	/*
 	 * The cache follows the following algorithm
@@ -1354,13 +1389,50 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 */
 
 	/* Add elements back into the cache */
+
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+#if defined __ARM_NEON
+	uint64x2_t v_obj_table;
+	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
+	uint32x2_t v_cache_objs;
+
+	for (i = 0; i < (n & ~0x1); i += 2) {
+		v_obj_table = vld1q_u64((const uint64_t *)&obj_table[i]);
+		v_cache_objs = vqmovn_u64(vsubq_u64(v_obj_table, v_base_value));
+
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		v_cache_objs = vshr_n_u32(v_cache_objs, 3);
+		vst1_u32(cache_objs + i, v_cache_objs);
+	}
+#else
+	for (i = 0; i < (n & ~0x1); i += 2) {
+		temp_objs[0] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
+		temp_objs[1] = (uint32_t) RTE_PTR_DIFF(obj_table[i + 1], base_value);
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		cache_objs[i] = temp_objs[0] >> 3;
+		cache_objs[i + 1] = temp_objs[1] >> 3;
+	}
+#endif
+	if (n & 0x1) {
+		temp_objs[0] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
+
+		/* Divide by sizeof(uintptr_t) to accommodate 16G/32G mempool */
+		cache_objs[i] = temp_objs[0] >> 3;
+	}
+#else
 	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+#endif
 
 	cache->len += n;
 
 	if (cache->len >= cache->flushthresh) {
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+		rte_mempool_ops_enqueue_bulk(mp, obj_table + cache->len - cache->size,
+				cache->len - cache->size);
+#else
 		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
 				cache->len - cache->size);
+#endif
 		cache->len = cache->size;
 	}
 
@@ -1461,13 +1533,23 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 {
 	int ret;
 	uint32_t index, len;
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	uint32_t i;
+	uint32_t *cache_objs;
+	uint32_t objs[2];
+#else
 	void **cache_objs;
-
+#endif
 	/* No cache provided or cannot be satisfied from cache */
 	if (unlikely(cache == NULL || n >= cache->size))
 		goto ring_dequeue;
 
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+	void *base_value = mp->pool_base_value;
+	cache_objs = (uint32_t *) cache->objs;
+#else
 	cache_objs = cache->objs;
+#endif
 
 	/* Can this be satisfied from the cache? */
 	if (cache->len < n) {
@@ -1475,8 +1557,14 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		uint32_t req = n + (cache->size - cache->len);
 
 		/* How many do we require i.e. number to fill the cache + the request */
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+		void *temp_objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+		ret = rte_mempool_ops_dequeue_bulk(mp,
+			temp_objs, req);
+#else
 		ret = rte_mempool_ops_dequeue_bulk(mp,
 			&cache->objs[cache->len], req);
+#endif
 		if (unlikely(ret < 0)) {
 			/*
 			 * In the off chance that we are buffer constrained,
@@ -1487,12 +1575,72 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			goto ring_dequeue;
 		}
 
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+		len = cache->len;
+		for (i = 0; i < req; ++i, ++len) {
+			cache_objs[len] = (uint32_t) RTE_PTR_DIFF(temp_objs[i],
+								base_value);
+			/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+			cache_objs[len] = cache_objs[len] >> 3;
+		}
+#endif
+
 		cache->len += req;
 	}
 
 	/* Now fill in the response ... */
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+#if defined __ARM_NEON
+	uint64x2_t v_obj_table;
+	uint64x2_t v_cache_objs;
+	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
+
+	for (index = 0, len = cache->len - 1; index < (n & ~0x3); index += 4,
+						len -= 4, obj_table += 4) {
+		v_cache_objs = vmovl_u32(vld1_u32(cache_objs + len - 1));
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		v_cache_objs = vshlq_n_u64(v_cache_objs, 3);
+		v_obj_table = vaddq_u64(v_cache_objs, v_base_value);
+		vst1q_u64((uint64_t *)obj_table, v_obj_table);
+		v_cache_objs = vmovl_u32(vld1_u32(cache_objs + len - 3));
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		v_cache_objs = vshlq_n_u64(v_cache_objs, 3);
+		v_obj_table = vaddq_u64(v_cache_objs, v_base_value);
+		vst1q_u64((uint64_t *)(obj_table + 2), v_obj_table);
+	}
+	switch (n & 0x3) {
+	case 3:
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		objs[0] = cache_objs[len--] << 3;
+		*(obj_table++) = (void *) RTE_PTR_ADD(base_value, objs[0]); /* fallthrough */
+	case 2:
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		objs[0] = cache_objs[len--] << 3;
+		*(obj_table++) = (void *) RTE_PTR_ADD(base_value, objs[0]); /* fallthrough */
+	case 1:
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		objs[0] = cache_objs[len] << 3;
+		*(obj_table) = (void *) RTE_PTR_ADD(base_value, objs[0]);
+	}
+#else
+	for (index = 0, len = cache->len - 1; index < (n & ~0x1); index += 2,
+						len -= 2, obj_table += 2) {
+		/* Scale by sizeof(uintptr_t) to accommodate 16GB/32GB mempool */
+		objs[0] = cache_objs[len] << 3;
+		objs[1] = cache_objs[len - 1] << 3;
+		*obj_table = (void *) RTE_PTR_ADD(base_value, objs[0]);
+		*(obj_table + 1) = (void *) RTE_PTR_ADD(base_value, objs[1]);
+	}
+
+	if (n & 0x1) {
+		objs[0] = cache_objs[len] << 3;
+		*obj_table = (void *) RTE_PTR_ADD(base_value, objs[0]);
+	}
+#endif
+#else
 	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
 		*obj_table = cache_objs[len];
+#endif
 
 	cache->len -= n;
 
diff --git a/lib/mempool/rte_mempool_ops_default.c b/lib/mempool/rte_mempool_ops_default.c
index 22fccf9d7619..3543cad9d4ce 100644
--- a/lib/mempool/rte_mempool_ops_default.c
+++ b/lib/mempool/rte_mempool_ops_default.c
@@ -127,6 +127,13 @@  rte_mempool_op_populate_helper(struct rte_mempool *mp, unsigned int flags,
 		obj = va + off;
 		obj_cb(mp, obj_cb_arg, obj,
 		       (iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off));
+#ifdef RTE_MEMPOOL_INDEX_BASED_LCORE_CACHE
+		/* Store pool base value to calculate indices for index-based
+		 * lcore cache implementation
+		 */
+		if (i == 0)
+			mp->pool_base_value = obj;
+#endif
 		rte_mempool_ops_enqueue_bulk(mp, &obj, 1);
 		off += mp->elt_size + mp->trailer_size;
 	}
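
The encoding at the heart of the patch, as a standalone illustration (an editorial sketch; the base address is taken from Haiyue's debug output above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t base = 0x1b3fff240ULL;     /* pool_base_value */
	uint64_t obj  = base + 4224 * 100;  /* 100th element of 4224 B */

	/* Put path: store the offset pre-divided by 8, so a uint32_t
	 * index can span up to 32 GB of 8-byte-aligned objects.
	 */
	uint32_t idx = (uint32_t)((obj - base) >> 3);

	/* Get path: rebuild the pointer from base + (index << 3). */
	uint64_t back = base + ((uint64_t)idx << 3);

	printf("idx=%u, round-trip ok=%d\n", idx, back == obj);
	return 0;
}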