[dpdk-dev,v2,1/3] mempool: add stack (lifo) mempool handler

Message ID 1463669335-30378-2-git-send-email-david.hunt@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon

Commit Message

Hunt, David May 19, 2016, 2:48 p.m. UTC
This is a mempool handler that is useful for pipelining apps, where
the mempool cache doesn't really work: for example, where we have one
core doing Rx (and alloc), and another core doing Tx (and return). In
such a case, the mempool ring simply cycles through all the mbufs,
resulting in an LLC miss on every mbuf allocated when the number of
mbufs is large. A stack recycles buffers more effectively in this
case.

v2: cleanup based on mailing list comments. Mainly removal of
unnecessary casts and comments.

Signed-off-by: David Hunt <david.hunt@intel.com>
---
 lib/librte_mempool/Makefile            |   1 +
 lib/librte_mempool/rte_mempool_stack.c | 145 +++++++++++++++++++++++++++++++++
 2 files changed, 146 insertions(+)
 create mode 100644 lib/librte_mempool/rte_mempool_stack.c
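
To illustrate the buffer-recycling argument in the commit message, here
is a small standalone simulation in plain C (no DPDK calls). It is only
a sketch under assumed parameters: POOL_SIZE, BURST and the function
names are hypothetical and not part of the patch. It counts how many
distinct buffers get touched when one side allocates a burst and the
other side returns it, first with a FIFO free list (ring-like) and then
with a LIFO free list (stack-like).

```c
#include <assert.h>
#include <stdbool.h>

enum { POOL_SIZE = 1024, BURST = 32 };	/* assumed values, not from the patch */

/* Number of buffer ids that were handed out at least once. */
static unsigned
count_distinct(const bool *seen)
{
	unsigned i, n = 0;

	for (i = 0; i < POOL_SIZE; i++)
		n += seen[i];
	return n;
}

/* FIFO free list: dequeue at head, enqueue at tail, like a ring. */
unsigned
fifo_distinct_buffers(unsigned rounds)
{
	unsigned ring[POOL_SIZE], burst[BURST];
	bool seen[POOL_SIZE] = { false };
	unsigned head = 0, tail = 0, r, i;

	for (i = 0; i < POOL_SIZE; i++)
		ring[i] = i;		/* ring starts full */

	for (r = 0; r < rounds; r++) {
		for (i = 0; i < BURST; i++) {	/* "Rx core" allocates */
			burst[i] = ring[head];
			head = (head + 1) % POOL_SIZE;
			seen[burst[i]] = true;
		}
		for (i = 0; i < BURST; i++) {	/* "Tx core" returns */
			ring[tail] = burst[i];
			tail = (tail + 1) % POOL_SIZE;
		}
	}
	return count_distinct(seen);
}

/* LIFO free list: pop and push at the top, like the proposed stack. */
unsigned
lifo_distinct_buffers(unsigned rounds)
{
	unsigned stack[POOL_SIZE], burst[BURST];
	bool seen[POOL_SIZE] = { false };
	unsigned len = POOL_SIZE, r, i;

	for (i = 0; i < POOL_SIZE; i++)
		stack[i] = i;		/* stack starts full */

	for (r = 0; r < rounds; r++) {
		assert(len >= BURST);
		for (i = 0; i < BURST; i++) {	/* "Rx core" allocates */
			burst[i] = stack[--len];
			seen[burst[i]] = true;
		}
		for (i = 0; i < BURST; i++)	/* "Tx core" returns */
			stack[len++] = burst[i];
	}
	return count_distinct(seen);
}
```

With these assumed values, fifo_distinct_buffers(100) touches all 1024
buffers, while lifo_distinct_buffers(100) keeps reusing the same 32. The
simulation does not model CPU caches or timing; it only shows the
difference in working set between the two orderings.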
  

Comments

Olivier Matz May 23, 2016, 12:55 p.m. UTC | #1
Hi David,

Please find some comments below.

On 05/19/2016 04:48 PM, David Hunt wrote:
> This is a mempool handler that is useful for pipelining apps, where
> the mempool cache doesn't really work: for example, where we have one
> core doing Rx (and alloc), and another core doing Tx (and return). In
> such a case, the mempool ring simply cycles through all the mbufs,
> resulting in an LLC miss on every mbuf allocated when the number of
> mbufs is large. A stack recycles buffers more effectively in this
> case.
> 
> v2: cleanup based on mailing list comments. Mainly removal of
> unnecessary casts and comments.
> 
> Signed-off-by: David Hunt <david.hunt@intel.com>
> ---
>  lib/librte_mempool/Makefile            |   1 +
>  lib/librte_mempool/rte_mempool_stack.c | 145 +++++++++++++++++++++++++++++++++
>  2 files changed, 146 insertions(+)
>  create mode 100644 lib/librte_mempool/rte_mempool_stack.c
> 
> diff --git a/lib/librte_mempool/Makefile b/lib/librte_mempool/Makefile
> index f19366e..5aa9ef8 100644
> --- a/lib/librte_mempool/Makefile
> +++ b/lib/librte_mempool/Makefile
> @@ -44,6 +44,7 @@ LIBABIVER := 2
>  SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool.c
>  SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_handler.c
>  SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_default.c
> +SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_stack.c
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_MEMPOOL)-include := rte_mempool.h
>  
> diff --git a/lib/librte_mempool/rte_mempool_stack.c b/lib/librte_mempool/rte_mempool_stack.c
> new file mode 100644
> index 0000000..6e25028
> --- /dev/null
> +++ b/lib/librte_mempool/rte_mempool_stack.c
> @@ -0,0 +1,145 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.

Should be 2016?

> ...
> +
> +static void *
> +common_stack_alloc(struct rte_mempool *mp)
> +{
> +	struct rte_mempool_common_stack *s;
> +	unsigned n = mp->size;
> +	int size = sizeof(*s) + (n+16)*sizeof(void *);
> +
> +	/* Allocate our local memory structure */
> +	s = rte_zmalloc_socket("common-stack",

"mempool-stack" ?

> +			size,
> +			RTE_CACHE_LINE_SIZE,
> +			mp->socket_id);
> +	if (s == NULL) {
> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate stack!\n");
> +		return NULL;
> +	}
> +
> +	rte_spinlock_init(&s->sl);
> +
> +	s->size = n;
> +	mp->pool = s;
> +	rte_mempool_set_handler(mp, "stack");

rte_mempool_set_handler() is a user function; it should not be called here.

> +
> +	return s;
> +}
> +
> +static int common_stack_put(void *p, void * const *obj_table,
> +		unsigned n)
> +{
> +	struct rte_mempool_common_stack *s = p;
> +	void **cache_objs;
> +	unsigned index;
> +
> +	rte_spinlock_lock(&s->sl);
> +	cache_objs = &s->objs[s->len];
> +
> +	/* Is there sufficient space in the stack ? */
> +	if ((s->len + n) > s->size) {
> +		rte_spinlock_unlock(&s->sl);
> +		return -ENOENT;
> +	}

The usual return value for a failing put() is ENOBUFS (see in rte_ring).


After reading it, I realize that it's nearly exactly the same code as
in "app/test: test external mempool handler".
http://patchwork.dpdk.org/dev/patchwork/patch/12896/

We should drop one of them. If this stack handler is really useful for
a performance use-case, it could go in librte_mempool. At the first
read, the code looks like a demo example : it uses a simple spinlock for
concurrent accesses to the common pool. Maybe the mempool cache hides
this cost, in this case we could also consider removing the use of the
rte_ring.

Do you have some performance numbers? Do you know if it scales
with the number of cores?

If we can identify the conditions where this mempool handler
outperforms the default handler, it would be valuable to have them
in the documentation.


Regards,
Olivier
  
Hunt, David June 15, 2016, 10:10 a.m. UTC | #2
Hi Olivier,

On 23/5/2016 1:55 PM, Olivier Matz wrote:
> Hi David,
>
> Please find some comments below.
>
> On 05/19/2016 04:48 PM, David Hunt wrote:
>> This is a mempool handler that is useful for pipelining apps, where
>> the mempool cache doesn't really work: for example, where we have one
>> core doing Rx (and alloc), and another core doing Tx (and return). In
>> such a case, the mempool ring simply cycles through all the mbufs,
>> resulting in an LLC miss on every mbuf allocated when the number of
>> mbufs is large. A stack recycles buffers more effectively in this
>> case.
>>
>> v2: cleanup based on mailing list comments. Mainly removal of
>> unnecessary casts and comments.
>>
>> Signed-off-by: David Hunt <david.hunt@intel.com>
>> ---
>>   lib/librte_mempool/Makefile            |   1 +
>>   lib/librte_mempool/rte_mempool_stack.c | 145 +++++++++++++++++++++++++++++++++
>>   2 files changed, 146 insertions(+)
>>   create mode 100644 lib/librte_mempool/rte_mempool_stack.c
>>
>> diff --git a/lib/librte_mempool/Makefile b/lib/librte_mempool/Makefile
>> index f19366e..5aa9ef8 100644
>> --- a/lib/librte_mempool/Makefile
>> +++ b/lib/librte_mempool/Makefile
>> @@ -44,6 +44,7 @@ LIBABIVER := 2
>>   SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool.c
>>   SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_handler.c
>>   SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_default.c
>> +SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_stack.c
>>   # install includes
>>   SYMLINK-$(CONFIG_RTE_LIBRTE_MEMPOOL)-include := rte_mempool.h
>>   
>> diff --git a/lib/librte_mempool/rte_mempool_stack.c b/lib/librte_mempool/rte_mempool_stack.c
>> new file mode 100644
>> index 0000000..6e25028
>> --- /dev/null
>> +++ b/lib/librte_mempool/rte_mempool_stack.c
>> @@ -0,0 +1,145 @@
>> +/*-
>> + *   BSD LICENSE
>> + *
>> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
>> + *   All rights reserved.
> Should be 2016?

Yup, will change.

>> ...
>> +
>> +static void *
>> +common_stack_alloc(struct rte_mempool *mp)
>> +{
>> +	struct rte_mempool_common_stack *s;
>> +	unsigned n = mp->size;
>> +	int size = sizeof(*s) + (n+16)*sizeof(void *);
>> +
>> +	/* Allocate our local memory structure */
>> +	s = rte_zmalloc_socket("common-stack",
> "mempool-stack" ?

Yes. Also, I think the names of the functions should be changed from
common_stack_x to simply stack_x. The "common_" does not add anything.

>> +			size,
>> +			RTE_CACHE_LINE_SIZE,
>> +			mp->socket_id);
>> +	if (s == NULL) {
>> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate stack!\n");
>> +		return NULL;
>> +	}
>> +
>> +	rte_spinlock_init(&s->sl);
>> +
>> +	s->size = n;
>> +	mp->pool = s;
>> +	rte_mempool_set_handler(mp, "stack");
> rte_mempool_set_handler() is a user function; it should not be called here.

Sure, removed.

>> +
>> +	return s;
>> +}
>> +
>> +static int common_stack_put(void *p, void * const *obj_table,
>> +		unsigned n)
>> +{
>> +	struct rte_mempool_common_stack *s = p;
>> +	void **cache_objs;
>> +	unsigned index;
>> +
>> +	rte_spinlock_lock(&s->sl);
>> +	cache_objs = &s->objs[s->len];
>> +
>> +	/* Is there sufficient space in the stack ? */
>> +	if ((s->len + n) > s->size) {
>> +		rte_spinlock_unlock(&s->sl);
>> +		return -ENOENT;
>> +	}
> The usual return value for a failing put() is ENOBUFS (see in rte_ring).

Done.

> After reading it, I realize that it's nearly exactly the same code as
> in "app/test: test external mempool handler".
> http://patchwork.dpdk.org/dev/patchwork/patch/12896/
>
> We should drop one of them. If this stack handler is really useful for
> a performance use-case, it could go in librte_mempool. At the first
> read, the code looks like a demo example : it uses a simple spinlock for
> concurrent accesses to the common pool. Maybe the mempool cache hides
> this cost, in this case we could also consider removing the use of the
> rte_ring.

Unlike the code in the test app, the stack handler does not use a ring.
This is for the case where applications do a lot of core-to-core
transfers of mbufs. The test app was simply to demonstrate a simple
example of a malloc mempool handler. This patch adds a new lifo handler
for general use.

Using the mempool_perf_autotest, I see a 30% increase in throughput when
local cache is enabled/used.  However, there is up to a 50% degradation
when local cache is NOT used, so it's not usable in all situations.
However, with a 30% gain for the cache use-case, I think it's worth
having in there as an option for people to try if the use-case suits.


> Do you have some performance numbers? Do you know if it scales
> with the number of cores?

30% gain when local cache is used. And these numbers scale up with the
number of cores on my test machine. It may be better for other use cases.

> If we can identify the conditions where this mempool handler
> outperforms the default handler, it would be valuable to have them
> in the documentation.
>

I could certainly add this to the docs, and mention the recommendation to
use local cache.

Regards,
Dave.
  
Hunt, David June 17, 2016, 2:18 p.m. UTC | #3
Hi Olivier,

On 23/5/2016 1:55 PM, Olivier Matz wrote:
> Hi David,
>
> Please find some comments below.
>
> On 05/19/2016 04:48 PM, David Hunt wrote:
>> [...]
>> +++ b/lib/librte_mempool/rte_mempool_stack.c
>> @@ -0,0 +1,145 @@
>> +/*-
>> + *   BSD LICENSE
>> + *
>> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
>> + *   All rights reserved.
> Should be 2016?

Yes, fixed.

>> ...
>> +
>> +static void *
>> +common_stack_alloc(struct rte_mempool *mp)
>> +{
>> +	struct rte_mempool_common_stack *s;
>> +	unsigned n = mp->size;
>> +	int size = sizeof(*s) + (n+16)*sizeof(void *);
>> +
>> +	/* Allocate our local memory structure */
>> +	s = rte_zmalloc_socket("common-stack",
> "mempool-stack" ?

Done

>> +			size,
>> +			RTE_CACHE_LINE_SIZE,
>> +			mp->socket_id);
>> +	if (s == NULL) {
>> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate stack!\n");
>> +		return NULL;
>> +	}
>> +
>> +	rte_spinlock_init(&s->sl);
>> +
>> +	s->size = n;
>> +	mp->pool = s;
>> +	rte_mempool_set_handler(mp, "stack");
> rte_mempool_set_handler() is a user function; it should not be called here.

Removed.

>> +
>> +	return s;
>> +}
>> +
>> +static int common_stack_put(void *p, void * const *obj_table,
>> +		unsigned n)
>> +{
>> +	struct rte_mempool_common_stack *s = p;
>> +	void **cache_objs;
>> +	unsigned index;
>> +
>> +	rte_spinlock_lock(&s->sl);
>> +	cache_objs = &s->objs[s->len];
>> +
>> +	/* Is there sufficient space in the stack ? */
>> +	if ((s->len + n) > s->size) {
>> +		rte_spinlock_unlock(&s->sl);
>> +		return -ENOENT;
>> +	}
> The usual return value for a failing put() is ENOBUFS (see in rte_ring).

Done.

>
> After reading it, I realize that it's nearly exactly the same code as
> in "app/test: test external mempool handler".
> http://patchwork.dpdk.org/dev/patchwork/patch/12896/
>
> We should drop one of them. If this stack handler is really useful for
> a performance use-case, it could go in librte_mempool. At the first
> read, the code looks like a demo example : it uses a simple spinlock for
> concurrent accesses to the common pool. Maybe the mempool cache hides
> this cost, in this case we could also consider removing the use of the
> rte_ring.

While I agree that the code is similar, the handler in the test is a
ring based handler, whereas this patch adds an array based handler.

I think that the case for leaving it in as a test for the standard
handler as part of the previous mempool handler is valid, but maybe
there is a case for removing it if we add the stack handler. Maybe a
future patch?

> Do you have some performance numbers? Do you know if it scales
> with the number of cores?

For the mempool_perf_autotest, I'm seeing a 30% increase in performance
for the local cache use-case for 1 - 36 cores (results vary within those
tests between 10-45% gain, but with an average of 30% gain over all the
tests).

However, for the tests with no local cache configured, throughput of the
enqueue/dequeue drops by about 30%, with the 36 core yielding the
largest drop of 40%. So this handler would not be recommended in
no-cache applications.

> If we can identify the conditions where this mempool handler
> outperforms the default handler, it would be valuable to have them
> in the documentation.
>


Regards,
Dave.
  
Olivier Matz June 20, 2016, 8:17 a.m. UTC | #4
Hi David,

On 06/17/2016 04:18 PM, Hunt, David wrote:
>> After reading it, I realize that it's nearly exactly the same code as
>> in "app/test: test external mempool handler".
>> http://patchwork.dpdk.org/dev/patchwork/patch/12896/
>>
>> We should drop one of them. If this stack handler is really useful for
>> a performance use-case, it could go in librte_mempool. At the first
>> read, the code looks like a demo example : it uses a simple spinlock for
>> concurrent accesses to the common pool. Maybe the mempool cache hides
>> this cost, in this case we could also consider removing the use of the
>> rte_ring.
> 
> While I agree that the code is similar, the handler in the test is a
> ring based handler,
> whereas this patch adds an array based handler.

Not sure I'm getting what you are saying. Do you mean stack instead
of array?

Actually, both are stacks when talking about bulks of objects. If we
consider each object one by one, it's true that the order will differ.
But as discussed in [1], the cache code already reverses the order of
objects when doing a mempool_get(). I'd say the reversing in cache code
is not really needed (only the order of object bulks should remain the
same). An rte_memcpy() looks to be faster, but it would require
some real-life tests to validate or invalidate this theory.

So to conclude, I still think both code in app/test and lib/mempool are
quite similar, and only one of them should be kept.

[1] http://www.dpdk.org/ml/archives/dev/2016-May/039873.html
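
A rough sketch of the two copy strategies mentioned above, in plain C
with memcpy() standing in for rte_memcpy(). The function names and the
array-plus-length layout are assumptions for illustration, not the
actual mempool cache code. Both variants hand out the most recently
stored bulk; they differ only in the order of objects inside that bulk.

```c
#include <assert.h>
#include <string.h>

/*
 * Pop n pointers from the top of a LIFO array one by one, so the most
 * recently stored object comes out first (order inside the bulk is
 * reversed).
 */
void
get_reversed(void **objs, unsigned *len, void **out, unsigned n)
{
	unsigned i;

	assert(n <= *len);
	for (i = 0; i < n; i++)
		out[i] = objs[*len - 1 - i];
	*len -= n;
}

/*
 * Pop the same n pointers with a single block copy, so the order
 * inside the bulk is kept as stored.
 */
void
get_memcpy(void **objs, unsigned *len, void **out, unsigned n)
{
	assert(n <= *len);
	memcpy(out, &objs[*len - n], n * sizeof(void *));
	*len -= n;
}
```

Whether the block copy is actually faster in a real application is not
something this sketch can show; that still needs the real-life tests
asked for above.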

> I think that the case for leaving it in as a test for the standard
> handler as part of the
> previous mempool handler is valid, but maybe there is a case for
> removing it if
> we add the stack handler. Maybe a future patch?
> 
>> Do you have some performance numbers? Do you know if it scales
>> with the number of cores?
> 
> For the mempool_perf_autotest, I'm seeing a 30% increase in performance
> for the
> local cache use-case for 1 - 36 cores (results vary within those tests
> between
> 10-45% gain, but with an average of 30% gain over all the tests.).
> 
> However, for the tests with no local cache configured, throughput of the
> enqueue/dequeue
> drops by about 30%, with the 36 core yielding the largest drop of 40%. So
> this handler would
> not be recommended in no-cache applications.

Interesting, thanks. If you also have real-life (I mean network)
performance tests, I'd be interested too.

Ideally, we should have documentation explaining in which cases one
handler or another should be used. However, if we don't know this
today, I'm not opposed to adding this new handler in 16.07, letting
people do their tests and comment, then describing it properly for 16.11.

What do you think?


Regards,
Olivier
  
Hunt, David June 20, 2016, 12:59 p.m. UTC | #5
Hi Olivier,

On 20/6/2016 9:17 AM, Olivier Matz wrote:
> Hi David,
>
> On 06/17/2016 04:18 PM, Hunt, David wrote:
>>> After reading it, I realize that it's nearly exactly the same code as
>>> in "app/test: test external mempool handler".
>>> http://patchwork.dpdk.org/dev/patchwork/patch/12896/
>>>
>>> We should drop one of them. If this stack handler is really useful for
>>> a performance use-case, it could go in librte_mempool. At the first
>>> read, the code looks like a demo example : it uses a simple spinlock for
>>> concurrent accesses to the common pool. Maybe the mempool cache hides
>>> this cost, in this case we could also consider removing the use of the
>>> rte_ring.
>> While I agree that the code is similar, the handler in the test is a
>> ring based handler,
>> whereas this patch adds an array based handler.
> Not sure I'm getting what you are saying. Do you mean stack instead
> of array?

Yes, apologies, stack.

> Actually, both are stacks when talking about bulks of objects. If we
> consider each object one by one, it's true that the order will differ.
> But as discussed in [1], the cache code already reverses the order of
> objects when doing a mempool_get(). I'd say the reversing in cache code
> is not really needed (only the order of object bulks should remain the
> same). An rte_memcpy() looks to be faster, but it would require
> some real-life tests to validate or invalidate this theory.
>
> So to conclude, I still think both code in app/test and lib/mempool are
> quite similar, and only one of them should be kept.
>
> [1] http://www.dpdk.org/ml/archives/dev/2016-May/039873.html

OK, so we will probably remove the test app portion in the future if it
is not needed, and if we apply the stack handler proposed in this patch
set.

>> I think that the case for leaving it in as a test for the standard
>> handler as part of the
>> previous mempool handler is valid, but maybe there is a case for
>> removing it if
>> we add the stack handler. Maybe a future patch?
>>
>>> Do you have some performance numbers? Do you know if it scales
>>> with the number of cores?
>> For the mempool_perf_autotest, I'm seeing a 30% increase in performance
>> for the
>> local cache use-case for 1 - 36 cores (results vary within those tests
>> between
>> 10-45% gain, but with an average of 30% gain over all the tests.).
>>
>> However, for the tests with no local cache configured, throughput of the
>> enqueue/dequeue
>> drops by about 30%, with the 36 core yielding the largest drop of 40%. So
>> this handler would
>> not be recommended in no-cache applications.
> Interesting, thanks. If you also have real-life (I mean network)
> performance tests, I'd be interested too.

I'm afraid I don't currently have any real-life performance tests.

> Ideally, we should have documentation explaining in which cases one
> handler or another should be used. However, if we don't know this
> today, I'm not opposed to adding this new handler in 16.07, letting
> people do their tests and comment, then describing it properly for 16.11.
>
> What do you think?

I agree. Add it in 16.07, and let people develop use cases for it, as
well as possibly coming up with new handlers for 16.11. There's talk of
hardware based handlers, I would also hope to see some of those
contributed soon.

Regards,
David.
  
Olivier Matz June 29, 2016, 2:31 p.m. UTC | #6
Hi Dave,

On 06/20/2016 02:59 PM, Hunt, David wrote:
> Hi Olivier,
>
> On 20/6/2016 9:17 AM, Olivier Matz wrote:
>> Hi David,
>>
>> On 06/17/2016 04:18 PM, Hunt, David wrote:
>>>> After reading it, I realize that it's nearly exactly the same code as
>>>> in "app/test: test external mempool handler".
>>>> http://patchwork.dpdk.org/dev/patchwork/patch/12896/
>>>>
>>>> We should drop one of them. If this stack handler is really useful for
>>>> a performance use-case, it could go in librte_mempool. At the first
>>>> read, the code looks like a demo example : it uses a simple spinlock
>>>> for
>>>> concurrent accesses to the common pool. Maybe the mempool cache hides
>>>> this cost, in this case we could also consider removing the use of the
>>>> rte_ring.
>>> While I agree that the code is similar, the handler in the test is a
>>> ring based handler,
>>> whereas this patch adds an array based handler.
>> Not sure I'm getting what you are saying. Do you mean stack instead
>> of array?
>
> Yes, apologies, stack.
>
>> Actually, both are stacks when talking about bulks of objects. If we
>> consider each object one by one, it's true that the order will differ.
>> But as discussed in [1], the cache code already reverses the order of
>> objects when doing a mempool_get(). I'd say the reversing in cache code
>> is not really needed (only the order of object bulks should remain the
>> same). An rte_memcpy() looks to be faster, but it would require
>> some real-life tests to validate or invalidate this theory.
>>
>> So to conclude, I still think both code in app/test and lib/mempool are
>> quite similar, and only one of them should be kept.
>>
>> [1] http://www.dpdk.org/ml/archives/dev/2016-May/039873.html
>
> OK, so we will probably remove the test app portion in the future if it
> is not needed, and if we apply the stack handler proposed in this patch
> set.

Back on this thread. Maybe I misunderstood what you were saying
here (because I see this comment is not addressed in v3).

I don't think we should add similar code at two different places
and then remove it later in another patchset. I feel it's better to
have only one instance of the stack handler, either in test, or
in librte_mempool.

If you plan to do a v4, I think this is something that could go in 
16.07-rc2.

Regards,
Olivier
  

Patch

diff --git a/lib/librte_mempool/Makefile b/lib/librte_mempool/Makefile
index f19366e..5aa9ef8 100644
--- a/lib/librte_mempool/Makefile
+++ b/lib/librte_mempool/Makefile
@@ -44,6 +44,7 @@  LIBABIVER := 2
 SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool.c
 SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_handler.c
 SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_default.c
+SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool_stack.c
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_MEMPOOL)-include := rte_mempool.h
 
diff --git a/lib/librte_mempool/rte_mempool_stack.c b/lib/librte_mempool/rte_mempool_stack.c
new file mode 100644
index 0000000..6e25028
--- /dev/null
+++ b/lib/librte_mempool/rte_mempool_stack.c
@@ -0,0 +1,145 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+
+struct rte_mempool_common_stack {
+	rte_spinlock_t sl;
+
+	uint32_t size;
+	uint32_t len;
+	void *objs[];
+};
+
+static void *
+common_stack_alloc(struct rte_mempool *mp)
+{
+	struct rte_mempool_common_stack *s;
+	unsigned n = mp->size;
+	int size = sizeof(*s) + (n+16)*sizeof(void *);
+
+	/* Allocate our local memory structure */
+	s = rte_zmalloc_socket("common-stack",
+			size,
+			RTE_CACHE_LINE_SIZE,
+			mp->socket_id);
+	if (s == NULL) {
+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate stack!\n");
+		return NULL;
+	}
+
+	rte_spinlock_init(&s->sl);
+
+	s->size = n;
+	mp->pool = s;
+	rte_mempool_set_handler(mp, "stack");
+
+	return s;
+}
+
+static int common_stack_put(void *p, void * const *obj_table,
+		unsigned n)
+{
+	struct rte_mempool_common_stack *s = p;
+	void **cache_objs;
+	unsigned index;
+
+	rte_spinlock_lock(&s->sl);
+	cache_objs = &s->objs[s->len];
+
+	/* Is there sufficient space in the stack ? */
+	if ((s->len + n) > s->size) {
+		rte_spinlock_unlock(&s->sl);
+		return -ENOENT;
+	}
+
+	/* Add elements back into the cache */
+	for (index = 0; index < n; ++index, obj_table++)
+		cache_objs[index] = *obj_table;
+
+	s->len += n;
+
+	rte_spinlock_unlock(&s->sl);
+	return 0;
+}
+
+static int common_stack_get(void *p, void **obj_table,
+		unsigned n)
+{
+	struct rte_mempool_common_stack *s = p;
+	void **cache_objs;
+	unsigned index, len;
+
+	rte_spinlock_lock(&s->sl);
+
+	if (unlikely(n > s->len)) {
+		rte_spinlock_unlock(&s->sl);
+		return -ENOENT;
+	}
+
+	cache_objs = s->objs;
+
+	for (index = 0, len = s->len - 1; index < n;
+			++index, len--, obj_table++)
+		*obj_table = cache_objs[len];
+
+	s->len -= n;
+	rte_spinlock_unlock(&s->sl);
+	return n;
+}
+
+static unsigned common_stack_get_count(void *p)
+{
+	struct rte_mempool_common_stack *s = p;
+
+	return s->len;
+}
+
+static void
+common_stack_free(void *p)
+{
+	rte_free(p);
+}
+
+static struct rte_mempool_handler handler_stack = {
+	.name = "stack",
+	.alloc = common_stack_alloc,
+	.free = common_stack_free,
+	.put = common_stack_put,
+	.get = common_stack_get,
+	.get_count = common_stack_get_count
+};
+
+MEMPOOL_REGISTER_HANDLER(handler_stack);
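
For readers who want to try the put/get logic outside DPDK, here is a
minimal standalone sketch in plain C with the review comments from this
thread applied: a failing put returns -ENOBUFS, and the allocator does
not select the handler itself. It is not the code of any later patch
version. The names (stack_sketch, sketch_alloc, sketch_put, sketch_get)
are hypothetical, a C11 atomic_flag spin loop stands in for
rte_spinlock_t, and calloc() stands in for rte_zmalloc_socket(). Unlike
the patch, it allocates exactly n slots rather than n+16.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct stack_sketch {
	atomic_flag lock;	/* stand-in for rte_spinlock_t */
	uint32_t size;		/* capacity, in objects */
	uint32_t len;		/* objects currently stored */
	void *objs[];		/* object pointers, top is objs[len - 1] */
};

static void
sketch_lock(struct stack_sketch *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
			memory_order_acquire))
		;	/* spin until the flag is released */
}

static void
sketch_unlock(struct stack_sketch *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Allocate an empty stack with room for n object pointers. */
struct stack_sketch *
sketch_alloc(uint32_t n)
{
	struct stack_sketch *s;

	s = calloc(1, sizeof(*s) + (size_t)n * sizeof(void *));
	if (s == NULL)
		return NULL;
	atomic_flag_clear(&s->lock);
	s->size = n;
	return s;
}

/* Push n objects; fails with -ENOBUFS if they do not all fit. */
int
sketch_put(struct stack_sketch *s, void * const *obj_table, uint32_t n)
{
	uint32_t i;

	assert(s != NULL && obj_table != NULL);
	sketch_lock(s);
	if (n > s->size - s->len) {
		sketch_unlock(s);
		return -ENOBUFS;
	}
	for (i = 0; i < n; i++)
		s->objs[s->len + i] = obj_table[i];
	s->len += n;
	sketch_unlock(s);
	return 0;
}

/* Pop n objects, most recent first; returns n as the patch does. */
int
sketch_get(struct stack_sketch *s, void **obj_table, uint32_t n)
{
	uint32_t i;

	assert(s != NULL && obj_table != NULL);
	sketch_lock(s);
	if (n > s->len) {
		sketch_unlock(s);
		return -ENOENT;
	}
	for (i = 0; i < n; i++)
		obj_table[i] = s->objs[s->len - 1 - i];
	s->len -= n;
	sketch_unlock(s);
	return (int)n;
}
```

A caller would create the stack with sketch_alloc(), push and pop bulks
of pointers, and release it with free(). A single lock around every
operation is the same design choice questioned in the review, so this
sketch says nothing about how the approach scales with core count.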