[v2] net/mana: use rte_pktmbuf_alloc_bulk for allocating RX WQEs
Commit Message
From: Long Li <longli@microsoft.com>
Instead of allocating mbufs one by one during RX, use rte_pktmbuf_alloc_bulk()
to allocate them in a batch.
Signed-off-by: Long Li <longli@microsoft.com>
---
Change in v2:
use rte_calloc_socket() in place of rte_calloc()
drivers/net/mana/rx.c | 68 ++++++++++++++++++++++++++++---------------
1 file changed, 44 insertions(+), 24 deletions(-)
Comments
On 1/30/2024 1:13 AM, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
>
> Instead of allocating mbufs one by one during RX, use rte_pktmbuf_alloc_bulk()
> to allocate them in a batch.
>
> Signed-off-by: Long Li <longli@microsoft.com>
>
Can you please quantify the performance improvement (as a percentage)?
This clarifies the impact of the modification.
<...>
> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
> * Post work requests for a Rx queue.
> */
> static int
> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> {
> int ret;
> uint32_t i;
> + struct rte_mbuf **mbufs;
> +
> + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> + 0, rxq->mp->socket_id);
> + if (!mbufs)
> + return -ENOMEM;
>
'mbufs' is temporary storage for the allocated mbuf pointers; why not
allocate it from the stack instead? It can be faster and easier to manage:
"struct rte_mbuf *mbufs[count]"
> +
> + ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> + if (ret) {
> + DP_LOG(ERR, "failed to allocate mbufs for RX");
> + rxq->stats.nombuf += count;
> + goto fail;
> + }
>
> #ifdef RTE_ARCH_32
> rxq->wqe_cnt_to_short_db = 0;
> #endif
> - for (i = 0; i < rxq->num_desc; i++) {
> - ret = mana_alloc_and_post_rx_wqe(rxq);
> + for (i = 0; i < count; i++) {
> + ret = mana_post_rx_wqe(rxq, mbufs[i]);
> if (ret) {
> DP_LOG(ERR, "failed to post RX ret = %d", ret);
> - return ret;
> + goto fail;
>
This may leak memory. There are allocated mbufs; if we exit from the loop here
and free the 'mbufs' array, how will the remaining mbufs be freed?
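The concern above is what a follow-up version of the patch has to address: on a mid-loop posting failure, the remaining (unposted) mbufs must be freed individually before the pointer array itself is freed. Below is a minimal plain-C sketch of that cleanup pattern; the `obj`, `bulk_alloc`, and `post_wqe` mocks are hypothetical stand-ins for `rte_mbuf`, `rte_pktmbuf_alloc_bulk()`, and `mana_post_rx_wqe()`, not the driver's actual code:

```c
#include <assert.h>
#include <stdlib.h>

struct obj { int id; };              /* stand-in for struct rte_mbuf */

static int nfreed;                   /* objects returned to the "pool" */

static void obj_free(struct obj *o)
{
	nfreed++;
	free(o);
}

/* Stand-in for rte_pktmbuf_alloc_bulk(): fill 'out' with 'count' objects. */
static int bulk_alloc(struct obj **out, int count)
{
	for (int i = 0; i < count; i++) {
		out[i] = malloc(sizeof(**out));
		if (!out[i])
			return -1;
		out[i]->id = i;
	}
	return 0;
}

/* Stand-in for mana_post_rx_wqe(): "fails" at index fail_at; on success
 * the queue takes ownership (mocked here by freeing immediately). */
static int post_wqe(struct obj *o, int fail_at)
{
	if (o->id == fail_at)
		return -1;
	obj_free(o);
	return 0;
}

/* The pattern under discussion: on a mid-loop posting failure, free the
 * failing object and every not-yet-posted one before freeing the array. */
static int alloc_and_post(int count, int fail_at)
{
	struct obj **objs = calloc(count, sizeof(*objs));
	int i, ret = 0;

	if (!objs)
		return -1;
	if (bulk_alloc(objs, count)) {
		free(objs);
		return -1;
	}
	for (i = 0; i < count; i++) {
		ret = post_wqe(objs[i], fail_at);
		if (ret) {
			for (; i < count; i++)  /* includes the failing one */
				obj_free(objs[i]);
			break;
		}
	}
	free(objs);
	return ret;
}
```

With this shape, every allocated object is accounted for whether the posting loop completes or not.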
On Tue, 30 Jan 2024 10:19:32 +0000
Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> > {
> > int ret;
> > uint32_t i;
> > + struct rte_mbuf **mbufs;
> > +
> > + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> > + 0, rxq->mp->socket_id);
> > + if (!mbufs)
> > + return -ENOMEM;
> >
>
> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
> allocate if from stack instead, can be faster and easier to manage:
> "struct rte_mbuf *mbufs[count]"
That would introduce a variable length array.
VLAs should be removed; they are not supported on Windows, and many
security tools flag them. The problem is that they make the code brittle
if count gets huge.
But certainly regular calloc() or alloca() would work here.
On Tue, Jan 30, 2024 at 08:43:52AM -0800, Stephen Hemminger wrote:
> On Tue, 30 Jan 2024 10:19:32 +0000
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>
> > > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> > > {
> > > int ret;
> > > uint32_t i;
> > > + struct rte_mbuf **mbufs;
> > > +
> > > + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> > > + 0, rxq->mp->socket_id);
> > > + if (!mbufs)
> > > + return -ENOMEM;
> > >
> >
> > 'mbufs' is temporarily storage for allocated mbuf pointers, why not
> > allocate if from stack instead, can be faster and easier to manage:
> > "struct rte_mbuf *mbufs[count]"
>
> That would introduce a variable length array.
> VLA's should be removed, they are not supported on Windows and many
> security tools flag them. The problem is that it makes the code brittle
> if count gets huge.
+1
>
> But certainly regular calloc() or alloca() would work here.
> Can you please quantify the performance improvement (as percentage), this
> clarifies the impact of the modification.
I didn't see any meaningful performance improvement in benchmarks. However, this should reduce CPU cycles and potential locking conflicts in real-world applications.
Using batch allocation was one of the review comments during initial driver submission, suggested by Stephen Hemminger. I promised to fix it at that time. Sorry it took a while to submit this patch.
>
> <...>
>
> > @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
> *rxq)
> > * Post work requests for a Rx queue.
> > */
> > static int
> > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> > {
> > int ret;
> > uint32_t i;
> > + struct rte_mbuf **mbufs;
> > +
> > + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
> rte_mbuf *),
> > + 0, rxq->mp->socket_id);
> > + if (!mbufs)
> > + return -ENOMEM;
> >
>
> 'mbufs' is temporarily storage for allocated mbuf pointers, why not allocate if from
> stack instead, can be faster and easier to manage:
> "struct rte_mbuf *mbufs[count]"
>
>
> > +
> > + ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> > + if (ret) {
> > + DP_LOG(ERR, "failed to allocate mbufs for RX");
> > + rxq->stats.nombuf += count;
> > + goto fail;
> > + }
> >
> > #ifdef RTE_ARCH_32
> > rxq->wqe_cnt_to_short_db = 0;
> > #endif
> > - for (i = 0; i < rxq->num_desc; i++) {
> > - ret = mana_alloc_and_post_rx_wqe(rxq);
> > + for (i = 0; i < count; i++) {
> > + ret = mana_post_rx_wqe(rxq, mbufs[i]);
> > if (ret) {
> > DP_LOG(ERR, "failed to post RX ret = %d", ret);
> > - return ret;
> > + goto fail;
> >
>
> This may leak memory. There are allocated mbufs, if exit from loop here and free
> 'mubfs' variable, how remaining mubfs will be freed?
Mbufs are always freed after fail:
fail:
rte_free(mbufs);
>
On 1/30/2024 9:30 PM, Long Li wrote:
>> Can you please quantify the performance improvement (as percentage), this
>> clarifies the impact of the modification.
>
> I didn't see any meaningful performance improvements in benchmarks. However, this should improve CPU cycles and reduce potential locking conflicts in real-world applications.
>
> Using batch allocation was one of the review comments during initial driver submission, suggested by Stephen Hemminger. I promised to fix it at that time. Sorry it took a while to submit this patch.
>
That is OK, using bulk alloc is a reasonable approach. Only, can you please
document the impact (performance change) in the commit log?
>>
>> <...>
>>
>>> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
>> *rxq)
>>> * Post work requests for a Rx queue.
>>> */
>>> static int
>>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
>>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
>>> {
>>> int ret;
>>> uint32_t i;
>>> + struct rte_mbuf **mbufs;
>>> +
>>> + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
>> rte_mbuf *),
>>> + 0, rxq->mp->socket_id);
>>> + if (!mbufs)
>>> + return -ENOMEM;
>>>
>>
>> 'mbufs' is temporarily storage for allocated mbuf pointers, why not allocate if from
>> stack instead, can be faster and easier to manage:
>> "struct rte_mbuf *mbufs[count]"
>>
>>
>>> +
>>> + ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
>>> + if (ret) {
>>> + DP_LOG(ERR, "failed to allocate mbufs for RX");
>>> + rxq->stats.nombuf += count;
>>> + goto fail;
>>> + }
>>>
>>> #ifdef RTE_ARCH_32
>>> rxq->wqe_cnt_to_short_db = 0;
>>> #endif
>>> - for (i = 0; i < rxq->num_desc; i++) {
>>> - ret = mana_alloc_and_post_rx_wqe(rxq);
>>> + for (i = 0; i < count; i++) {
>>> + ret = mana_post_rx_wqe(rxq, mbufs[i]);
>>> if (ret) {
>>> DP_LOG(ERR, "failed to post RX ret = %d", ret);
>>> - return ret;
>>> + goto fail;
>>>
>>
>> This may leak memory. There are allocated mbufs, if exit from loop here and free
>> 'mubfs' variable, how remaining mubfs will be freed?
>
> Mbufs are always freed after fail:
>
> fail:
> rte_free(mbufs);
>
Nope, I am not talking about the 'mbufs' variable, I am talking about
mbuf pointers stored in the 'mbufs' array which are allocated by
'rte_pktmbuf_alloc_bulk()'.
> Subject: Re: [Patch v2] net/mana: use rte_pktmbuf_alloc_bulk for allocating RX
> WQEs
>
> On 1/30/2024 9:30 PM, Long Li wrote:
> >> Can you please quantify the performance improvement (as percentage),
> >> this clarifies the impact of the modification.
> >
> > I didn't see any meaningful performance improvements in benchmarks.
> However, this should improve CPU cycles and reduce potential locking conflicts in
> real-world applications.
> >
> > Using batch allocation was one of the review comments during initial driver
> submission, suggested by Stephen Hemminger. I promised to fix it at that time.
> Sorry it took a while to submit this patch.
> >
>
> That is OK, using bulk alloc is reasonable approach, only can you please document
> the impact (performance increase) in the commit log.
Will do that.
>
> >>
> >> <...>
> >>
> >>> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
> >> *rxq)
> >>> * Post work requests for a Rx queue.
> >>> */
> >>> static int
> >>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> >>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> >>> {
> >>> int ret;
> >>> uint32_t i;
> >>> + struct rte_mbuf **mbufs;
> >>> +
> >>> + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
> >> rte_mbuf *),
> >>> + 0, rxq->mp->socket_id);
> >>> + if (!mbufs)
> >>> + return -ENOMEM;
> >>>
> >>
> >> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
> >> allocate if from stack instead, can be faster and easier to manage:
> >> "struct rte_mbuf *mbufs[count]"
> >>
> >>
> >>> +
> >>> + ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> >>> + if (ret) {
> >>> + DP_LOG(ERR, "failed to allocate mbufs for RX");
> >>> + rxq->stats.nombuf += count;
> >>> + goto fail;
> >>> + }
> >>>
> >>> #ifdef RTE_ARCH_32
> >>> rxq->wqe_cnt_to_short_db = 0;
> >>> #endif
> >>> - for (i = 0; i < rxq->num_desc; i++) {
> >>> - ret = mana_alloc_and_post_rx_wqe(rxq);
> >>> + for (i = 0; i < count; i++) {
> >>> + ret = mana_post_rx_wqe(rxq, mbufs[i]);
> >>> if (ret) {
> >>> DP_LOG(ERR, "failed to post RX ret = %d", ret);
> >>> - return ret;
> >>> + goto fail;
> >>>
> >>
> >> This may leak memory. There are allocated mbufs, if exit from loop
> >> here and free 'mubfs' variable, how remaining mubfs will be freed?
> >
> > Mbufs are always freed after fail:
> >
> > fail:
> > rte_free(mbufs);
> >
>
> Nope, I am not talking about the 'mbufs' variable, I am talking about mbuf
> pointers stored in the 'mbufs' array which are allocated by
> 'rte_pktmbuf_alloc_bulk()'.
You are right, I'm sending v3 to fix those.
Long
On 1/30/2024 4:43 PM, Stephen Hemminger wrote:
> On Tue, 30 Jan 2024 10:19:32 +0000
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>
>>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
>>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
>>> {
>>> int ret;
>>> uint32_t i;
>>> + struct rte_mbuf **mbufs;
>>> +
>>> + mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
>>> + 0, rxq->mp->socket_id);
>>> + if (!mbufs)
>>> + return -ENOMEM;
>>>
>>
>> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
>> allocate if from stack instead, can be faster and easier to manage:
>> "struct rte_mbuf *mbufs[count]"
>
> That would introduce a variable length array.
> VLA's should be removed, they are not supported on Windows and many
> security tools flag them. The problem is that it makes the code brittle
> if count gets huge.
>
> But certainly regular calloc() or alloca() would work here.
>
Most of the existing bulk alloc code already uses VLAs, but I can see the
problem with them not being supported on Windows.
As this mbuf pointer array is short-lived within the function, and this is
in the fast path, the repeated alloc and free can be avoided.
One option is to define a fixed-size, big-enough array, which requires an
additional loop for the cases where 'count' is bigger than the array size;
or an array can be allocated at driver init in the device-specific data, as
we know it will be needed continuously in the datapath, and it can be
freed during device close()/uninit().
I think a fixed-size array from the stack is easier and would be my
preference.
> >> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
> >> allocate if from stack instead, can be faster and easier to manage:
> >> "struct rte_mbuf *mbufs[count]"
> >
> > That would introduce a variable length array.
> > VLA's should be removed, they are not supported on Windows and many
> > security tools flag them. The problem is that it makes the code
> > brittle if count gets huge.
> >
> > But certainly regular calloc() or alloca() would work here.
> >
>
> Most of the existing bulk alloc already uses VLA but I can see the problem it is not
> being supported by Windows.
>
> As this mbuf pointer array is short lived within the function, and being in the fast
> path, I think continuous alloc and free can be prevented,
>
> one option can be to define a fixed size, big enough, array which requires
> additional loop for the cases 'count' size is bigger than array size,
>
> or an array can be allocated by driver init in device specific data ,as we know it
> will be required continuously in the datapath, and it can be freed during device
> close()/uninit().
>
> I think an fixed size array from stack is easier and can be preferred.
I sent a v3 of the patch, still using alloc().
I found two problems with using a fixed array:
1. The array size needs to be determined in advance, and I don't know what a good number would be. If too big, some of it may be wasted (and it may make a bigger mess of the CPU cache). If too small, we end up doing multiple allocations, which is the problem this patch is trying to solve.
2. It makes the code slightly more complex, but I think 1 is the main problem.
I think another approach is to use VLA by default, but for Windows use alloc().
Long
On 2/1/2024 3:55 AM, Long Li wrote:
>>>> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
>>>> allocate if from stack instead, can be faster and easier to manage:
>>>> "struct rte_mbuf *mbufs[count]"
>>>
>>> That would introduce a variable length array.
>>> VLA's should be removed, they are not supported on Windows and many
>>> security tools flag them. The problem is that it makes the code
>>> brittle if count gets huge.
>>>
>>> But certainly regular calloc() or alloca() would work here.
>>>
>>
>> Most of the existing bulk alloc already uses VLA but I can see the problem it is not
>> being supported by Windows.
>>
>> As this mbuf pointer array is short lived within the function, and being in the fast
>> path, I think continuous alloc and free can be prevented,
>>
>> one option can be to define a fixed size, big enough, array which requires
>> additional loop for the cases 'count' size is bigger than array size,
>>
>> or an array can be allocated by driver init in device specific data ,as we know it
>> will be required continuously in the datapath, and it can be freed during device
>> close()/uninit().
>>
>> I think an fixed size array from stack is easier and can be preferred.
>
> I sent a v3 of the patch, still using alloc().
>
> I found two problems with using a fixed array:
> 1. the array size needs to be determined in advance. I don't know what a good number should be. If too big, some of them may be wasted. (and maybe make a bigger mess of CPU cache) If too small, it ends up doing multiple allocations, which is the problem this patch trying to solve.
>
I think the default burst size 32 can be used, like below:
struct rte_mbuf *mbufs[32];
loop:	/* or use do {} while () if you prefer */
	n = min(32, count);
	rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, n);
	for (i = 0; i < n; i++)
		mana_post_rx_wqe(rxq, mbufs[i]);
	count -= n;
	if (count > 0)
		goto loop;
This additional loop doesn't make the code much more complex (I think no more
than the additional alloc() & free() do), and it doesn't waste memory.
I suggest doing a performance measurement with the above change; it may
increase performance.
Afterwards, if you insist on going with the original code, we can do it.
> 2. if makes the code slightly more complex ,but I think 1 is the main problem.
>
> I think another approach is to use VLA by default, but for Windows use alloc().
>
> Long
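Ferruh's fixed-array loop can be sketched as a compilable unit. Everything below — the `obj` type, the mock pool, `bulk_alloc`, and `post_wqe` — is a hypothetical stand-in for the DPDK/mana counterparts, meant only to show that chunking through a 32-entry stack array handles any `count` with no VLA and no per-call heap allocation:

```c
#include <assert.h>

#define BATCH 32                     /* default burst size */

struct obj { int dummy; };           /* stand-in for struct rte_mbuf */

static struct obj pool[4096];        /* mock mempool backing store */
static int pool_taken;
static int nposted;
static int max_batch;                /* largest single bulk request seen */

/* Stand-in for rte_pktmbuf_alloc_bulk(). */
static int bulk_alloc(struct obj **out, int n)
{
	if (pool_taken + n > (int)(sizeof(pool) / sizeof(pool[0])))
		return -1;
	for (int i = 0; i < n; i++)
		out[i] = &pool[pool_taken++];
	if (n > max_batch)
		max_batch = n;
	return 0;
}

/* Stand-in for mana_post_rx_wqe(). */
static int post_wqe(struct obj *o)
{
	(void)o;
	nposted++;
	return 0;
}

/* Fixed-size stack array + loop: allocate and post in chunks of at most
 * BATCH objects, so 'count' can be arbitrarily large. */
static int alloc_and_post(int count)
{
	struct obj *objs[BATCH];
	int i, n, ret;

	while (count > 0) {
		n = count < BATCH ? count : BATCH;
		ret = bulk_alloc(objs, n);
		if (ret)
			return ret;
		for (i = 0; i < n; i++) {
			ret = post_wqe(objs[i]);
			if (ret)
				return ret;
		}
		count -= n;
	}
	return 0;
}
```

For example, a request for 100 objects is served as three chunks of 32 plus one of 4, all through the same 32-pointer stack array.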
On Thu, Feb 01, 2024 at 03:55:55AM +0000, Long Li wrote:
> > >> 'mbufs' is temporarily storage for allocated mbuf pointers, why not
> > >> allocate if from stack instead, can be faster and easier to manage:
> > >> "struct rte_mbuf *mbufs[count]"
> > >
> > > That would introduce a variable length array.
> > > VLA's should be removed, they are not supported on Windows and many
> > > security tools flag them. The problem is that it makes the code
> > > brittle if count gets huge.
> > >
> > > But certainly regular calloc() or alloca() would work here.
> > >
> >
> > Most of the existing bulk alloc already uses VLA but I can see the problem it is not
> > being supported by Windows.
> >
> > As this mbuf pointer array is short lived within the function, and being in the fast
> > path, I think continuous alloc and free can be prevented,
> >
> > one option can be to define a fixed size, big enough, array which requires
> > additional loop for the cases 'count' size is bigger than array size,
> >
> > or an array can be allocated by driver init in device specific data ,as we know it
> > will be required continuously in the datapath, and it can be freed during device
> > close()/uninit().
> >
> > I think an fixed size array from stack is easier and can be preferred.
>
> I sent a v3 of the patch, still using alloc().
>
> I found two problems with using a fixed array:
> 1. the array size needs to be determined in advance. I don't know what a good number should be. If too big, some of them may be wasted. (and maybe make a bigger mess of CPU cache) If too small, it ends up doing multiple allocations, which is the problem this patch trying to solve.
> 2. if makes the code slightly more complex ,but I think 1 is the main problem.
>
> I think another approach is to use VLA by default, but for Windows use alloc().
A few thoughts on VLAs you may consider; not to be regarded as a strong
objection.
Indications are that standard C will gradually phase out VLAs, because
they're generally accepted as having been a bad idea. That said,
compilers that implement them will probably keep them forever.
VLAs generate a lot of code relative to just using a more permanent
allocation. It may not show up in your performance tests, but you also may
not want it on your hot path either.
mana doesn't currently support Windows; are there plans to support it? If
never, then I suppose VLAs can be used, since all the toolchains you care
about have them. Though it does raise the bar, cause more work and a later
refactor, and carry regression risk should you change your mind and choose
to port to Windows.
Accepting the use of VLAs anywhere in DPDK prevents running a general
checkpatch and/or compiling with compiler options that detect and flag
their inclusion as part of the CI, without adding exclusion logic for the
drivers that are allowed to use them.
>
> Long
> I think the default burst size 32 can be used, like below:
>
> struct rte_mbuf *mbufs[32];
>
> loop:	/* or use do {} while () if you prefer */
>	n = min(32, count);
>	rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, n);
>	for (i = 0; i < n; i++)
>		mana_post_rx_wqe(rxq, mbufs[i]);
>	count -= n;
>	if (count > 0)
>		goto loop;
>
>
> This additional loop doesn't make code very complex (I think not more than
> additional alloc() & free()) and it doesn't waste memory.
> I suggest doing a performance measurement with above change, it may increase
> performance, afterwards if you insist to go with original code, we can do it.
>
I submitted v4 with your suggestions. The code doesn't end up looking very messy. I measured the same performance with and without the patch.
Thanks,
Long
> > I think another approach is to use VLA by default, but for Windows use alloc().
>
> a few thoughts on VLAs you may consider. not to be regarded as a strong
> objection.
>
> indications are that standard C will gradually phase out VLAs because they're
> generally accepted as having been a bad idea. that said compilers that implement
> them will probably keep them forever.
>
> VLAs generate a lot of code relative to just using a more permanent allocation.
> may not show up in your performance tests but you also may not want it on your
> hotpath either.
>
> mana doesn't currently support windows, are there plans to support windows? if
> never then i suppose VLAs can be used since all the toolchains you care about
> have them. though it does raise the bar, cause more work, later refactor, carry
> regression risk should you change your mind and choose to port to windows.
>
> accepting the use of VLAs anywhere in dpdk prohibits general checkpatches
> and/or compiling with compiler options that detect and flag their inclusion as a
> part of the CI without having to add exclusion logic for drivers that are allowed to
> use them.
>
I agree we need to keep the code consistent. I submitted v4 using a fixed array.
Thanks,
Long
@@ -2,6 +2,7 @@
* Copyright 2022 Microsoft Corporation
*/
#include <ethdev_driver.h>
+#include <rte_malloc.h>
#include <infiniband/verbs.h>
#include <infiniband/manadv.h>
@@ -59,9 +60,8 @@ mana_rq_ring_doorbell(struct mana_rxq *rxq)
}
static int
-mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
+mana_post_rx_wqe(struct mana_rxq *rxq, struct rte_mbuf *mbuf)
{
- struct rte_mbuf *mbuf = NULL;
struct gdma_sgl_element sgl[1];
struct gdma_work_request request;
uint32_t wqe_size_in_bu;
@@ -69,12 +69,6 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
int ret;
struct mana_mr_cache *mr;
- mbuf = rte_pktmbuf_alloc(rxq->mp);
- if (!mbuf) {
- rxq->stats.nombuf++;
- return -ENOMEM;
- }
-
mr = mana_alloc_pmd_mr(&rxq->mr_btree, priv, mbuf);
if (!mr) {
DP_LOG(ERR, "failed to register RX MR");
@@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
* Post work requests for a Rx queue.
*/
static int
-mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
+mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
{
int ret;
uint32_t i;
+ struct rte_mbuf **mbufs;
+
+ mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
+ 0, rxq->mp->socket_id);
+ if (!mbufs)
+ return -ENOMEM;
+
+ ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
+ if (ret) {
+ DP_LOG(ERR, "failed to allocate mbufs for RX");
+ rxq->stats.nombuf += count;
+ goto fail;
+ }
#ifdef RTE_ARCH_32
rxq->wqe_cnt_to_short_db = 0;
#endif
- for (i = 0; i < rxq->num_desc; i++) {
- ret = mana_alloc_and_post_rx_wqe(rxq);
+ for (i = 0; i < count; i++) {
+ ret = mana_post_rx_wqe(rxq, mbufs[i]);
if (ret) {
DP_LOG(ERR, "failed to post RX ret = %d", ret);
- return ret;
+ goto fail;
}
#ifdef RTE_ARCH_32
@@ -146,6 +153,8 @@ mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
mana_rq_ring_doorbell(rxq);
+fail:
+ rte_free(mbufs);
return ret;
}
@@ -404,7 +413,9 @@ mana_start_rx_queues(struct rte_eth_dev *dev)
}
for (i = 0; i < priv->num_queues; i++) {
- ret = mana_alloc_and_post_rx_wqes(dev->data->rx_queues[i]);
+ struct mana_rxq *rxq = dev->data->rx_queues[i];
+
+ ret = mana_alloc_and_post_rx_wqes(rxq, rxq->num_desc);
if (ret)
goto fail;
}
@@ -423,7 +434,7 @@ uint16_t
mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
{
uint16_t pkt_received = 0;
- uint16_t wqe_posted = 0;
+ uint16_t wqe_consumed = 0;
struct mana_rxq *rxq = dpdk_rxq;
struct mana_priv *priv = rxq->priv;
struct rte_mbuf *mbuf;
@@ -535,18 +546,23 @@ mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
rxq->gdma_rq.tail += desc->wqe_size_in_bu;
- /* Consume this request and post another request */
- ret = mana_alloc_and_post_rx_wqe(rxq);
- if (ret) {
- DP_LOG(ERR, "failed to post rx wqe ret=%d", ret);
- break;
- }
-
- wqe_posted++;
+ /* Record the number of the RX WQE we need to post to replenish
+ * consumed RX requests
+ */
+ wqe_consumed++;
if (pkt_received == pkts_n)
break;
#ifdef RTE_ARCH_32
+ /* Always post WQE as soon as it's consumed for short DB */
+ ret = mana_alloc_and_post_rx_wqes(rxq, wqe_consumed);
+ if (ret) {
+ DRV_LOG(ERR, "failed to post %d WQEs, ret %d",
+ wqe_consumed, ret);
+ return pkt_received;
+ }
+ wqe_consumed = 0;
+
/* Ring short doorbell if approaching the wqe increment
* limit.
*/
@@ -569,8 +585,12 @@ mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
goto repoll;
}
- if (wqe_posted)
- mana_rq_ring_doorbell(rxq);
+ if (wqe_consumed) {
+ ret = mana_alloc_and_post_rx_wqes(rxq, wqe_consumed);
+ if (ret)
+ DRV_LOG(ERR, "failed to post %d WQEs, ret %d",
+ wqe_consumed, ret);
+ }
return pkt_received;
}