[dpdk-dev,1/6] cxgbe: Optimize forwarding performance for 40G

Message ID 318fc8559675b1157e7f049a6a955a6a2059bac7.1443704150.git.rahul.lakkireddy@chelsio.com (mailing list archive)
State Superseded, archived

Commit Message

Rahul Lakkireddy Oct. 2, 2015, 11:16 a.m. UTC
  Update sge initialization with respect to free-list manager configuration
and ingress arbiter. Also update refill logic to refill mbufs only after
a certain threshold for rx.  Optimize tx packet prefetch and free.

An improvement of approximately 4 Mpps is seen in forwarding performance
after these optimizations.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
---
 drivers/net/cxgbe/base/t4_regs.h | 16 ++++++++++++++++
 drivers/net/cxgbe/cxgbe_main.c   |  7 +++++++
 drivers/net/cxgbe/sge.c          | 17 ++++++++++++-----
 3 files changed, 35 insertions(+), 5 deletions(-)
  

Comments

Aaron Conole Oct. 2, 2015, 9:48 p.m. UTC | #1
Hi Rahul,

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> Update sge initialization with respect to free-list manager configuration
> and ingress arbiter. Also update refill logic to refill mbufs only after
> a certain threshold for rx.  Optimize tx packet prefetch and free.
<<snip>>
>  			for (i = 0; i < sd->coalesce.idx; i++) {
> -				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
> +				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
> +
> +				do {
> +					struct rte_mbuf *next = tmp->next;
> +
> +					rte_pktmbuf_free_seg(tmp);
> +					tmp = next;
> +				} while (tmp);
>  				sd->coalesce.mbuf[i] = NULL;
Pardon my ignorance here, but rte_pktmbuf_free does this work. I can't
actually see much difference between your rewrite of this block, and
the implementation of rte_pktmbuf_free() (apart from moving your branch
to the end of the function). Did your microbenchmarking really show this
as an improvement? 

Thanks for your time,
Aaron
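
[Editorial note: the loop Aaron is questioning does essentially what
rte_pktmbuf_free() does internally -- walk the mbuf segment chain and free
one segment at a time. A minimal toy model of that equivalence; the struct
and functions below are illustrative stand-ins, not the real DPDK
rte_mbuf API:]

```c
#include <stddef.h>

/* Toy stand-in for an mbuf segment chain; illustrative only,
 * not the real struct rte_mbuf. */
struct toy_mbuf {
	struct toy_mbuf *next;
	int freed;		/* track frees instead of returning to a pool */
};

/* Models rte_pktmbuf_free_seg(): release exactly one segment. */
static void toy_free_seg(struct toy_mbuf *m)
{
	m->freed = 1;
}

/* The patch's do/while loop: walk the chain, freeing segment by
 * segment -- logically the same work rte_pktmbuf_free() performs. */
static int toy_free_chain(struct toy_mbuf *m)
{
	int n = 0;

	do {
		struct toy_mbuf *next = m->next;

		toy_free_seg(m);
		n++;
		m = next;
	} while (m != NULL);

	return n;	/* number of segments freed */
}
```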
  
Rahul Lakkireddy Oct. 5, 2015, 10:06 a.m. UTC | #2
Hi Aaron,

On Friday, October 10/02/15, 2015 at 14:48:28 -0700, Aaron Conole wrote:
> Hi Rahul,
> 
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > Update sge initialization with respect to free-list manager configuration
> > and ingress arbiter. Also update refill logic to refill mbufs only after
> > a certain threshold for rx.  Optimize tx packet prefetch and free.
> <<snip>>
> >  			for (i = 0; i < sd->coalesce.idx; i++) {
> > -				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
> > +				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
> > +
> > +				do {
> > +					struct rte_mbuf *next = tmp->next;
> > +
> > +					rte_pktmbuf_free_seg(tmp);
> > +					tmp = next;
> > +				} while (tmp);
> >  				sd->coalesce.mbuf[i] = NULL;
> Pardon my ignorance here, but rte_pktmbuf_free does this work. I can't
> actually see much difference between your rewrite of this block, and
> the implementation of rte_pktmbuf_free() (apart from moving your branch
> to the end of the function). Did your microbenchmarking really show this
> as an improvement? 
> 
> Thanks for your time,
> Aaron

rte_pktmbuf_free calls rte_mbuf_sanity_check which does a lot of
checks.  This additional check seems redundant for single segment
packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.

Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
rte_pktmbuf_free as it is faster.

The forwarding perf. improvement with only this particular block is
around 1 Mpps for 64B packets when using l3fwd with 8 queues.

Thanks,
Rahul
  
Ananyev, Konstantin Oct. 5, 2015, 11:46 a.m. UTC | #3
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Rahul Lakkireddy
> Sent: Monday, October 05, 2015 11:06 AM
> To: Aaron Conole
> Cc: dev@dpdk.org; Felix Marti; Kumar A S; Nirranjan Kirubaharan
> Subject: Re: [dpdk-dev] [PATCH 1/6] cxgbe: Optimize forwarding performance for 40G
> 
> Hi Aaron,
> 
> On Friday, October 10/02/15, 2015 at 14:48:28 -0700, Aaron Conole wrote:
> > Hi Rahul,
> >
> > Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> >
> > > Update sge initialization with respect to free-list manager configuration
> > > and ingress arbiter. Also update refill logic to refill mbufs only after
> > > a certain threshold for rx.  Optimize tx packet prefetch and free.
> > <<snip>>
> > >  			for (i = 0; i < sd->coalesce.idx; i++) {
> > > -				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
> > > +				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
> > > +
> > > +				do {
> > > +					struct rte_mbuf *next = tmp->next;
> > > +
> > > +					rte_pktmbuf_free_seg(tmp);
> > > +					tmp = next;
> > > +				} while (tmp);
> > >  				sd->coalesce.mbuf[i] = NULL;
> > Pardon my ignorance here, but rte_pktmbuf_free does this work. I can't
> > actually see much difference between your rewrite of this block, and
> > the implementation of rte_pktmbuf_free() (apart from moving your branch
> > to the end of the function). Did your microbenchmarking really show this
> > as an improvement?
> >
> > Thanks for your time,
> > Aaron
> 
> rte_pktmbuf_free calls rte_mbuf_sanity_check which does a lot of
> checks. 

Only when RTE_LIBRTE_MBUF_DEBUG is enabled in your config.
By default it is switched off. 

> This additional check seems redundant for single segment
> packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.
> 
> Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
> rte_pktmbuf_free as it is faster.

Other PMDs use rte_pktmbuf_free_seg() because each TD has a segment
associated with it. So as soon as the HW is done with a TD, the SW frees the associated segment.
In your case I don't see any point in re-implementing rte_pktmbuf_free() manually,
and I don't think it would be any faster.

Konstantin  

> 
> The forwarding perf. improvement with only this particular block is
> around 1 Mpps for 64B packets when using l3fwd with 8 queues.
> 
> Thanks,
> Rahul
  
Rahul Lakkireddy Oct. 5, 2015, 12:42 p.m. UTC | #4
Hi Konstantin,

On Monday, October 10/05/15, 2015 at 04:46:40 -0700, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Rahul Lakkireddy
> > Sent: Monday, October 05, 2015 11:06 AM
> > To: Aaron Conole
> > Cc: dev@dpdk.org; Felix Marti; Kumar A S; Nirranjan Kirubaharan
> > Subject: Re: [dpdk-dev] [PATCH 1/6] cxgbe: Optimize forwarding performance for 40G
> > 
> > Hi Aaron,
> > 
> > On Friday, October 10/02/15, 2015 at 14:48:28 -0700, Aaron Conole wrote:
> > > Hi Rahul,
> > >
> > > Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> > >
> > > > Update sge initialization with respect to free-list manager configuration
> > > > and ingress arbiter. Also update refill logic to refill mbufs only after
> > > > a certain threshold for rx.  Optimize tx packet prefetch and free.
> > > <<snip>>
> > > >  			for (i = 0; i < sd->coalesce.idx; i++) {
> > > > -				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
> > > > +				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
> > > > +
> > > > +				do {
> > > > +					struct rte_mbuf *next = tmp->next;
> > > > +
> > > > +					rte_pktmbuf_free_seg(tmp);
> > > > +					tmp = next;
> > > > +				} while (tmp);
> > > >  				sd->coalesce.mbuf[i] = NULL;
> > > Pardon my ignorance here, but rte_pktmbuf_free does this work. I can't
> > > actually see much difference between your rewrite of this block, and
> > > the implementation of rte_pktmbuf_free() (apart from moving your branch
> > > to the end of the function). Did your microbenchmarking really show this
> > > as an improvement?
> > >
> > > Thanks for your time,
> > > Aaron
> > 
> > rte_pktmbuf_free calls rte_mbuf_sanity_check which does a lot of
> > checks. 
> 
> Only when RTE_LIBRTE_MBUF_DEBUG is enabled in your config.
> By default it is switched off. 

Right. I clearly missed this.
I am running with default config only btw.

> 
> > This additional check seems redundant for single segment
> > packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.
> > 
> > Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
> > rte_pktmbuf_free as it is faster.
> 
> Other PMDs use rte_pktmbuf_free_seg() as each TD has an associated 
> with it segment. So as HW is done with the TD, SW frees associated segment.
> In your case I don't see any point in re-implementing rte_pktmbuf_free() manually,
> and I don't think it would be any faster.
> 
> Konstantin  

As I mentioned below, I am clearly seeing a difference of 1 Mpps. And 1
Mpps is not a small difference IMHO.

When running l3fwd with 8 queues, I also collected a perf report.
When using rte_pktmbuf_free, I see that it eats up around 6% cpu as
below in perf top report:-
--------------------
32.00%  l3fwd                        [.] cxgbe_poll
22.25%  l3fwd                        [.] t4_eth_xmit
20.30%  l3fwd                        [.] main_loop
 6.77%  l3fwd                        [.] rte_pktmbuf_free
 4.86%  l3fwd                        [.] refill_fl_usembufs
 2.00%  l3fwd                        [.] write_sgl
.....
--------------------

While, when using rte_pktmbuf_free_seg directly, I don't see above
problem. perf top report now comes as:-
-------------------
33.36%  l3fwd                        [.] cxgbe_poll
32.69%  l3fwd                        [.] t4_eth_xmit
19.05%  l3fwd                        [.] main_loop
 5.21%  l3fwd                        [.] refill_fl_usembufs
 2.40%  l3fwd                        [.] write_sgl
....
-------------------

I obviously missed the debug flag for rte_mbuf_sanity_check.
However, there is a clear difference of 1 Mpps. I don't know whether it's
the switch from the while construct used in rte_pktmbuf_free to the
do..while construct that I used that is making the difference.


> 
> > 
> > The forwarding perf. improvement with only this particular block is
> > around 1 Mpps for 64B packets when using l3fwd with 8 queues.
> > 
> > Thanks,
> > Rahul
  
Ananyev, Konstantin Oct. 5, 2015, 2:09 p.m. UTC | #5
Hi Rahul,

> -----Original Message-----
> From: Rahul Lakkireddy [mailto:rahul.lakkireddy@chelsio.com]
> Sent: Monday, October 05, 2015 1:42 PM
> To: Ananyev, Konstantin
> Cc: Aaron Conole; dev@dpdk.org; Felix Marti; Kumar A S; Nirranjan Kirubaharan
> Subject: Re: [dpdk-dev] [PATCH 1/6] cxgbe: Optimize forwarding performance for 40G
> 
> Hi Konstantin,
> 
> On Monday, October 10/05/15, 2015 at 04:46:40 -0700, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Rahul Lakkireddy
> > > Sent: Monday, October 05, 2015 11:06 AM
> > > To: Aaron Conole
> > > Cc: dev@dpdk.org; Felix Marti; Kumar A S; Nirranjan Kirubaharan
> > > Subject: Re: [dpdk-dev] [PATCH 1/6] cxgbe: Optimize forwarding performance for 40G
> > >
> > > Hi Aaron,
> > >
> > > On Friday, October 10/02/15, 2015 at 14:48:28 -0700, Aaron Conole wrote:
> > > > Hi Rahul,
> > > >
> > > > Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> > > >
> > > > > Update sge initialization with respect to free-list manager configuration
> > > > > and ingress arbiter. Also update refill logic to refill mbufs only after
> > > > > a certain threshold for rx.  Optimize tx packet prefetch and free.
> > > > <<snip>>
> > > > >  			for (i = 0; i < sd->coalesce.idx; i++) {
> > > > > -				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
> > > > > +				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
> > > > > +
> > > > > +				do {
> > > > > +					struct rte_mbuf *next = tmp->next;
> > > > > +
> > > > > +					rte_pktmbuf_free_seg(tmp);
> > > > > +					tmp = next;
> > > > > +				} while (tmp);
> > > > >  				sd->coalesce.mbuf[i] = NULL;
> > > > Pardon my ignorance here, but rte_pktmbuf_free does this work. I can't
> > > > actually see much difference between your rewrite of this block, and
> > > > the implementation of rte_pktmbuf_free() (apart from moving your branch
> > > > to the end of the function). Did your microbenchmarking really show this
> > > > as an improvement?
> > > >
> > > > Thanks for your time,
> > > > Aaron
> > >
> > > rte_pktmbuf_free calls rte_mbuf_sanity_check which does a lot of
> > > checks.
> >
> > Only when RTE_LIBRTE_MBUF_DEBUG is enabled in your config.
> > By default it is switched off.
> 
> Right. I clearly missed this.
> I am running with default config only btw.
> 
> >
> > > This additional check seems redundant for single segment
> > > packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.
> > >
> > > Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
> > > rte_pktmbuf_free as it is faster.
> >
> > Other PMDs use rte_pktmbuf_free_seg() as each TD has an associated
> > with it segment. So as HW is done with the TD, SW frees associated segment.
> > In your case I don't see any point in re-implementing rte_pktmbuf_free() manually,
> > and I don't think it would be any faster.
> >
> > Konstantin
> 
> As I mentioned below, I am clearly seeing a difference of 1 Mpps. And 1
> Mpps is not a small difference IMHO.

Agree with you here - it is a significant difference.

> 
> When running l3fwd with 8 queues, I also collected a perf report.
> When using rte_pktmbuf_free, I see that it eats up around 6% cpu as
> below in perf top report:-
> --------------------
> 32.00%  l3fwd                        [.] cxgbe_poll
> 22.25%  l3fwd                        [.] t4_eth_xmit
> 20.30%  l3fwd                        [.] main_loop
>  6.77%  l3fwd                        [.] rte_pktmbuf_free
>  4.86%  l3fwd                        [.] refill_fl_usembufs
>  2.00%  l3fwd                        [.] write_sgl
> .....
> --------------------
> 
> While, when using rte_pktmbuf_free_seg directly, I don't see above
> problem. perf top report now comes as:-
> -------------------
> 33.36%  l3fwd                        [.] cxgbe_poll
> 32.69%  l3fwd                        [.] t4_eth_xmit
> 19.05%  l3fwd                        [.] main_loop
>  5.21%  l3fwd                        [.] refill_fl_usembufs
>  2.40%  l3fwd                        [.] write_sgl
> ....
> -------------------

I don't think these 6% disappeared anywhere.
As I can see, now t4_eth_xmit() increased by roughly same amount
(you still have same job to do).
To me it looks like in that case the compiler didn't really inline rte_pktmbuf_free().
I wonder, could you add the 'always_inline' attribute to rte_pktmbuf_free()
and see whether it makes any difference?

Konstantin 

> 
> I obviously missed the debug flag for rte_mbuf_sanity_check.
> However, there is a clear difference of 1 Mpps. I don't know if its the
> change between while construct used in rte_pktmbuf_free and the
> do..while construct that I used - is making the difference.
> 
> 
> >
> > >
> > > The forwarding perf. improvement with only this particular block is
> > > around 1 Mpps for 64B packets when using l3fwd with 8 queues.
> > >
> > > Thanks,
> > > Rahul
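
[Editorial note: Konstantin's suggestion refers to GCC's forced-inlining
function attribute. A minimal, self-contained illustration of the syntax he
means; the function here is a hypothetical stand-in, not the real
rte_pktmbuf_free():]

```c
#include <stdint.h>

/* GCC's always_inline attribute instructs the compiler to inline the
 * function even at -O0 or when its heuristics would decline.
 * demo_free_cost() is a trivial placeholder body so there is
 * something to inline. */
static inline __attribute__((always_inline))
uint32_t demo_free_cost(uint32_t nb_segs)
{
	/* pretend each segment costs two units to free */
	return nb_segs * 2u;
}
```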
  
Rahul Lakkireddy Oct. 5, 2015, 3:07 p.m. UTC | #6
On Monday, October 10/05/15, 2015 at 07:09:27 -0700, Ananyev, Konstantin wrote:
> Hi Rahul,

[...]

> > > > This additional check seems redundant for single segment
> > > > packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.
> > > >
> > > > Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
> > > > rte_pktmbuf_free as it is faster.
> > >
> > > Other PMDs use rte_pktmbuf_free_seg() as each TD has an associated
> > > with it segment. So as HW is done with the TD, SW frees associated segment.
> > > In your case I don't see any point in re-implementing rte_pktmbuf_free() manually,
> > > and I don't think it would be any faster.
> > >
> > > Konstantin
> > 
> > As I mentioned below, I am clearly seeing a difference of 1 Mpps. And 1
> > Mpps is not a small difference IMHO.
> 
> Agree with you here - it is a significant difference.
> 
> > 
> > When running l3fwd with 8 queues, I also collected a perf report.
> > When using rte_pktmbuf_free, I see that it eats up around 6% cpu as
> > below in perf top report:-
> > --------------------
> > 32.00%  l3fwd                        [.] cxgbe_poll
> > 22.25%  l3fwd                        [.] t4_eth_xmit
> > 20.30%  l3fwd                        [.] main_loop
> >  6.77%  l3fwd                        [.] rte_pktmbuf_free
> >  4.86%  l3fwd                        [.] refill_fl_usembufs
> >  2.00%  l3fwd                        [.] write_sgl
> > .....
> > --------------------
> > 
> > While, when using rte_pktmbuf_free_seg directly, I don't see above
> > problem. perf top report now comes as:-
> > -------------------
> > 33.36%  l3fwd                        [.] cxgbe_poll
> > 32.69%  l3fwd                        [.] t4_eth_xmit
> > 19.05%  l3fwd                        [.] main_loop
> >  5.21%  l3fwd                        [.] refill_fl_usembufs
> >  2.40%  l3fwd                        [.] write_sgl
> > ....
> > -------------------
> 
> I don't think these 6% disappeared anywhere.
> As I can see, now t4_eth_xmit() increased by roughly same amount
> (you still have same job to do).

Right.

> To me it looks like in that case compiler didn't really inline rte_pktmbuf_free().
> Wonder can you add 'always_inline' attribute to the  rte_pktmbuf_free(),
> and see would it make any difference?
> 
> Konstantin 

I will try out above and update further.


Thanks,
Rahul.
  
Rahul Lakkireddy Oct. 7, 2015, 3:27 p.m. UTC | #7
On Monday, October 10/05/15, 2015 at 20:37:31 +0530, Rahul Lakkireddy wrote:
> On Monday, October 10/05/15, 2015 at 07:09:27 -0700, Ananyev, Konstantin wrote:
> > Hi Rahul,
> 
> [...]
> 
> > > > > This additional check seems redundant for single segment
> > > > > packets since rte_pktmbuf_free_seg also performs rte_mbuf_sanity_check.
> > > > >
> > > > > Several PMDs already prefer to use rte_pktmbuf_free_seg directly over
> > > > > rte_pktmbuf_free as it is faster.
> > > >
> > > > Other PMDs use rte_pktmbuf_free_seg() as each TD has an associated
> > > > with it segment. So as HW is done with the TD, SW frees associated segment.
> > > > In your case I don't see any point in re-implementing rte_pktmbuf_free() manually,
> > > > and I don't think it would be any faster.
> > > >
> > > > Konstantin
> > > 
> > > As I mentioned below, I am clearly seeing a difference of 1 Mpps. And 1
> > > Mpps is not a small difference IMHO.
> > 
> > Agree with you here - it is a significant difference.
> > 
> > > 
> > > When running l3fwd with 8 queues, I also collected a perf report.
> > > When using rte_pktmbuf_free, I see that it eats up around 6% cpu as
> > > below in perf top report:-
> > > --------------------
> > > 32.00%  l3fwd                        [.] cxgbe_poll
> > > 22.25%  l3fwd                        [.] t4_eth_xmit
> > > 20.30%  l3fwd                        [.] main_loop
> > >  6.77%  l3fwd                        [.] rte_pktmbuf_free
> > >  4.86%  l3fwd                        [.] refill_fl_usembufs
> > >  2.00%  l3fwd                        [.] write_sgl
> > > .....
> > > --------------------
> > > 
> > > While, when using rte_pktmbuf_free_seg directly, I don't see above
> > > problem. perf top report now comes as:-
> > > -------------------
> > > 33.36%  l3fwd                        [.] cxgbe_poll
> > > 32.69%  l3fwd                        [.] t4_eth_xmit
> > > 19.05%  l3fwd                        [.] main_loop
> > >  5.21%  l3fwd                        [.] refill_fl_usembufs
> > >  2.40%  l3fwd                        [.] write_sgl
> > > ....
> > > -------------------
> > 
> > I don't think these 6% disappeared anywhere.
> > As I can see, now t4_eth_xmit() increased by roughly same amount
> > (you still have same job to do).
> 
> Right.
> 
> > To me it looks like in that case compiler didn't really inline rte_pktmbuf_free().
> > Wonder can you add 'always_inline' attribute to the  rte_pktmbuf_free(),
> > and see would it make any difference?
> > 
> > Konstantin 
> 
> I will try out above and update further.
> 

Tried always_inline and didn't see any difference in performance on
RHEL 6.4 with gcc 4.4.7, but was seeing a 1 Mpps improvement with the
above block.

I've moved to the latest RHEL 7.1 with gcc 4.8.3 and tried both
always_inline and the above block, and I'm not seeing any difference
with either.

Will drop this block and submit a v2.

Thanks for the review Aaron and Konstantin.

Thanks,
Rahul
  

Patch

diff --git a/drivers/net/cxgbe/base/t4_regs.h b/drivers/net/cxgbe/base/t4_regs.h
index cd28b59..9057e40 100644
--- a/drivers/net/cxgbe/base/t4_regs.h
+++ b/drivers/net/cxgbe/base/t4_regs.h
@@ -266,6 +266,18 @@ 
 #define A_SGE_FL_BUFFER_SIZE2 0x104c
 #define A_SGE_FL_BUFFER_SIZE3 0x1050
 
+#define A_SGE_FLM_CFG 0x1090
+
+#define S_CREDITCNT    4
+#define M_CREDITCNT    0x3U
+#define V_CREDITCNT(x) ((x) << S_CREDITCNT)
+#define G_CREDITCNT(x) (((x) >> S_CREDITCNT) & M_CREDITCNT)
+
+#define S_CREDITCNTPACKING    2
+#define M_CREDITCNTPACKING    0x3U
+#define V_CREDITCNTPACKING(x) ((x) << S_CREDITCNTPACKING)
+#define G_CREDITCNTPACKING(x) (((x) >> S_CREDITCNTPACKING) & M_CREDITCNTPACKING)
+
 #define A_SGE_CONM_CTRL 0x1094
 
 #define S_EGRTHRESHOLD    8
@@ -361,6 +373,10 @@ 
 
 #define A_SGE_CONTROL2 0x1124
 
+#define S_IDMAARBROUNDROBIN    19
+#define V_IDMAARBROUNDROBIN(x) ((x) << S_IDMAARBROUNDROBIN)
+#define F_IDMAARBROUNDROBIN    V_IDMAARBROUNDROBIN(1U)
+
 #define S_INGPACKBOUNDARY    16
 #define M_INGPACKBOUNDARY    0x7U
 #define V_INGPACKBOUNDARY(x) ((x) << S_INGPACKBOUNDARY)
diff --git a/drivers/net/cxgbe/cxgbe_main.c b/drivers/net/cxgbe/cxgbe_main.c
index 3755444..316b87d 100644
--- a/drivers/net/cxgbe/cxgbe_main.c
+++ b/drivers/net/cxgbe/cxgbe_main.c
@@ -422,6 +422,13 @@  static int adap_init0_tweaks(struct adapter *adapter)
 	t4_set_reg_field(adapter, A_SGE_CONTROL, V_PKTSHIFT(M_PKTSHIFT),
 			 V_PKTSHIFT(rx_dma_offset));
 
+	t4_set_reg_field(adapter, A_SGE_FLM_CFG,
+			 V_CREDITCNT(M_CREDITCNT) |
+			 V_CREDITCNTPACKING(M_CREDITCNTPACKING),
+			 V_CREDITCNT(3) | V_CREDITCNTPACKING(1));
+
+	t4_set_reg_field(adapter, A_SGE_CONTROL2, V_IDMAARBROUNDROBIN(1U),
+			 V_IDMAARBROUNDROBIN(1U));
+
 	/*
 	 * Don't include the "IP Pseudo Header" in CPL_RX_PKT checksums: Linux
 	 * adds the pseudo header itself.
diff --git a/drivers/net/cxgbe/sge.c b/drivers/net/cxgbe/sge.c
index 6eb1244..e540881 100644
--- a/drivers/net/cxgbe/sge.c
+++ b/drivers/net/cxgbe/sge.c
@@ -286,8 +286,7 @@  static void unmap_rx_buf(struct sge_fl *q)
 
 static inline void ring_fl_db(struct adapter *adap, struct sge_fl *q)
 {
-	/* see if we have exceeded q->size / 4 */
-	if (q->pend_cred >= (q->size / 4)) {
+	if (q->pend_cred >= 64) {
 		u32 val = adap->params.arch.sge_fl_db;
 
 		if (is_t4(adap->params.chip))
@@ -995,7 +994,14 @@  static inline int tx_do_packet_coalesce(struct sge_eth_txq *txq,
 			int i;
 
 			for (i = 0; i < sd->coalesce.idx; i++) {
-				rte_pktmbuf_free(sd->coalesce.mbuf[i]);
+				struct rte_mbuf *tmp = sd->coalesce.mbuf[i];
+
+				do {
+					struct rte_mbuf *next = tmp->next;
+
+					rte_pktmbuf_free_seg(tmp);
+					tmp = next;
+				} while (tmp);
 				sd->coalesce.mbuf[i] = NULL;
 			}
 		}
@@ -1054,7 +1060,6 @@  out_free:
 		return 0;
 	}
 
-	rte_prefetch0(&((&txq->q)->sdesc->mbuf->pool));
 	pi = (struct port_info *)txq->eth_dev->data->dev_private;
 	adap = pi->adapter;
 
@@ -1070,6 +1075,7 @@  out_free:
 				txq->stats.mapping_err++;
 				goto out_free;
 			}
+			rte_prefetch0((volatile void *)addr);
 			return tx_do_packet_coalesce(txq, mbuf, cflits, adap,
 						     pi, addr);
 		} else {
@@ -1454,7 +1460,8 @@  static int process_responses(struct sge_rspq *q, int budget,
 			unsigned int params;
 			u32 val;
 
-			__refill_fl(q->adapter, &rxq->fl);
+			if (fl_cap(&rxq->fl) - rxq->fl.avail >= 64)
+				__refill_fl(q->adapter, &rxq->fl);
 			params = V_QINTR_TIMER_IDX(X_TIMERREG_UPDATE_CIDX);
 			q->next_intr_params = params;
 			val = V_CIDXINC(cidx_inc) | V_SEINTARM(params);
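
[Editorial note: the sge.c hunks above replace dynamic thresholds with fixed
64-credit thresholds for both the free-list doorbell and the RX refill. A toy
sketch of the doorbell-side logic; the field names mirror the patch, while
the credit bookkeeping detail is an assumption made for illustration:]

```c
#include <stdint.h>

/* Toy model of the new threshold in ring_fl_db(): ring the doorbell
 * only once at least 64 free-list credits are pending, instead of
 * waiting for q->size / 4. Names are illustrative, not the real
 * struct sge_fl. */
struct toy_fl {
	uint32_t size;		/* free-list capacity */
	uint32_t pend_cred;	/* credits not yet told to HW */
	uint32_t doorbells;	/* count of doorbell writes */
};

static void toy_ring_fl_db(struct toy_fl *q)
{
	if (q->pend_cred >= 64) {
		q->doorbells++;
		/* assume HW consumes credits in units of 8;
		 * keep the remainder pending */
		q->pend_cred &= 7;
	}
}
```

With a fixed small threshold, the doorbell fires sooner on large rings than
the old q->size / 4 rule would allow, keeping the free list topped up.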