examples/l3fwd: enable multiple Tx queues on a lcore

Message ID 20201104072810.105498-1-leyi.rong@intel.com (mailing list archive)
State Rejected, archived
Delegated to: Thomas Monjalon
Series: examples/l3fwd: enable multiple Tx queues on a lcore

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/travis-robot warning Travis build: failed
ci/iol-testing success Testing PASS
ci/iol-intel-Functional fail Functional Testing issues
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Leyi Rong Nov. 4, 2020, 7:28 a.m. UTC
  Currently, l3fwd doesn't support multiple Tx queues, while
multiple Rx queues are supported.
To improve throughput when polling multiple queues, this
patch enables handling multiple Tx queues on a lcore.

Signed-off-by: Leyi Rong <leyi.rong@intel.com>
---
 examples/l3fwd/l3fwd_common.h        |  6 ++---
 examples/l3fwd/l3fwd_em.c            |  2 +-
 examples/l3fwd/l3fwd_em_hlm.h        |  4 ++--
 examples/l3fwd/l3fwd_em_sequential.h |  5 ++--
 examples/l3fwd/l3fwd_lpm.c           |  2 +-
 examples/l3fwd/l3fwd_lpm_sse.h       |  5 ++--
 examples/l3fwd/l3fwd_sse.h           |  4 ++--
 examples/l3fwd/main.c                | 35 ++++++++++++++++------------
 8 files changed, 35 insertions(+), 28 deletions(-)
  

Comments

David Marchand Nov. 4, 2020, 8:14 a.m. UTC | #1
If I count correctly, this is v3 of the patch.
Please version your patches.

On Wed, Nov 4, 2020 at 8:52 AM Leyi Rong <leyi.rong@intel.com> wrote:
>
> Currently, l3fwd doesn't support multiple Tx queues, while
> multiple Rx queues is supported.
> To improve the throughput performance when polling multiple
> queues, this patch enables multiple Tx queues handling on a lcore.

Why would there be a gain in using multiple txq?
Is it with hw txq? sw txq? .. ?
  
Leyi Rong Nov. 4, 2020, 8:34 a.m. UTC | #2
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, November 4, 2020 4:14 PM
> To: Rong, Leyi <leyi.rong@intel.com>
> Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a lcore
> 
> If I count well, this is the v3 of the patch.
> Please version your patches.

The previous versions are set to superseded. As nothing changed in the content
of those versions, can we start from this version?

> 
> On Wed, Nov 4, 2020 at 8:52 AM Leyi Rong <leyi.rong@intel.com> wrote:
> >
> > Currently, l3fwd doesn't support multiple Tx queues, while multiple Rx
> > queues is supported.
> > To improve the throughput performance when polling multiple queues,
> > this patch enables multiple Tx queues handling on a lcore.
> 
> Why would there be a gain in using multiple txq?
> Is it with hw txq? sw txq? .. ?
> 
> 
> --
> David Marchand

As there is always a per-queue throughput limit, some performance test cases using l3fwd
will be limited by that per-queue throughput limit. With multiple Tx queues enabled, the per
queue throughput limit can be eliminated if the CPU core is not the bottleneck.
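
As a rough illustration, here is a condensed sketch of the per-lcore loop with the
patch applied. It is not the patch code: the helper name poll_and_forward() is made
up, the route lookup and the periodic Tx drain of the real main loop are omitted,
and it assumes the lcore_conf layout from the example's l3fwd.h.

/* Condensed sketch, not the patch code: each burst is transmitted on the
 * same queue id it was received on, so a lcore that polls several Rx
 * queues also spreads its Tx load over several Tx queues. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include "l3fwd.h"	/* struct lcore_conf, MAX_PKT_BURST */

static void
poll_and_forward(struct lcore_conf *qconf, uint16_t dst_portid)
{
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
	uint16_t portid, queueid, nb_rx, nb_tx;
	int i;

	for (i = 0; i < qconf->n_rx_queue; i++) {
		portid = qconf->rx_queue_list[i].port_id;
		queueid = qconf->rx_queue_list[i].queue_id;

		nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
					 MAX_PKT_BURST);
		if (nb_rx == 0)
			continue;

		/* route lookup elided; send everything to dst_portid */
		nb_tx = rte_eth_tx_burst(dst_portid, queueid, pkts_burst,
					 nb_rx);

		/* free whatever the Tx queue could not accept */
		while (nb_tx < nb_rx)
			rte_pktmbuf_free(pkts_burst[nb_tx++]);
	}
}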

Leyi
  
David Marchand Nov. 4, 2020, 8:43 a.m. UTC | #3
On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com> wrote:
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> > Sent: Wednesday, November 4, 2020 4:14 PM
> > To: Rong, Leyi <leyi.rong@intel.com>
> > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a lcore
> >
> > If I count well, this is the v3 of the patch.
> > Please version your patches.
>
> The previous versions are set to superseded. As nothing changes with content
> on those versions, can start from this version?

The commitlog changes even if the code itself did not change, so this
is a different patch.
Different patches mean different versions.
This shows that some work happened since the v1 submission.


> As there always has thoughput limit for per queue, on some performance test case by using l3fwd,
> the result will limited by the per queue thoughput limit. With multiple Tx queue enabled, the per
> queue thoughput limit can be eliminated if the CPU core is not the bottleneck.

Ah interesting.
Which nic has such limitations?
How much of an improvement can be expected from this?
  
Leyi Rong Nov. 4, 2020, 9:04 a.m. UTC | #4
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, November 4, 2020 4:43 PM
> To: Rong, Leyi <leyi.rong@intel.com>
> Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a lcore
> 
> On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > -----Original Message-----
> > > From: David Marchand <david.marchand@redhat.com>
> > > Sent: Wednesday, November 4, 2020 4:14 PM
> > > To: Rong, Leyi <leyi.rong@intel.com>
> > > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a
> > > lcore
> > >
> > > If I count well, this is the v3 of the patch.
> > > Please version your patches.
> >
> > The previous versions are set to superseded. As nothing changes with
> > content on those versions, can start from this version?
> 
> The commitlog changes even if the code itself did not change, so this is a
> different patch.
> Different patches mean different versions.
> This shows that some work happened since the v1 submission.
> 

Agreed.
> 
> > As there always has thoughput limit for per queue, on some performance
> > test case by using l3fwd, the result will limited by the per queue
> > thoughput limit. With multiple Tx queue enabled, the per queue thoughput
> limit can be eliminated if the CPU core is not the bottleneck.
> 
> Ah interesting.
> Which nic has such limitations?
> How much of an improvement can be expected from this?
> 
> 
> --
> David Marchand

The issue was initially found on an XXV710 25Gb NIC, but such an issue can presumably happen on more NICs,
as the per-core capability of a high-end CPU is higher than the per-queue performance limit of many NICs (except 100Gb and above).
The improvement can be about 1.8X with that case @1t2q.

Leyi
  
Jerin Jacob Nov. 5, 2020, 7:14 a.m. UTC | #5
On Wed, Nov 4, 2020 at 2:34 PM Rong, Leyi <leyi.rong@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> > Sent: Wednesday, November 4, 2020 4:43 PM
> > To: Rong, Leyi <leyi.rong@intel.com>
> > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a lcore
> >
> > On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > > -----Original Message-----
> > > > From: David Marchand <david.marchand@redhat.com>
> > > > Sent: Wednesday, November 4, 2020 4:14 PM
> > > > To: Rong, Leyi <leyi.rong@intel.com>
> > > > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > > > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a
> > > > lcore
> > > >
> > > > If I count well, this is the v3 of the patch.
> > > > Please version your patches.
> > >
> > > The previous versions are set to superseded. As nothing changes with
> > > content on those versions, can start from this version?
> >
> > The commitlog changes even if the code itself did not change, so this is a
> > different patch.
> > Different patches mean different versions.
> > This shows that some work happened since the v1 submission.
> >
>
> Agreed.
> >
> > > As there always has thoughput limit for per queue, on some performance
> > > test case by using l3fwd, the result will limited by the per queue
> > > thoughput limit. With multiple Tx queue enabled, the per queue thoughput
> > limit can be eliminated if the CPU core is not the bottleneck.
> >
> > Ah interesting.
> > Which nic has such limitations?
> > How much of an improvement can be expected from this?
> >
> >
> > --
> > David Marchand
>
> The initial found was on XXV710 25Gb NIC, but suppose such issue can happen on more NICs
> as the high-end CPU per core boundary is higher than many NICs(except 100Gb and above) per queue performance boundary.
> The improvement can be about 1.8X with that case@1t2q.

As far as I understand, the current l3fwd Tx queue creation is like this:
if the app has N cores and M ports, then l3fwd creates N x M Tx queues in total.
What will the new values be based on this patch?

Does this patch have any regression in case the NIC queues are able to cope
with the throughput limit from the CPU?
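
For reference, the N x M count comes from the pre-patch setup in main.c, condensed
below from the hunk at the bottom of this page (declarations, the clamp to
MAX_TX_QUEUE_PER_PORT and error handling omitted): every enabled lcore gets its
own Tx queue on every port.

/* Pre-patch behaviour, condensed: one Tx queue per enabled lcore on each
 * port, i.e. N lcores x M ports Tx queues in total. */
n_tx_queue = nb_lcores;
queueid = 0;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
	if (rte_lcore_is_enabled(lcore_id) == 0)
		continue;
	rte_eth_tx_queue_setup(portid, queueid, nb_txd, socketid, txconf);
	lcore_conf[lcore_id].tx_queue_id[portid] = queueid;
	queueid++;
}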


>
> Leyi
>
  
Leyi Rong Nov. 5, 2020, 9:24 a.m. UTC | #6
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, November 5, 2020 3:15 PM
> To: Rong, Leyi <leyi.rong@intel.com>
> Cc: David Marchand <david.marchand@redhat.com>; Zhang, Qi Z
> <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [dpdk-dev] [PATCH] examples/l3fwd: enable multiple Tx queues on
> a lcore
> 
> On Wed, Nov 4, 2020 at 2:34 PM Rong, Leyi <leyi.rong@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: David Marchand <david.marchand@redhat.com>
> > > Sent: Wednesday, November 4, 2020 4:43 PM
> > > To: Rong, Leyi <leyi.rong@intel.com>
> > > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues on a
> > > lcore
> > >
> > > On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > > > -----Original Message-----
> > > > > From: David Marchand <david.marchand@redhat.com>
> > > > > Sent: Wednesday, November 4, 2020 4:14 PM
> > > > > To: Rong, Leyi <leyi.rong@intel.com>
> > > > > Cc: Zhang, Qi Z <qi.z.zhang@intel.com>; dev <dev@dpdk.org>
> > > > > Subject: Re: [PATCH] examples/l3fwd: enable multiple Tx queues
> > > > > on a lcore
> > > > >
> > > > > If I count well, this is the v3 of the patch.
> > > > > Please version your patches.
> > > >
> > > > The previous versions are set to superseded. As nothing changes
> > > > with content on those versions, can start from this version?
> > >
> > > The commitlog changes even if the code itself did not change, so
> > > this is a different patch.
> > > Different patches mean different versions.
> > > This shows that some work happened since the v1 submission.
> > >
> >
> > Agreed.
> > >
> > > > As there always has thoughput limit for per queue, on some
> > > > performance test case by using l3fwd, the result will limited by
> > > > the per queue thoughput limit. With multiple Tx queue enabled, the
> > > > per queue thoughput
> > > limit can be eliminated if the CPU core is not the bottleneck.
> > >
> > > Ah interesting.
> > > Which nic has such limitations?
> > > How much of an improvement can be expected from this?
> > >
> > >
> > > --
> > > David Marchand
> >
> > The initial found was on XXV710 25Gb NIC, but suppose such issue can
> > happen on more NICs as the high-end CPU per core boundary is higher than
> many NICs(except 100Gb and above) per queue performance boundary.
> > The improvement can be about 1.8X with that case@1t2q.
> 
> As far as I understand, the Current l3fwd Tx queue creation is like this:
> If the app has N cores and M ports then l3fwd creates, N x M Tx queues in total,
> What will be new values based on this patch?
> 

Hi Jacob,

The total number of queues equals the number of queues per port multiplied by the number of ports.
Take #l3fwd -l 5,6 -n 6 -- -p 0x3 --config '(0,0,5),(0,1,5),(1,0,6),(1,1,6)' as an example:
with this patch applied, 2x2=4 Tx queues in total can be polled, while only
1x2=2 Tx queues could be used before.
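
Condensed from the main.c hunk at the bottom of this page (declarations and error
handling omitted), the per-port Tx queue count now follows the Rx queue count, and
each lcore sets up one Tx queue per Rx queue it polls:

n_tx_queue = nb_rx_queue;	/* was: n_tx_queue = nb_lcores */
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
	if (rte_lcore_is_enabled(lcore_id) == 0)
		continue;
	qconf = &lcore_conf[lcore_id];
	/* one Tx queue per Rx queue this lcore polls, same queue id */
	for (queue = 0; queue < qconf->n_rx_queue; queue++) {
		queueid = qconf->rx_queue_list[queue].queue_id;
		rte_eth_tx_queue_setup(portid, queueid, nb_txd,
				       socketid, txconf);
	}
	qconf->tx_queue_id[portid] = queueid;
}

So in the example above, a lcore that polls two Rx queues of a port also has two Tx
queues it can transmit on, instead of the single per-lcore Tx queue it had before.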


> Does this patch has any regression in case the NIC queues able to cope up with
> the throughput limit from CPU.
> 

The regression tests relevant to l3fwd passed with this patch, with no obvious
result drop in other cases.

> 
> >
> > Leyi
> >
  
Thomas Monjalon March 24, 2021, 5:23 p.m. UTC | #7
05/11/2020 10:24, Rong, Leyi:
> From: Jerin Jacob <jerinjacobk@gmail.com>
> > On Wed, Nov 4, 2020 at 2:34 PM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > From: David Marchand <david.marchand@redhat.com>
> > > > On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > > > As there always has thoughput limit for per queue, on some
> > > > > performance test case by using l3fwd, the result will limited by
> > > > > the per queue thoughput limit. With multiple Tx queue enabled, the
> > > > > per queue thoughput
> > > > > limit can be eliminated if the CPU core is not the bottleneck.
> > > >
> > > > Ah interesting.
> > > > Which nic has such limitations?
> > > > How much of an improvement can be expected from this?
> > >
> > > The initial found was on XXV710 25Gb NIC, but suppose such issue can
> > > happen on more NICs as the high-end CPU per core boundary is higher than
> > > many NICs(except 100Gb and above) per queue performance boundary.
> > > The improvement can be about 1.8X with that case@1t2q.
> > 
> > As far as I understand, the Current l3fwd Tx queue creation is like this:
> > If the app has N cores and M ports then l3fwd creates, N x M Tx queues in total,
> > What will be new values based on this patch?

Thank you Jerin for providing some info missing in the description of the patch.

> Hi Jacob,
> 
> Total queues number equals to queues per port multiply port number.
> Just take #l3fwd -l 5,6 -n 6 -- -p 0x3 --config '(0,0,5),(0,1,5),(1,0,6),(1,1,6)' as example, 
> With this patch appied, totally 2x2=4 tx queues can be polled, while only
> 1x2=2 tx queues can be used before.

It does not answer the above question in terms of N x M.

> > Does this patch has any regression in case the NIC queues able to cope up with
> > the throughput limit from CPU.
> 
> Regression test relevant with l3fwd passed with this patch, no obvious result drop 
> on other cases.

It does not answer the general question for all the drivers you did not test.

As you probably noticed, this patch has been blocked for months
because it is not properly explained.
  
Leyi Rong March 25, 2021, 5:38 a.m. UTC | #8
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Thursday, March 25, 2021 1:23 AM
> To: Jerin Jacob <jerinjacobk@gmail.com>; David Marchand
> <david.marchand@redhat.com>; Rong, Leyi <leyi.rong@intel.com>
> Cc: dev@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] examples/l3fwd: enable multiple Tx queues on
> a lcore
> 
> 05/11/2020 10:24, Rong, Leyi:
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > On Wed, Nov 4, 2020 at 2:34 PM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > > From: David Marchand <david.marchand@redhat.com>
> > > > > On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com>
> wrote:
> > > > > > As there always has thoughput limit for per queue, on some
> > > > > > performance test case by using l3fwd, the result will limited
> > > > > > by the per queue thoughput limit. With multiple Tx queue
> > > > > > enabled, the per queue thoughput limit can be eliminated if
> > > > > > the CPU core is not the bottleneck.
> > > > >
> > > > > Ah interesting.
> > > > > Which nic has such limitations?
> > > > > How much of an improvement can be expected from this?
> > > >
> > > > The initial found was on XXV710 25Gb NIC, but suppose such issue
> > > > can happen on more NICs as the high-end CPU per core boundary is
> > > > higher than many NICs(except 100Gb and above) per queue performance
> boundary.
> > > > The improvement can be about 1.8X with that case@1t2q.
> > >
> > > As far as I understand, the Current l3fwd Tx queue creation is like this:
> > > If the app has N cores and M ports then l3fwd creates, N x M Tx
> > > queues in total, What will be new values based on this patch?
> 
> Thank you Jerin for providing some info missing in the description of the patch.
> 
> > Hi Jacob,
> >
> > Total queues number equals to queues per port multiply port number.
> > Just take #l3fwd -l 5,6 -n 6 -- -p 0x3 --config
> > '(0,0,5),(0,1,5),(1,0,6),(1,1,6)' as example, With this patch appied,
> > totally 2x2=4 tx queues can be polled, while only
> > 1x2=2 tx queues can be used before.
> 
> It does not reply above question with N x M.
> 
> > > Does this patch has any regression in case the NIC queues able to
> > > cope up with the throughput limit from CPU.
> >
> > Regression test relevant with l3fwd passed with this patch, no obvious
> > result drop on other cases.
> 
> It does not reply the general question for all drivers you did not test.
> 
> As you probably noticed, this patch is blocked for months because it is not
> properly explained.
> 

Hi Thomas,

This patch can be abandoned, as synced with Konstantin months ago. Please update its state to superseded on patchwork, thanks!
  
Thomas Monjalon March 25, 2021, 8:10 a.m. UTC | #9
25/03/2021 06:38, Rong, Leyi:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 05/11/2020 10:24, Rong, Leyi:
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > On Wed, Nov 4, 2020 at 2:34 PM Rong, Leyi <leyi.rong@intel.com> wrote:
> > > > > From: David Marchand <david.marchand@redhat.com>
> > > > > > On Wed, Nov 4, 2020 at 9:34 AM Rong, Leyi <leyi.rong@intel.com>
> > wrote:
> > > > > > > As there always has thoughput limit for per queue, on some
> > > > > > > performance test case by using l3fwd, the result will limited
> > > > > > > by the per queue thoughput limit. With multiple Tx queue
> > > > > > > enabled, the per queue thoughput limit can be eliminated if
> > > > > > > the CPU core is not the bottleneck.
> > > > > >
> > > > > > Ah interesting.
> > > > > > Which nic has such limitations?
> > > > > > How much of an improvement can be expected from this?
> > > > >
> > > > > The initial found was on XXV710 25Gb NIC, but suppose such issue
> > > > > can happen on more NICs as the high-end CPU per core boundary is
> > > > > higher than many NICs(except 100Gb and above) per queue performance
> > boundary.
> > > > > The improvement can be about 1.8X with that case@1t2q.
> > > >
> > > > As far as I understand, the Current l3fwd Tx queue creation is like this:
> > > > If the app has N cores and M ports then l3fwd creates, N x M Tx
> > > > queues in total, What will be new values based on this patch?
> > 
> > Thank you Jerin for providing some info missing in the description of the patch.
> > 
> > > Hi Jacob,
> > >
> > > Total queues number equals to queues per port multiply port number.
> > > Just take #l3fwd -l 5,6 -n 6 -- -p 0x3 --config
> > > '(0,0,5),(0,1,5),(1,0,6),(1,1,6)' as example, With this patch appied,
> > > totally 2x2=4 tx queues can be polled, while only
> > > 1x2=2 tx queues can be used before.
> > 
> > It does not reply above question with N x M.
> > 
> > > > Does this patch has any regression in case the NIC queues able to
> > > > cope up with the throughput limit from CPU.
> > >
> > > Regression test relevant with l3fwd passed with this patch, no obvious
> > > result drop on other cases.
> > 
> > It does not reply the general question for all drivers you did not test.
> > 
> > As you probably noticed, this patch is blocked for months because it is not
> > properly explained.
> > 
> 
> Hi Thomas,
> 
> This patch can be abandoned after synced with Konstantin months ago. And update the state to superseded on patchwork, Thanks!

"Superseded" means a newer version has been sent.
I will change it to "Rejected".
  

Patch

diff --git a/examples/l3fwd/l3fwd_common.h b/examples/l3fwd/l3fwd_common.h
index 7d83ff641a..ab114af8c6 100644
--- a/examples/l3fwd/l3fwd_common.h
+++ b/examples/l3fwd/l3fwd_common.h
@@ -178,8 +178,8 @@  static const struct {
 };
 
 static __rte_always_inline void
-send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
-		uint32_t num)
+send_packetsx4(struct lcore_conf *qconf, uint16_t port, uint16_t queueid,
+	       struct rte_mbuf *m[], uint32_t num)
 {
 	uint32_t len, j, n;
 
@@ -190,7 +190,7 @@  send_packetsx4(struct lcore_conf *qconf, uint16_t port, struct rte_mbuf *m[],
 	 * then send them straightway.
 	 */
 	if (num >= MAX_TX_BURST && len == 0) {
-		n = rte_eth_tx_burst(port, qconf->tx_queue_id[port], m, num);
+		n = rte_eth_tx_burst(port, queueid, m, num);
 		if (unlikely(n < num)) {
 			do {
 				rte_pktmbuf_free(m[n]);
diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index 9996bfba34..8fddb8d55d 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -686,7 +686,7 @@  em_main_loop(__rte_unused void *dummy)
 
 #if defined RTE_ARCH_X86 || defined __ARM_NEON
 			l3fwd_em_send_packets(nb_rx, pkts_burst,
-							portid, qconf);
+							portid, queueid, qconf);
 #else
 			l3fwd_em_no_opt_send_packets(nb_rx, pkts_burst,
 							portid, qconf);
diff --git a/examples/l3fwd/l3fwd_em_hlm.h b/examples/l3fwd/l3fwd_em_hlm.h
index 278707c18c..d08f393eed 100644
--- a/examples/l3fwd/l3fwd_em_hlm.h
+++ b/examples/l3fwd/l3fwd_em_hlm.h
@@ -183,7 +183,7 @@  em_get_dst_port(const struct lcore_conf *qconf, struct rte_mbuf *pkt,
  */
 static inline void
 l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
-		uint16_t portid, struct lcore_conf *qconf)
+		uint16_t portid, uint16_t queueid, struct lcore_conf *qconf)
 {
 	int32_t i, j, pos;
 	uint16_t dst_port[MAX_PKT_BURST];
@@ -238,7 +238,7 @@  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	for (; j < nb_rx; j++)
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
 
-	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
+	send_packets_multi(qconf, pkts_burst, dst_port, queueid, nb_rx);
 
 }
 
diff --git a/examples/l3fwd/l3fwd_em_sequential.h b/examples/l3fwd/l3fwd_em_sequential.h
index 6170052cf8..2d7071b0c9 100644
--- a/examples/l3fwd/l3fwd_em_sequential.h
+++ b/examples/l3fwd/l3fwd_em_sequential.h
@@ -74,7 +74,8 @@  em_get_dst_port(const struct lcore_conf *qconf, struct rte_mbuf *pkt,
  */
 static inline void
 l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
-			uint16_t portid, struct lcore_conf *qconf)
+		      uint16_t portid, uint16_t queueid,
+		      struct lcore_conf *qconf)
 {
 	int32_t i, j;
 	uint16_t dst_port[MAX_PKT_BURST];
@@ -93,7 +94,7 @@  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
 	}
 
-	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
+	send_packets_multi(qconf, pkts_burst, dst_port, queueid, nb_rx);
 }
 
 /*
diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index 3dcf1fef18..8153150c37 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -243,7 +243,7 @@  lpm_main_loop(__rte_unused void *dummy)
 #if defined RTE_ARCH_X86 || defined __ARM_NEON \
 			 || defined RTE_ARCH_PPC_64
 			l3fwd_lpm_send_packets(nb_rx, pkts_burst,
-						portid, qconf);
+						portid, queueid, qconf);
 #else
 			l3fwd_lpm_no_opt_send_packets(nb_rx, pkts_burst,
 							portid, qconf);
diff --git a/examples/l3fwd/l3fwd_lpm_sse.h b/examples/l3fwd/l3fwd_lpm_sse.h
index 3f637a23d1..cd68179b76 100644
--- a/examples/l3fwd/l3fwd_lpm_sse.h
+++ b/examples/l3fwd/l3fwd_lpm_sse.h
@@ -83,7 +83,8 @@  processx4_step2(const struct lcore_conf *qconf,
  */
 static inline void
 l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
-			uint16_t portid, struct lcore_conf *qconf)
+		       uint16_t portid, uint16_t queueid,
+		       struct lcore_conf *qconf)
 {
 	int32_t j;
 	uint16_t dst_port[MAX_PKT_BURST];
@@ -114,7 +115,7 @@  l3fwd_lpm_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 		j++;
 	}
 
-	send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
+	send_packets_multi(qconf, pkts_burst, dst_port, queueid, nb_rx);
 }
 
 #endif /* __L3FWD_LPM_SSE_H__ */
diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index bb565ed546..f91580a4ce 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -125,7 +125,7 @@  process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
  */
 static __rte_always_inline void
 send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
-		uint16_t dst_port[MAX_PKT_BURST], int nb_rx)
+		uint16_t dst_port[MAX_PKT_BURST], uint16_t queueid, int nb_rx)
 {
 	int32_t k;
 	int j = 0;
@@ -220,7 +220,7 @@  send_packets_multi(struct lcore_conf *qconf, struct rte_mbuf **pkts_burst,
 		k = pnum[j];
 
 		if (likely(pn != BAD_PORT))
-			send_packetsx4(qconf, pn, pkts_burst + j, k);
+			send_packetsx4(qconf, pn, queueid, pkts_burst + j, k);
 		else
 			for (m = j; m != j + k; m++)
 				rte_pktmbuf_free(pkts_burst[m]);
diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index d62dec434c..93922e7d48 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -935,7 +935,7 @@  l3fwd_poll_resource_setup(void)
 		fflush(stdout);
 
 		nb_rx_queue = get_port_n_rx_queues(portid);
-		n_tx_queue = nb_lcores;
+		n_tx_queue = nb_rx_queue;
 		if (n_tx_queue > MAX_TX_QUEUE_PER_PORT)
 			n_tx_queue = MAX_TX_QUEUE_PER_PORT;
 		printf("Creating queues: nb_rxq=%d nb_txq=%u... ",
@@ -1006,11 +1006,12 @@  l3fwd_poll_resource_setup(void)
 		if (ret < 0)
 			rte_exit(EXIT_FAILURE, "init_mem failed\n");
 
-		/* init one TX queue per couple (lcore,port) */
+		/* init TX queues per couple (lcore,port) */
 		queueid = 0;
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
+			qconf = &lcore_conf[lcore_id];
 
 			if (numa_on)
 				socketid =
@@ -1018,21 +1019,25 @@  l3fwd_poll_resource_setup(void)
 			else
 				socketid = 0;
 
-			printf("txq=%u,%d,%d ", lcore_id, queueid, socketid);
-			fflush(stdout);
-
-			txconf = &dev_info.default_txconf;
-			txconf->offloads = local_port_conf.txmode.offloads;
-			ret = rte_eth_tx_queue_setup(portid, queueid, nb_txd,
-						     socketid, txconf);
-			if (ret < 0)
-				rte_exit(EXIT_FAILURE,
-					"rte_eth_tx_queue_setup: err=%d, "
-					"port=%d\n", ret, portid);
+			for (queue = 0; queue < qconf->n_rx_queue; queue++) {
+				queueid = qconf->rx_queue_list[queue].queue_id;
+				printf("txq=%u,%d,%d ",
+					lcore_id, queueid, socketid);
+				fflush(stdout);
+
+				txconf = &dev_info.default_txconf;
+				txconf->offloads =
+					local_port_conf.txmode.offloads;
+				ret = rte_eth_tx_queue_setup
+					(portid, queueid, nb_txd,
+					 socketid, txconf);
+				if (ret < 0)
+					rte_exit(EXIT_FAILURE,
+						"rte_eth_tx_queue_setup: err=%d, "
+						"port=%d\n", ret, portid);
+			}
 
-			qconf = &lcore_conf[lcore_id];
 			qconf->tx_queue_id[portid] = queueid;
-			queueid++;
 
 			qconf->tx_port_id[qconf->n_tx_port] = portid;
 			qconf->n_tx_port++;