[dpdk-dev,v5] examples/l3fwd: em path performance fix
Commit Message
It seems that for the most use cases, previous hash_multi_lookup provides
better performance, and more, sequential lookup can cause significant
performance drop.
This patch sets previously optional hash_multi_lookup method as default.
It also provides some minor optimizations such as queue drain only on used
tx ports.
This patch should be applied after Maciej Czekaj's patch "l3fwd: Fix
compilation with HASH_MULTI_LOOKUP"
v5 changes:
- removed debug informations, patch cleanup
v4 changes:
- rebased to be applicable after patch "l3fwd: Fix compilation with
HASH_MULTI_LOOKUP" of Maciej Czekaj
v3 changes:
- "lpm: extend IPv4 next hop field" patch extends dst_port table from
uint16_t to uint32_t omiting previously disabled l3fwd_em_hlm_sse.h,
what causes incompatible pointer type error after turning on this header
v2 changes:
- fixed copy-paste error causing that not all packets are classified right
in hash_multi_lookup implementation when burst size is not divisible
by 8
Fixes: 94c54b4158d5 ("examples/l3fwd: rework exact-match")
Fixes: dc81ebbacaeb ("lpm: extend IPv4 next hop field")
Fixes: 64d3955de1de ("examples/l3fwd: fix ARM build")
Reported-by: Qian Xu <qian.q.xu@intel.com>
Signed-off-by: Tomasz Kulasek <tomaszx.kulasek@intel.com>
---
examples/l3fwd/l3fwd.h | 6 ++++++
examples/l3fwd/l3fwd_em.c | 6 +++---
examples/l3fwd/l3fwd_em_hlm_sse.h | 28 ++++++++++------------------
examples/l3fwd/l3fwd_em_sse.h | 9 +++++++++
examples/l3fwd/l3fwd_lpm.c | 4 ++--
examples/l3fwd/main.c | 7 +++++++
6 files changed, 37 insertions(+), 23 deletions(-)
Comments
2016-03-18 10:52, Tomasz Kulasek:
> +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
Any ARM maintainer to confirm?
Note that there is already another occurence of this compiler flag:
examples/l3fwd/l3fwd_em.c:#elif defined(__ARM_NEON)
On Fri, Mar 18, 2016 at 11:04:49AM +0100, Thomas Monjalon wrote:
> 2016-03-18 10:52, Tomasz Kulasek:
> > +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
>
> I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
> Any ARM maintainer to confirm?
__ARM_NEON should work existing GCC, but it is better to use
RTE_MACHINE_CPUFLAG_NEON as
-it has been generated by probing the compiler capabilities.
-it's future-proof solution to support clang or other gcc versions in
future
➜ [master]laptop [dpdk-master] $ aarch64-thunderx-linux-gnu-gcc -dM -E - < /dev/null | grep NEON
#define __ARM_NEON_FP 12
#define __ARM_NEON 1
make output
aarch64-thunderx-linux-gnu-gcc -Wp,-MD,./.eal.o.d.tmp -pthread
-march=armv8-a+crc -mcpu=thunderx -DRTE_MACHINE_CPUFLAG_NEON
-DRTE_MACHINE_CPUFLAG_CRC32
-DRTE_COMPILE_TIME_CPUFLAGS=RTE_CPUFLAG_NEON,RTE_CPUFLAG_CRC32
-I/export/dpdk-master/build/include -include
/export/dpdk-master/build/include/rte_config.h
-I/export/dpdk-master/lib/librte_eal/linuxapp/eal/include
-I/export/dpdk-master/lib/librte_eal/common
-I/export/dpdk-master/lib/librte_eal/common/include
-I/export/dpdk-master/lib/librte_ring
-I/export/dpdk-master/lib/librte_mempool
-I/export/dpdk-master/lib/librte_ivshmem -W -Wall -Wstrict-prototypes
-Wmissing-prototypes -Wmissing-declarations -Wold-style-definition
-Wpointer-arith -Wcast-align -Wnested-externs -Wcast-qual
-Wformat-nonliteral -Wformat-security -Wundef -Wwrite-strings -Werror
-O3 -D_GNU_SOURCE -o eal.o -c
/export/dpdk-master/lib/librte_eal/linuxapp/eal/eal.c
Jerin
>
> Note that there is already another occurence of this compiler flag:
> examples/l3fwd/l3fwd_em.c:#elif defined(__ARM_NEON)
2016-03-18 16:22, Jerin Jacob:
> On Fri, Mar 18, 2016 at 11:04:49AM +0100, Thomas Monjalon wrote:
> > 2016-03-18 10:52, Tomasz Kulasek:
> > > +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
> >
> > I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
> > Any ARM maintainer to confirm?
>
> __ARM_NEON should work existing GCC, but it is better to use
> RTE_MACHINE_CPUFLAG_NEON as
> -it has been generated by probing the compiler capabilities.
> -it's future-proof solution to support clang or other gcc versions in
> future
I agree to use RTE_MACHINE_CPUFLAG_NEON.
I just don't understand why CONFIG_RTE_ARCH_ARM_NEON has been introduced.
It seems to be used to disable NEON on ARMv7:
ifeq ($(CONFIG_RTE_ARCH_ARM_NEON),y)
MACHINE_CFLAGS += -mfpu=neon
endif
Hello Thomas, Jerin, Tomasz, all...
On Fri, 18 Mar 2016 12:00:24 +0100
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> 2016-03-18 16:22, Jerin Jacob:
> > On Fri, Mar 18, 2016 at 11:04:49AM +0100, Thomas Monjalon wrote:
> > > 2016-03-18 10:52, Tomasz Kulasek:
> > > > +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
> > >
> > > I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
> > > Any ARM maintainer to confirm?
> >
> > __ARM_NEON should work existing GCC, but it is better to use
> > RTE_MACHINE_CPUFLAG_NEON as
> > -it has been generated by probing the compiler capabilities.
> > -it's future-proof solution to support clang or other gcc versions in
> > future
>
> I agree to use RTE_MACHINE_CPUFLAG_NEON.
>
> I just don't understand why CONFIG_RTE_ARCH_ARM_NEON has been introduced.
> It seems to be used to disable NEON on ARMv7:
This is true. You should be able to disable the NEON-specific code if it
is unwanted. Eg., the memcpy operations are not always faster with NEON.
But...
$ git grep ARM_NEON
...
lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:45:#ifdef __ARM_NEON_FP
lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:328:#endif /* __ARM_NEON_FP */
...
From this point of view, this is wrong and should be fixed to check
a different constant.
> ifeq ($(CONFIG_RTE_ARCH_ARM_NEON),y)
> MACHINE_CFLAGS += -mfpu=neon
> endif
However, there is another possible way of understanding these options.
We can (well, unlikely and I am about to say 'never') have an ARM
processor without NEON. This cannot be detected by gcc as it does not
know the target processor... So from my point of view:
* CONFIG_RTE_ARCH_ARM_NEON says "my CPU does (not) support NEON" or "I
want to enable/disable NEON" while
* RTE_MACHINE_CPUFLAG_NEON says, the _compiler_ supports NEON
I'll send a patch trying to solve this.
Regards
Jan
> -----Original Message-----
> From: Jan Viktorin [mailto:viktorin@rehivetech.com]
> Sent: Friday, March 18, 2016 12:57
> To: Thomas Monjalon <thomas.monjalon@6wind.com>
> Cc: Jerin Jacob <jerin.jacob@caviumnetworks.com>; Kulasek, TomaszX
> <tomaszx.kulasek@intel.com>; dev@dpdk.org; jianbo.liu@linaro.org
> Subject: Re: [dpdk-dev] [PATCH v5] examples/l3fwd: em path performance fix
>
> Hello Thomas, Jerin, Tomasz, all...
>
> On Fri, 18 Mar 2016 12:00:24 +0100
> Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
>
> > 2016-03-18 16:22, Jerin Jacob:
> > > On Fri, Mar 18, 2016 at 11:04:49AM +0100, Thomas Monjalon wrote:
> > > > 2016-03-18 10:52, Tomasz Kulasek:
> > > > > +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
> > > >
> > > > I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
> > > > Any ARM maintainer to confirm?
> > >
> > > __ARM_NEON should work existing GCC, but it is better to use
> > > RTE_MACHINE_CPUFLAG_NEON as -it has been generated by probing the
> > > compiler capabilities.
> > > -it's future-proof solution to support clang or other gcc versions
> > > in future
> >
> > I agree to use RTE_MACHINE_CPUFLAG_NEON.
> >
> > I just don't understand why CONFIG_RTE_ARCH_ARM_NEON has been
> introduced.
> > It seems to be used to disable NEON on ARMv7:
>
> This is true. You should be able to disable the NEON-specific code if it
> is unwanted. Eg., the memcpy operations are not always faster with NEON.
> But...
>
> $ git grep ARM_NEON
> ...
> lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:45:#ifdef
> __ARM_NEON_FP
> lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:328:#endif /*
> __ARM_NEON_FP */ ...
>
> From this point of view, this is wrong and should be fixed to check a
> different constant.
>
> > ifeq ($(CONFIG_RTE_ARCH_ARM_NEON),y)
> > MACHINE_CFLAGS += -mfpu=neon
> > endif
>
> However, there is another possible way of understanding these options.
> We can (well, unlikely and I am about to say 'never') have an ARM
> processor without NEON. This cannot be detected by gcc as it does not know
> the target processor... So from my point of view:
>
> * CONFIG_RTE_ARCH_ARM_NEON says "my CPU does (not) support NEON" or "I
> want to enable/disable NEON" while
> * RTE_MACHINE_CPUFLAG_NEON says, the _compiler_ supports NEON
>
> I'll send a patch trying to solve this.
>
> Regards
> Jan
Hi
As I understand with your last patch it's safe and preferred to use RTE_MACHINE_CPUFLAG_NEON for ARM Neon detection? If so, I can include this modification for whole l3fwd in v6 of this patch.
Tomasz.
On Fri, 18 Mar 2016 12:45:03 +0000
"Kulasek, TomaszX" <tomaszx.kulasek@intel.com> wrote:
> > -----Original Message-----
> > From: Jan Viktorin [mailto:viktorin@rehivetech.com]
> > Sent: Friday, March 18, 2016 12:57
> > To: Thomas Monjalon <thomas.monjalon@6wind.com>
> > Cc: Jerin Jacob <jerin.jacob@caviumnetworks.com>; Kulasek, TomaszX
> > <tomaszx.kulasek@intel.com>; dev@dpdk.org; jianbo.liu@linaro.org
> > Subject: Re: [dpdk-dev] [PATCH v5] examples/l3fwd: em path performance fix
> >
> > Hello Thomas, Jerin, Tomasz, all...
> >
> > On Fri, 18 Mar 2016 12:00:24 +0100
> > Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> >
> > > 2016-03-18 16:22, Jerin Jacob:
> > > > On Fri, Mar 18, 2016 at 11:04:49AM +0100, Thomas Monjalon wrote:
> > > > > 2016-03-18 10:52, Tomasz Kulasek:
> > > > > > +#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
> > > > >
> > > > > I think we should use CONFIG_RTE_ARCH_ARM_NEON here.
> > > > > Any ARM maintainer to confirm?
> > > >
> > > > __ARM_NEON should work existing GCC, but it is better to use
> > > > RTE_MACHINE_CPUFLAG_NEON as -it has been generated by probing the
> > > > compiler capabilities.
> > > > -it's future-proof solution to support clang or other gcc versions
> > > > in future
> > >
> > > I agree to use RTE_MACHINE_CPUFLAG_NEON.
> > >
> > > I just don't understand why CONFIG_RTE_ARCH_ARM_NEON has been
> > introduced.
> > > It seems to be used to disable NEON on ARMv7:
> >
> > This is true. You should be able to disable the NEON-specific code if it
> > is unwanted. Eg., the memcpy operations are not always faster with NEON.
> > But...
> >
> > $ git grep ARM_NEON
> > ...
> > lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:45:#ifdef
> > __ARM_NEON_FP
> > lib/librte_eal/common/include/arch/arm/rte_memcpy_32.h:328:#endif /*
> > __ARM_NEON_FP */ ...
> >
> > From this point of view, this is wrong and should be fixed to check a
> > different constant.
> >
> > > ifeq ($(CONFIG_RTE_ARCH_ARM_NEON),y)
> > > MACHINE_CFLAGS += -mfpu=neon
> > > endif
> >
> > However, there is another possible way of understanding these options.
> > We can (well, unlikely and I am about to say 'never') have an ARM
> > processor without NEON. This cannot be detected by gcc as it does not know
> > the target processor... So from my point of view:
> >
> > * CONFIG_RTE_ARCH_ARM_NEON says "my CPU does (not) support NEON" or "I
> > want to enable/disable NEON" while
> > * RTE_MACHINE_CPUFLAG_NEON says, the _compiler_ supports NEON
> >
> > I'll send a patch trying to solve this.
> >
> > Regards
> > Jan
>
> Hi
>
> As I understand with your last patch it's safe and preferred to use RTE_MACHINE_CPUFLAG_NEON for ARM Neon detection? If so, I can include this modification for whole l3fwd in v6 of this patch.
Yes, I'd prefer this approach as well.
Jan
>
> Tomasz.
@@ -40,6 +40,10 @@
#define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1
+#if !defined(NO_HASH_MULTI_LOOKUP) && defined(__ARM_NEON)
+#define NO_HASH_MULTI_LOOKUP 1
+#endif
+
#define MAX_PKT_BURST 32
#define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */
@@ -86,6 +90,8 @@ struct lcore_rx_queue {
struct lcore_conf {
uint16_t n_rx_queue;
struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
+ uint16_t n_tx_port;
+ uint16_t tx_port_id[RTE_MAX_ETHPORTS];
uint16_t tx_queue_id[RTE_MAX_ETHPORTS];
struct mbuf_table tx_mbufs[RTE_MAX_ETHPORTS];
void *ipv4_lookup_struct;
@@ -320,7 +320,7 @@ em_get_ipv6_dst_port(void *ipv6_hdr, uint8_t portid, void *lookup_struct)
* buffer optimization i.e. ENABLE_MULTI_BUFFER_OPTIMIZE=1.
*/
#if defined(__SSE4_1__)
-#ifndef HASH_MULTI_LOOKUP
+#if defined(NO_HASH_MULTI_LOOKUP)
#include "l3fwd_em_sse.h"
#else
#include "l3fwd_em_hlm_sse.h"
@@ -568,8 +568,8 @@ em_main_loop(__attribute__((unused)) void *dummy)
diff_tsc = cur_tsc - prev_tsc;
if (unlikely(diff_tsc > drain_tsc)) {
- for (i = 0; i < qconf->n_rx_queue; i++) {
- portid = qconf->rx_queue_list[i].port_id;
+ for (i = 0; i < qconf->n_tx_port; ++i) {
+ portid = qconf->tx_port_id[i];
if (qconf->tx_mbufs[portid].len == 0)
continue;
send_burst(qconf,
@@ -34,17 +34,9 @@
#ifndef __L3FWD_EM_HLM_SSE_H__
#define __L3FWD_EM_HLM_SSE_H__
-/**
- * @file
- * This is an optional implementation of packet classification in Exact-Match
- * path using rte_hash_lookup_multi method from previous implementation.
- * While sequential classification seems to be faster, it's disabled by default
- * and can be enabled with HASH_LOOKUP_MULTI global define in compilation time.
- */
-
#include "l3fwd_sse.h"
-static inline void
+static inline __attribute__((always_inline)) void
em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf *m[8],
uint8_t portid, uint32_t dst_port[8])
{
@@ -168,7 +160,7 @@ get_ipv6_5tuple(struct rte_mbuf *m0, __m128i mask0,
key->xmm[2] = _mm_and_si128(tmpdata2, mask1);
}
-static inline void
+static inline __attribute__((always_inline)) void
em_get_dst_port_ipv6x8(struct lcore_conf *qconf, struct rte_mbuf *m[8],
uint8_t portid, uint32_t dst_port[8])
{
@@ -322,17 +314,17 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
} else {
dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+1] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+2] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+3] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+4] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+5] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+6] = em_get_dst_port(qconf, pkts_burst[j], portid);
- dst_port[j+7] = em_get_dst_port(qconf, pkts_burst[j], portid);
+ dst_port[j+1] = em_get_dst_port(qconf, pkts_burst[j+1], portid);
+ dst_port[j+2] = em_get_dst_port(qconf, pkts_burst[j+2], portid);
+ dst_port[j+3] = em_get_dst_port(qconf, pkts_burst[j+3], portid);
+ dst_port[j+4] = em_get_dst_port(qconf, pkts_burst[j+4], portid);
+ dst_port[j+5] = em_get_dst_port(qconf, pkts_burst[j+5], portid);
+ dst_port[j+6] = em_get_dst_port(qconf, pkts_burst[j+6], portid);
+ dst_port[j+7] = em_get_dst_port(qconf, pkts_burst[j+7], portid);
}
}
- for (; j < n; j++)
+ for (; j < nb_rx; j++)
dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
@@ -34,6 +34,15 @@
#ifndef __L3FWD_EM_SSE_H__
#define __L3FWD_EM_SSE_H__
+/**
+ * @file
+ * This is an optional implementation of packet classification in Exact-Match
+ * path using sequential packet classification method.
+ * While hash lookup multi seems to provide better performance, it's disabled
+ * by default and can be enabled with NO_HASH_LOOKUP_MULTI global define in
+ * compilation time.
+ */
+
#include "l3fwd_sse.h"
static inline __attribute__((always_inline)) uint16_t
@@ -159,8 +159,8 @@ lpm_main_loop(__attribute__((unused)) void *dummy)
diff_tsc = cur_tsc - prev_tsc;
if (unlikely(diff_tsc > drain_tsc)) {
- for (i = 0; i < qconf->n_rx_queue; i++) {
- portid = qconf->rx_queue_list[i].port_id;
+ for (i = 0; i < qconf->n_tx_port; ++i) {
+ portid = qconf->tx_port_id[i];
if (qconf->tx_mbufs[portid].len == 0)
continue;
send_burst(qconf,
@@ -791,6 +791,7 @@ main(int argc, char **argv)
unsigned lcore_id;
uint32_t n_tx_queue, nb_lcores;
uint8_t portid, nb_rx_queue, queue, socketid;
+ uint8_t nb_tx_port;
/* init EAL */
ret = rte_eal_init(argc, argv);
@@ -830,6 +831,7 @@ main(int argc, char **argv)
rte_exit(EXIT_FAILURE, "check_port_config failed\n");
nb_lcores = rte_lcore_count();
+ nb_tx_port = 0;
/* Setup function pointers for lookup method. */
setup_l3fwd_lookup_tables();
@@ -906,8 +908,13 @@ main(int argc, char **argv)
qconf = &lcore_conf[lcore_id];
qconf->tx_queue_id[portid] = queueid;
queueid++;
+
+ qconf->n_tx_port = nb_tx_port;
+ qconf->tx_port_id[qconf->n_tx_port] = portid;
}
printf("\n");
+
+ nb_tx_port++;
}
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {