[dpdk-dev,2/3] hash: add vectorized comparison

Message ID 1472247287-167011-3-git-send-email-pablo.de.lara.guarch@intel.com (mailing list archive)
State Superseded, archived

Commit Message

De Lara Guarch, Pablo Aug. 26, 2016, 9:34 p.m. UTC
  From: Byron Marohn <byron.marohn@intel.com>

In lookup bulk function, the signatures of all entries
are compared against the signature of the key that is being looked up.
Now that all the signatures are together, they can be compared
with vector instructions (SSE, AVX2), achieving higher lookup performance.

Also, entries per bucket are increased to 8 when using processors
with AVX2, as 256 bits can be compared at once, which is the size of
8x32-bit signatures.

Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Saikrishna Edupuganti <saikrishna.edupuganti@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 41 ++++++++++++++++++++++++++++++++++-----
 lib/librte_hash/rte_cuckoo_hash.h |  4 ++++
 2 files changed, 40 insertions(+), 5 deletions(-)
  

Comments

Thomas Monjalon Aug. 27, 2016, 8:57 a.m. UTC | #1
2016-08-26 22:34, Pablo de Lara:
> From: Byron Marohn <byron.marohn@intel.com>
> 
> In lookup bulk function, the signatures of all entries
> are compared against the signature of the key that is being looked up.
> Now that all the signatures are together, they can be compared
> with vector instructions (SSE, AVX2), achieving higher lookup performance.
> 
> Also, entries per bucket are increased to 8 when using processors
> with AVX2, as 256 bits can be compared at once, which is the size of
> 8x32-bit signatures.

Please, would it be possible to use the generic SIMD intrinsics?
We could define generic types compatible with Altivec and NEON:
	__attribute__ ((vector_size (n)))
as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
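For illustration, a comparison along those lines could look like this with GCC's generic vector extensions; the type and function names here are hypothetical sketches, not code from the patch:

```c
#include <stdint.h>

/* Hypothetical sketch of the generic-vector approach: four 32-bit
 * signatures compared per step, portable to SSE/NEON/Altivec backends.
 * Names are illustrative only. */
typedef uint32_t sig_vec __attribute__((vector_size(16)));
typedef int32_t  cmp_vec __attribute__((vector_size(16)));

static inline unsigned
generic_sig_compare(const uint32_t sigs[4], uint32_t hash)
{
	sig_vec bucket = { sigs[0], sigs[1], sigs[2], sigs[3] };
	sig_vec key = { hash, hash, hash, hash };
	/* vector == yields an all-ones lane on match, all-zeros otherwise */
	cmp_vec cmp = (bucket == key);
	unsigned i, matches = 0;

	/* no generic movemask builtin: collapse the lanes by hand */
	for (i = 0; i < 4; i++)
		matches |= (unsigned)(cmp[i] & 1) << i;
	return matches;
}
```

Note that the lane-collapsing loop at the end is exactly the part a movemask instruction does in one step.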

> +/* 8 entries per bucket */
> +#if defined(__AVX2__)

Please prefer
	#ifdef RTE_MACHINE_CPUFLAG_AVX2
Ideally the vector support could be checked at runtime:
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
It would allow packaging one binary using the best optimization available.
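As a self-contained illustration of that runtime check (GCC's __builtin_cpu_supports() stands in here for the DPDK rte_cpu_get_flag_enabled() API, so the sketch compiles outside EAL):

```c
/* Sketch of runtime selection: all variants are built into the binary
 * and the widest one the CPU supports is chosen at startup.
 * __builtin_cpu_supports() is the plain-GCC stand-in used here. */
static const char *
pick_sig_cmp_impl(void)
{
	if (__builtin_cpu_supports("avx2"))
		return "avx2";	/* 8 signatures per compare */
	if (__builtin_cpu_supports("sse2"))
		return "sse2";	/* 4 signatures per compare */
	return "scalar";
}
```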

> +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> +			_mm256_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> +			_mm256_set1_epi32(sec_hash)));
> +/* 4 entries per bucket */
> +#elif defined(__SSE2__)
> +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
> +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> +			_mm_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
> +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> +			_mm_set1_epi32(sec_hash)));

In order to allow such switch based on register size, we could have an
abstraction in EAL supporting 128/256/512 width for x86/ARM/POWER.
I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be enough.
  
De Lara Guarch, Pablo Sept. 2, 2016, 5:05 p.m. UTC | #2
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Saturday, August 27, 2016 1:58 AM
> To: De Lara Guarch, Pablo; Marohn, Byron
> Cc: dev@dpdk.org; Richardson, Bruce; Edupuganti, Saikrishna;
> jianbo.liu@linaro.org; chaozhu@linux.vnet.ibm.com;
> jerin.jacob@caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH 2/3] hash: add vectorized comparison
> 
> 2016-08-26 22:34, Pablo de Lara:
> > From: Byron Marohn <byron.marohn@intel.com>
> >
> > In lookup bulk function, the signatures of all entries
> > are compared against the signature of the key that is being looked up.
> > Now that all the signatures are together, they can be compared
> > with vector instructions (SSE, AVX2), achieving higher lookup performance.
> >
> > Also, entries per bucket are increased to 8 when using processors
> > with AVX2, as 256 bits can be compared at once, which is the size of
> > 8x32-bit signatures.
> 
> Please, would it be possible to use the generic SIMD intrinsics?
> We could define generic types compatible with Altivec and NEON:
> 	__attribute__ ((vector_size (n)))
> as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
> 

I tried to convert these into generic code with gcc builtins,
but I couldn't find a way to translate the _mm_movemask_ps intrinsic into a generic builtin
(and that one is critical for performance).
Therefore, I think it is not possible to do this without penalizing performance.
Sure, we could try to translate the other intrinsics, but we would still need
#ifdefs and would end up with a mix of x86 intrinsics and gcc builtins,
so it is better to leave it this way.
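To make the movemask point concrete: the instruction collapses the per-lane comparison results into an integer bitmask in a single step, while a generic-builtin version has to loop over the lanes. A rough sketch of the scalar emulation (illustrative, not code from the patch):

```c
#include <stdint.h>

typedef int32_t v4si __attribute__((vector_size(16)));

/* What _mm_movemask_ps does in one instruction: collect the sign bit
 * of each 32-bit lane into a 4-bit integer mask. Emulating it without
 * the intrinsic needs a per-lane loop, which is the cost in question. */
static inline unsigned
movemask_emulated(v4si v)
{
	unsigned i, mask = 0;

	for (i = 0; i < 4; i++)
		mask |= ((uint32_t)v[i] >> 31) << i;
	return mask;
}
```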

> > +/* 8 entries per bucket */
> > +#if defined(__AVX2__)
> 
> Please prefer
> 	#ifdef RTE_MACHINE_CPUFLAG_AVX2
> Ideally the vector support could be checked at runtime:
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> It would allow packaging one binary using the best optimization available.
> 

Good idea. Will submit a v2 with this change. It took me a bit of time to figure out
a way to do this without paying a big performance penalty.
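One common way to keep that cost out of the hot path is to resolve the CPU-flag check once, at table creation, into a stored comparison function; a hedged sketch of that pattern (the names here are illustrative, not the v2 code):

```c
#include <stdint.h>

/* Illustrative sketch: select the signature-comparison routine once,
 * when the table is created, so bulk lookup pays only an indirect call
 * instead of a CPU-flag check per bucket. */
typedef unsigned (*sig_cmp_fn)(const uint32_t *sigs, uint32_t hash);

static unsigned
sig_cmp_scalar(const uint32_t *sigs, uint32_t hash)
{
	unsigned i, matches = 0;

	for (i = 0; i < 4; i++)
		matches |= (sigs[i] == hash) << i;
	return matches;
}

struct toy_hash {
	sig_cmp_fn sig_cmp;	/* chosen once at creation time */
};

static void
toy_hash_init(struct toy_hash *h)
{
	/* Real code would test rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)
	 * and assign an AVX2 variant; the scalar routine stands in here. */
	h->sig_cmp = sig_cmp_scalar;
}
```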

> > +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> > +			_mm256_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> > +			_mm256_set1_epi32(sec_hash)));
> > +/* 4 entries per bucket */
> > +#elif defined(__SSE2__)
> > +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
> > +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> > +			_mm_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
> > +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> > +			_mm_set1_epi32(sec_hash)));
> 
> In order to allow such switch based on register size, we could have an
> abstraction in EAL supporting 128/256/512 width for x86/ARM/POWER.
> I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be
> enough.
  

Patch

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 9d507b6..98713d3 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -939,6 +939,38 @@  lookup_stage1(unsigned idx, hash_sig_t *prim_hash, hash_sig_t *sec_hash,
 	rte_prefetch0(*secondary_bkt);
 }
 
+static inline void
+compare_signatures(unsigned *prim_hash_matches, unsigned *sec_hash_matches,
+				const struct rte_hash_bucket *prim_bkt,
+				const struct rte_hash_bucket *sec_bkt,
+				hash_sig_t prim_hash, hash_sig_t sec_hash)
+{
+/* 8 entries per bucket */
+#if defined(__AVX2__)
+	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
+			_mm256_set1_epi32(prim_hash)));
+	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
+			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
+			_mm256_set1_epi32(sec_hash)));
+/* 4 entries per bucket */
+#elif defined(__SSE2__)
+	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
+			_mm_set1_epi32(prim_hash)));
+	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi32(
+			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
+			_mm_set1_epi32(sec_hash)));
+#else
+	unsigned i;
+
+	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+		*prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
+		*sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
+	}
+#endif
+}
+
 /*
  * Lookup bulk stage 2:  Search for match hashes in primary/secondary locations
  * and prefetch first key slot
@@ -951,15 +983,14 @@  lookup_stage2(unsigned idx, hash_sig_t prim_hash, hash_sig_t sec_hash,
 		uint64_t *extra_hits_mask, const void *keys,
 		const struct rte_hash *h)
 {
-	unsigned prim_hash_matches, sec_hash_matches, key_idx, i;
+	unsigned prim_hash_matches, sec_hash_matches, key_idx;
 	unsigned total_hash_matches;
 
 	prim_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
 	sec_hash_matches = 1 << RTE_HASH_BUCKET_ENTRIES;
-	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		prim_hash_matches |= ((prim_hash == prim_bkt->sig_current[i]) << i);
-		sec_hash_matches |= ((sec_hash == sec_bkt->sig_current[i]) << i);
-	}
+
+	compare_signatures(&prim_hash_matches, &sec_hash_matches, prim_bkt,
+						sec_bkt, prim_hash, sec_hash);
 
 	key_idx = prim_bkt->key_idx[__builtin_ctzl(prim_hash_matches)];
 	if (key_idx == 0)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index fe0654f..eb57d7e 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -130,7 +130,11 @@  enum add_key_case {
 };
 
 /** Number of items per bucket. */
+#if defined(__AVX2__)
+#define RTE_HASH_BUCKET_ENTRIES		8
+#else
 #define RTE_HASH_BUCKET_ENTRIES		4
+#endif
 
 #define NULL_SIGNATURE			0