app/testpmd: improve sse based macswap

Message ID 20240713151949.832-1-vipin.varghese@amd.com (mailing list archive)
State New
Delegated to: Thomas Monjalon
Headers
Series app/testpmd: improve sse based macswap |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation fail Compilation issues
ci/intel-Testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/github-robot: build fail github build: failed
ci/iol-abi-testing warning Testing issues
ci/iol-sample-apps-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-unit-amd64-testing fail Testing issues
ci/iol-compile-amd64-testing fail Testing issues
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS

Commit Message

Varghese, Vipin July 13, 2024, 3:19 p.m. UTC
Goal of the patch is to improve SSE macswap on x86_64 by reducing
the stalls in backend engine. Original implementation of the SSE
macswap makes loop call to multiple load, shuffle & store. Using
SIMD ISA interleaving we can reduce the stalls for
 - load SSE token exhaustion
 - Shuffle and Load dependency

Also other changes which improves packet per second are
 - Filling access to MBUF for offload flags which is separate cacheline,
 - using register keyword

Test results:
------------
Platform: AMD EPYC SIENA 8594P @2.3GHz, no boost
DPDK: 24.03

------------------------------------------------
TEST IO 64B: baseline <NIC : MPPs>
 - mellanox CX-7 2*200Gbps : 42.0
 - intel E810 1*100Gbps : 82.0
 - intel E810 2*200Gbps (2CQ-DA2): 83.0
------------------------------------------------
TEST MACSWAP 64B: <NIC : Before : After>
 - mellanox CX-7 2*200Gbps : 31.533 : 31.90
 - intel E810 1*100Gbps : 50.380 : 47.0
 - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
------------------------------------------------
TEST MACSWAP 128B: <NIC : Before: After>
 - mellanox CX-7 2*200Gbps: 30.946 : 31.770
 - intel E810 1*100Gbps: 49.386 : 46.366
 - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
------------------------------------------------
TEST MACSWAP 256B: <NIC: Before: After>
 - mellanox CX-7 2*200Gbps: 32.480 : 33.150
 - intel E810 1 * 100Gbps: 45.29 : 44.571
 - intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
------------------------------------------------

using multiple queues and lcore there is linear increase in MPPs.

Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
 app/test-pmd/macswap_sse.h | 40 ++++++++++++++++++--------------------
 1 file changed, 19 insertions(+), 21 deletions(-)
  

Comments

Bruce Richardson July 15, 2024, 3:07 p.m. UTC | #1
On Sat, Jul 13, 2024 at 08:49:49PM +0530, Vipin Varghese wrote:
> Goal of the patch is to improve SSE macswap on x86_64 by reducing
> the stalls in backend engine. Original implementation of the SSE
> macswap makes loop call to multiple load, shuffle & store. Using
> SIMD ISA interleaving we can reduce the stalls for
>  - load SSE token exhaustion
>  - Shuffle and Load dependency
> 
> Also other changes which improves packet per second are
>  - Filling access to MBUF for offload flags which is separate cacheline,
>  - using register keyword
> 
> Test results:
> ------------
> Platform: AMD EPYC SIENA 8594P @2.3GHz, no boost
> DPDK: 24.03
> 
> ------------------------------------------------
> TEST IO 64B: baseline <NIC : MPPs>
>  - mellanox CX-7 2*200Gbps : 42.0
>  - intel E810 1*100Gbps : 82.0
>  - intel E810 2*200Gbps (2CQ-DA2): 83.0
> ------------------------------------------------
> TEST MACSWAP 64B: <NIC : Before : After>
>  - mellanox CX-7 2*200Gbps : 31.533 : 31.90
>  - intel E810 1*100Gbps : 50.380 : 47.0
>  - intel E810 2*200Gbps (2CQ-DA2): 48.840 : 49.827
> ------------------------------------------------
> TEST MACSWAP 128B: <NIC : Before: After>
>  - mellanox CX-7 2*200Gbps: 30.946 : 31.770
>  - intel E810 1*100Gbps: 49.386 : 46.366
>  - intel E810 2*200Gbps (2CQ-DA2): 47.979 : 49.503
> ------------------------------------------------
> TEST MACSWAP 256B: <NIC: Before: After>
>  - mellanox CX-7 2*200Gbps: 32.480 : 33.150
>  - intel E810 1 * 100Gbps: 45.29 : 44.571
>  - intel E810 2 * 200Gbps (2CQ-DA2): 45.033 : 45.117
> ------------------------------------------------
> 

Hi,

interesting patch. Do you know why we see regressions in some of the cases
above? For 1x100G at 64B and 128B packet sizes we see perf drops of 3mpps
vs smaller gains in the other two cases at each size (much smaller in the
64B case).

Couple of other questions inline below too.

Thanks,
/Bruce

> using multiple queues and lcore there is linear increase in MPPs.
> 
> Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
> ---
>  app/test-pmd/macswap_sse.h | 40 ++++++++++++++++++--------------------
>  1 file changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
> index 223f87a539..a3d3a274e5 100644
> --- a/app/test-pmd/macswap_sse.h
> +++ b/app/test-pmd/macswap_sse.h
> @@ -11,21 +11,21 @@ static inline void
>  do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		struct rte_port *txp)
>  {
> -	struct rte_ether_hdr *eth_hdr[4];
> -	struct rte_mbuf *mb[4];
> +	register struct rte_ether_hdr *eth_hdr[8];
> +	register struct rte_mbuf *mb[8];

Does using "register" actually make a difference to the generated code?

Also, why increasing the array sizes from 4 to 8 - the actual code only
uses 4 elements of each array below anyway? Is it for cache alignment
purposes perhaps - if so, please use explicit cache alignment attributes to
specify this rather than having it implicit in the array sizes.

>  	uint64_t ol_flags;
>  	int i;
>  	int r;
> -	__m128i addr0, addr1, addr2, addr3;
> +	register __m128i addr0, addr1, addr2, addr3;
>  	/**
>  	 * shuffle mask be used to shuffle the 16 bytes.
>  	 * byte 0-5 wills be swapped with byte 6-11.
>  	 * byte 12-15 will keep unchanged.
>  	 */
> -	__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> -					5, 4, 3, 2,
> -					1, 0, 11, 10,
> -					9, 8, 7, 6);
> +	register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
> +							5, 4, 3, 2,
> +							1, 0, 11, 10,
> +							9, 8, 7, 6);
>  
>  	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
>  	vlan_qinq_set(pkts, nb, ol_flags,
> @@ -44,23 +44,24 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  
>  		mb[0] = pkts[i++];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
> -		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> -
>  		mb[1] = pkts[i++];
>  		eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
> -		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> -
> -
>  		mb[2] = pkts[i++];
>  		eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
> -		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> -
>  		mb[3] = pkts[i++];
>  		eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
> -		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
>  
> +		/* Interleave load, shuffle & set */
> +		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		mbuf_field_set(mb[0], ol_flags);
> +		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
> +		mbuf_field_set(mb[1], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
> +		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
> +		mbuf_field_set(mb[2], ol_flags);
>  		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
> +		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
> +		mbuf_field_set(mb[3], ol_flags);
>  		addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
>  		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
>  
> @@ -69,25 +70,22 @@ do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
>  		_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
>  		_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
>  
> -		mbuf_field_set(mb[0], ol_flags);
> -		mbuf_field_set(mb[1], ol_flags);
> -		mbuf_field_set(mb[2], ol_flags);
> -		mbuf_field_set(mb[3], ol_flags);
>  		r -= 4;
>  	}
>  
>  	for ( ; i < nb; i++) {
>  		if (i < nb - 1)
>  			rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], void *));
> +
>  		mb[0] = pkts[i];
>  		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
>  
>  		/* Swap dest and src mac addresses. */
>  		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
> +		/* MBUF and Ethernet are 2 separate cacheline */
> +		mbuf_field_set(mb[0], ol_flags);
>  		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
>  		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
> -
> -		mbuf_field_set(mb[0], ol_flags);
>  	}

Since final loop is only for the odd elements at the end of the array of
buffers. Does changing the instruction ordering here really make perf
difference?
  

Patch

diff --git a/app/test-pmd/macswap_sse.h b/app/test-pmd/macswap_sse.h
index 223f87a539..a3d3a274e5 100644
--- a/app/test-pmd/macswap_sse.h
+++ b/app/test-pmd/macswap_sse.h
@@ -11,21 +11,21 @@  static inline void
 do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
 		struct rte_port *txp)
 {
-	struct rte_ether_hdr *eth_hdr[4];
-	struct rte_mbuf *mb[4];
+	register struct rte_ether_hdr *eth_hdr[8];
+	register struct rte_mbuf *mb[8];
 	uint64_t ol_flags;
 	int i;
 	int r;
-	__m128i addr0, addr1, addr2, addr3;
+	register __m128i addr0, addr1, addr2, addr3;
 	/**
 	 * shuffle mask be used to shuffle the 16 bytes.
 	 * byte 0-5 wills be swapped with byte 6-11.
 	 * byte 12-15 will keep unchanged.
 	 */
-	__m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
-					5, 4, 3, 2,
-					1, 0, 11, 10,
-					9, 8, 7, 6);
+	register const __m128i shfl_msk = _mm_set_epi8(15, 14, 13, 12,
+							5, 4, 3, 2,
+							1, 0, 11, 10,
+							9, 8, 7, 6);
 
 	ol_flags = ol_flags_init(txp->dev_conf.txmode.offloads);
 	vlan_qinq_set(pkts, nb, ol_flags,
@@ -44,23 +44,24 @@  do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
 
 		mb[0] = pkts[i++];
 		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
-		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
-
 		mb[1] = pkts[i++];
 		eth_hdr[1] = rte_pktmbuf_mtod(mb[1], struct rte_ether_hdr *);
-		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
-
-
 		mb[2] = pkts[i++];
 		eth_hdr[2] = rte_pktmbuf_mtod(mb[2], struct rte_ether_hdr *);
-		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
-
 		mb[3] = pkts[i++];
 		eth_hdr[3] = rte_pktmbuf_mtod(mb[3], struct rte_ether_hdr *);
-		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
 
+		/* Interleave load, shuffle & set */
+		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+		mbuf_field_set(mb[0], ol_flags);
+		addr1 = _mm_loadu_si128((__m128i *)eth_hdr[1]);
+		mbuf_field_set(mb[1], ol_flags);
 		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
+		addr2 = _mm_loadu_si128((__m128i *)eth_hdr[2]);
+		mbuf_field_set(mb[2], ol_flags);
 		addr1 = _mm_shuffle_epi8(addr1, shfl_msk);
+		addr3 = _mm_loadu_si128((__m128i *)eth_hdr[3]);
+		mbuf_field_set(mb[3], ol_flags);
 		addr2 = _mm_shuffle_epi8(addr2, shfl_msk);
 		addr3 = _mm_shuffle_epi8(addr3, shfl_msk);
 
@@ -69,25 +70,22 @@  do_macswap(struct rte_mbuf *pkts[], uint16_t nb,
 		_mm_storeu_si128((__m128i *)eth_hdr[2], addr2);
 		_mm_storeu_si128((__m128i *)eth_hdr[3], addr3);
 
-		mbuf_field_set(mb[0], ol_flags);
-		mbuf_field_set(mb[1], ol_flags);
-		mbuf_field_set(mb[2], ol_flags);
-		mbuf_field_set(mb[3], ol_flags);
 		r -= 4;
 	}
 
 	for ( ; i < nb; i++) {
 		if (i < nb - 1)
 			rte_prefetch0(rte_pktmbuf_mtod(pkts[i+1], void *));
+
 		mb[0] = pkts[i];
 		eth_hdr[0] = rte_pktmbuf_mtod(mb[0], struct rte_ether_hdr *);
 
 		/* Swap dest and src mac addresses. */
 		addr0 = _mm_loadu_si128((__m128i *)eth_hdr[0]);
+		/* MBUF and Ethernet are 2 separate cacheline */
+		mbuf_field_set(mb[0], ol_flags);
 		addr0 = _mm_shuffle_epi8(addr0, shfl_msk);
 		_mm_storeu_si128((__m128i *)eth_hdr[0], addr0);
-
-		mbuf_field_set(mb[0], ol_flags);
 	}
 }