[v4,13/14] vhost: check whether disable software pre-fetch

Message ID 20191009133849.69002-14-yong.liu@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Headers
Series vhost packed ring performance optimization |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Marvin Liu Oct. 9, 2019, 1:38 p.m. UTC
  Disable software pre-fetch actions on Skylake and later platforms.
Hardware can fetch needed data for vhost, additional software pre-fetch
will impact performance.

Signed-off-by: Marvin Liu <yong.liu@intel.com>
  

Comments

Maxime Coquelin Oct. 11, 2019, 2:12 p.m. UTC | #1
On 10/9/19 3:38 PM, Marvin Liu wrote:
> Disable software pre-fetch actions on Skylake and later platforms.
> Hardware can fetch needed data for vhost, additional software pre-fetch
> will impact performance.
> 
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
> 
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index 30839a001..5f3b42e56 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -16,6 +16,12 @@ CFLAGS += -I vhost_user
>  CFLAGS += -fno-strict-aliasing
>  LDLIBS += -lpthread
>  
> +AVX512_SUPPORT=$(shell $(CC) -march=native -dM -E - </dev/null |grep AVX512F)
> +
> +ifneq ($(AVX512_SUPPORT),)
> +CFLAGS += -DDISABLE_SWPREFETCH
> +endif

That's problematic I think, because the machine running the lib may be
different from the machine building it, for example distros.

In this case, a Skylake or later may be used to build the package, but
with passing "-march=haswell". It would end-up prefetching being
disabled whereas we would expect it to be enabled.

I see several solutions:
- Check for CONFIG_RTE_ENABLE_AVX512 flag.
- Keep prefetch instructions (what would be the impact on Skylake and
  later?)
- Remove prefetch instructions (what would be the impact on pre-
  Skylake?)


But really, I think we need some figures before applying such a patch.
What performance gain do you measure with this patch?

>  ifeq ($(RTE_TOOLCHAIN), gcc)
>  ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
>  CFLAGS += -DSUPPORT_GCC_UNROLL_PRAGMA
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index ddf0ee579..5c6f0c0b4 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -15,6 +15,10 @@ elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
>  elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
>  	cflags += '-DSUPPORT_ICC_UNROLL_PRAGMA'
>  endif
> +r = run_command(toolchain, '-march=native', '-dM', '-E', '-', '</dev/null', '|', 'grep AVX512F')
> +if (r.stdout().strip() != '')
> +	cflags += '-DDISABLE_SWPREFETCH'
> +endif
>  dpdk_conf.set('RTE_LIBRTE_VHOST_POSTCOPY',
>  	      cc.has_header('linux/userfaultfd.h'))
>  version = 4
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 56c2080fb..046e497c2 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1075,7 +1075,9 @@ virtio_dev_rx_batch_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  
>  	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
>  	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
> +#ifndef DISABLE_SWPREFETCH
>  		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +#endif
>  		hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)
>  					(uintptr_t)desc_addrs[i];
>  		lens[i] = pkts[i]->pkt_len + dev->vhost_hlen;
> @@ -1144,7 +1146,9 @@ virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t remained = count;
>  
>  	do {
> +#ifndef DISABLE_SWPREFETCH
>  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> +#endif
>  
>  		if (remained >= PACKED_BATCH_SIZE) {
>  			if (!virtio_dev_rx_batch_packed(dev, vq, pkts)) {
> @@ -1790,7 +1794,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  
>  	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
>  	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
> +#ifndef DISABLE_SWPREFETCH
>  		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +#endif
>  		rte_memcpy(rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
>  			   (void *)(uintptr_t)(desc_addrs[i] + buf_offset),
>  			   pkts[i]->pkt_len);
> @@ -2046,7 +2052,9 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t remained = count;
>  
>  	do {
> +#ifndef DISABLE_SWPREFETCH
>  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> +#endif
>  
>  		if (remained >= PACKED_BATCH_SIZE) {
>  			if (!virtio_dev_tx_batch_packed(dev, vq, mbuf_pool,
>
  
Marvin Liu Oct. 14, 2019, 3:08 a.m. UTC | #2
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, October 11, 2019 10:12 PM
> To: Liu, Yong <yong.liu@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>; Wang,
> Zhihong <zhihong.wang@intel.com>; stephen@networkplumber.org;
> gavin.hu@arm.com
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v4 13/14] vhost: check whether disable software pre-
> fetch
> 
> 
> 
> On 10/9/19 3:38 PM, Marvin Liu wrote:
> > Disable software pre-fetch actions on Skylake and later platforms.
> > Hardware can fetch needed data for vhost, additional software pre-fetch
> > will impact performance.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 30839a001..5f3b42e56 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -16,6 +16,12 @@ CFLAGS += -I vhost_user
> >  CFLAGS += -fno-strict-aliasing
> >  LDLIBS += -lpthread
> >
> > +AVX512_SUPPORT=$(shell $(CC) -march=native -dM -E - </dev/null |grep
> AVX512F)
> > +
> > +ifneq ($(AVX512_SUPPORT),)
> > +CFLAGS += -DDISABLE_SWPREFETCH
> > +endif
> 
> That's problematic I think, because the machine running the lib may be
> different from the machine building it, for example distros.
> 
> In this case, a Skylake or later may be used to build the package, but
> with passing "-march=haswell". It would end-up prefetching being
> disabled whereas we would expect it to be enabled.
> 
Thanks, Maxime. Got your idea. Compiling environment and running environment maybe different. 
Performance impact on skylake is around 1% in V1 patch under vhost/virtio loopback scenario.
Since the impact is very small and has no impact in later revised version.
I'd like to remove this patch.

Regards,
Marvin

> I see several solutions:
> - Check for CONFIG_RTE_ENABLE_AVX512 flag.
> - Keep prefetch instructions (what would be the impact on Skylake and
>   later?)
> - Remove prefetch instructions (what would be the impact on pre-
>   Skylake?)
> 
> 
> But really, I think we need some figures before applying such a patch.
> What performance gain do you measure with this patch?
> 
> >  ifeq ($(RTE_TOOLCHAIN), gcc)
> >  ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
> >  CFLAGS += -DSUPPORT_GCC_UNROLL_PRAGMA
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index ddf0ee579..5c6f0c0b4 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -15,6 +15,10 @@ elif (toolchain == 'clang' and
> cc.version().version_compare('>=3.7.0'))
> >  elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
> >  	cflags += '-DSUPPORT_ICC_UNROLL_PRAGMA'
> >  endif
> > +r = run_command(toolchain, '-march=native', '-dM', '-E', '-',
> '</dev/null', '|', 'grep AVX512F')
> > +if (r.stdout().strip() != '')
> > +	cflags += '-DDISABLE_SWPREFETCH'
> > +endif
> >  dpdk_conf.set('RTE_LIBRTE_VHOST_POSTCOPY',
> >  	      cc.has_header('linux/userfaultfd.h'))
> >  version = 4
> > diff --git a/lib/librte_vhost/virtio_net.c
> b/lib/librte_vhost/virtio_net.c
> > index 56c2080fb..046e497c2 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1075,7 +1075,9 @@ virtio_dev_rx_batch_packed(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> >
> >  	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> >  	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
> > +#ifndef DISABLE_SWPREFETCH
> >  		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +#endif
> >  		hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)
> >  					(uintptr_t)desc_addrs[i];
> >  		lens[i] = pkts[i]->pkt_len + dev->vhost_hlen;
> > @@ -1144,7 +1146,9 @@ virtio_dev_rx_packed(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
> >  	uint32_t remained = count;
> >
> >  	do {
> > +#ifndef DISABLE_SWPREFETCH
> >  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> > +#endif
> >
> >  		if (remained >= PACKED_BATCH_SIZE) {
> >  			if (!virtio_dev_rx_batch_packed(dev, vq, pkts)) {
> > @@ -1790,7 +1794,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> >
> >  	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
> >  	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
> > +#ifndef DISABLE_SWPREFETCH
> >  		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +#endif
> >  		rte_memcpy(rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> >  			   (void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> >  			   pkts[i]->pkt_len);
> > @@ -2046,7 +2052,9 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct
> vhost_virtqueue *vq,
> >  	uint32_t remained = count;
> >
> >  	do {
> > +#ifndef DISABLE_SWPREFETCH
> >  		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> > +#endif
> >
> >  		if (remained >= PACKED_BATCH_SIZE) {
> >  			if (!virtio_dev_tx_batch_packed(dev, vq, mbuf_pool,
> >
  

Patch

diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index 30839a001..5f3b42e56 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -16,6 +16,12 @@  CFLAGS += -I vhost_user
 CFLAGS += -fno-strict-aliasing
 LDLIBS += -lpthread
 
+AVX512_SUPPORT=$(shell $(CC) -march=native -dM -E - </dev/null |grep AVX512F)
+
+ifneq ($(AVX512_SUPPORT),)
+CFLAGS += -DDISABLE_SWPREFETCH
+endif
+
 ifeq ($(RTE_TOOLCHAIN), gcc)
 ifeq ($(shell test $(GCC_VERSION) -ge 83 && echo 1), 1)
 CFLAGS += -DSUPPORT_GCC_UNROLL_PRAGMA
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index ddf0ee579..5c6f0c0b4 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -15,6 +15,10 @@  elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
 elif (toolchain == 'icc' and cc.version().version_compare('>=16.0.0'))
 	cflags += '-DSUPPORT_ICC_UNROLL_PRAGMA'
 endif
+r = run_command(toolchain, '-march=native', '-dM', '-E', '-', '</dev/null', '|', 'grep AVX512F')
+if (r.stdout().strip() != '')
+	cflags += '-DDISABLE_SWPREFETCH'
+endif
 dpdk_conf.set('RTE_LIBRTE_VHOST_POSTCOPY',
 	      cc.has_header('linux/userfaultfd.h'))
 version = 4
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 56c2080fb..046e497c2 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1075,7 +1075,9 @@  virtio_dev_rx_batch_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
 	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+#ifndef DISABLE_SWPREFETCH
 		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+#endif
 		hdrs[i] = (struct virtio_net_hdr_mrg_rxbuf *)
 					(uintptr_t)desc_addrs[i];
 		lens[i] = pkts[i]->pkt_len + dev->vhost_hlen;
@@ -1144,7 +1146,9 @@  virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	uint32_t remained = count;
 
 	do {
+#ifndef DISABLE_SWPREFETCH
 		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
+#endif
 
 		if (remained >= PACKED_BATCH_SIZE) {
 			if (!virtio_dev_rx_batch_packed(dev, vq, pkts)) {
@@ -1790,7 +1794,9 @@  virtio_dev_tx_batch_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 	UNROLL_PRAGMA(UNROLL_PRAGMA_PARAM)
 	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
+#ifndef DISABLE_SWPREFETCH
 		rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+#endif
 		rte_memcpy(rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
 			   (void *)(uintptr_t)(desc_addrs[i] + buf_offset),
 			   pkts[i]->pkt_len);
@@ -2046,7 +2052,9 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	uint32_t remained = count;
 
 	do {
+#ifndef DISABLE_SWPREFETCH
 		rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
+#endif
 
 		if (remained >= PACKED_BATCH_SIZE) {
 			if (!virtio_dev_tx_batch_packed(dev, vq, mbuf_pool,