[dpdk-dev,0/4] eal/common: introduce rte_memset and related test

Message ID f7c9c6c2-b0bb-a7df-cca1-abe93c5089fb@redhat.com (mailing list archive)
State Not Applicable, archived
Headers

Commit Message

Maxime Coquelin Dec. 2, 2016, 10 a.m. UTC
  Hi Zhiyong,

On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
> DPDK code has met performance drop badly in some case when calling glibc
> function memset. Reference to discussions about memset in
> http://dpdk.org/ml/archives/dev/2016-October/048628.html
> It is necessary to introduce more high efficient function to fix it.
> One important thing about rte_memset is that we can get clear control
> on what instruction flow is used.
>
> This patchset introduces rte_memset to bring more high efficient
> implementation, and will bring obvious perf improvement, especially
> for small N bytes in the most application scenarios.
>
> Patch 1 implements rte_memset in the file rte_memset.h on IA platform
> The file supports three types of instruction sets including sse & avx
> (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of
> vectorization and inline function to improve the perf on IA. In addition,
> cache line and memory alignment are fully taken into consideration.
>
> Patch 2 implements functional autotest to validates the function whether
> to work in a right way.
>
> Patch 3 implements performance autotest separately in cache and memory.
>
> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring 3%~4%
> performance improvements on IA platform from virtio/vhost non-mergeable
> loopback testing.
>
> Zhiyong Yang (4):
>   eal/common: introduce rte_memset on IA platform
>   app/test: add functional autotest for rte_memset
>   app/test: add performance autotest for rte_memset
>   lib/librte_vhost: improve vhost perf using rte_memset
>
>  app/test/Makefile                                  |   3 +
>  app/test/test_memset.c                             | 158 +++++++++
>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>  doc/guides/rel_notes/release_17_02.rst             |  11 +
>  .../common/include/arch/x86/rte_memset.h           | 376 +++++++++++++++++++++
>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
>  lib/librte_vhost/virtio_net.c                      |  18 +-
>  7 files changed, 958 insertions(+), 7 deletions(-)
>  create mode 100644 app/test/test_memset.c
>  create mode 100644 app/test/test_memset_perf.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memset.h
>  create mode 100644 lib/librte_eal/common/include/generic/rte_memset.h
>

Thanks for the series, idea looks good to me.

Wouldn't be worth to also use rte_memset in Virtio PMD (not
compiled/tested)? :

                 /* setup tx ring slot to point to indirect
                  * descriptor list stored in reserved region.

Cheers,
Maxime
  

Comments

Yang, Zhiyong Dec. 6, 2016, 6:33 a.m. UTC | #1
Hi, Maxime:

> -----Original Message-----

> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]

> Sent: Friday, December 2, 2016 6:01 PM

> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org

> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce

> <bruce.richardson@intel.com>; Ananyev, Konstantin

> <konstantin.ananyev@intel.com>

> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset

> and related test

> 

> Hi Zhiyong,

> 

> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:

> > DPDK code has met performance drop badly in some case when calling

> > glibc function memset. Reference to discussions about memset in

> > http://dpdk.org/ml/archives/dev/2016-October/048628.html

> > It is necessary to introduce more high efficient function to fix it.

> > One important thing about rte_memset is that we can get clear control

> > on what instruction flow is used.

> >

> > This patchset introduces rte_memset to bring more high efficient

> > implementation, and will bring obvious perf improvement, especially

> > for small N bytes in the most application scenarios.

> >

> > Patch 1 implements rte_memset in the file rte_memset.h on IA platform

> > The file supports three types of instruction sets including sse & avx

> > (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of

> > vectorization and inline function to improve the perf on IA. In

> > addition, cache line and memory alignment are fully taken into

> consideration.

> >

> > Patch 2 implements functional autotest to validates the function

> > whether to work in a right way.

> >

> > Patch 3 implements performance autotest separately in cache and memory.

> >

> > Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring

> > 3%~4% performance improvements on IA platform from virtio/vhost

> > non-mergeable loopback testing.

> >

> > Zhiyong Yang (4):

> >   eal/common: introduce rte_memset on IA platform

> >   app/test: add functional autotest for rte_memset

> >   app/test: add performance autotest for rte_memset

> >   lib/librte_vhost: improve vhost perf using rte_memset

> >

> >  app/test/Makefile                                  |   3 +

> >  app/test/test_memset.c                             | 158 +++++++++

> >  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++

> >  doc/guides/rel_notes/release_17_02.rst             |  11 +

> >  .../common/include/arch/x86/rte_memset.h           | 376

> +++++++++++++++++++++

> >  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++

> >  lib/librte_vhost/virtio_net.c                      |  18 +-

> >  7 files changed, 958 insertions(+), 7 deletions(-)  create mode

> > 100644 app/test/test_memset.c  create mode 100644

> > app/test/test_memset_perf.c  create mode 100644

> > lib/librte_eal/common/include/arch/x86/rte_memset.h

> >  create mode 100644

> lib/librte_eal/common/include/generic/rte_memset.h

> >

> 

> Thanks for the series, idea looks good to me.

> 

> Wouldn't be worth to also use rte_memset in Virtio PMD (not

> compiled/tested)? :

> 


I think  rte_memset  maybe can bring some benefit here,  but , I'm not clear how to
enter the branch and test it. :) 

thanks
Zhiyong

> diff --git a/drivers/net/virtio/virtio_rxtx.c

> b/drivers/net/virtio/virtio_rxtx.c

> index 22d97a4..a5f70c4 100644

> --- a/drivers/net/virtio/virtio_rxtx.c

> +++ b/drivers/net/virtio/virtio_rxtx.c

> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,

> struct rte_mbuf *cookie,

>                          rte_pktmbuf_prepend(cookie, head_size);

>                  /* if offload disabled, it is not zeroed below, do it now */

>                  if (offload == 0)

> -                       memset(hdr, 0, head_size);

> +                       rte_memset(hdr, 0, head_size);

>          } else if (use_indirect) {

>                  /* setup tx ring slot to point to indirect

>                   * descriptor list stored in reserved region.

> 

> Cheers,

> Maxime
  
Maxime Coquelin Dec. 6, 2016, 8:29 a.m. UTC | #2
On 12/06/2016 07:33 AM, Yang, Zhiyong wrote:
> Hi, Maxime:
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Friday, December 2, 2016 6:01 PM
>> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org
>> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce
>> <bruce.richardson@intel.com>; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
>> and related test
>>
>> Hi Zhiyong,
>>
>> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:
>>> DPDK code has met performance drop badly in some case when calling
>>> glibc function memset. Reference to discussions about memset in
>>> http://dpdk.org/ml/archives/dev/2016-October/048628.html
>>> It is necessary to introduce more high efficient function to fix it.
>>> One important thing about rte_memset is that we can get clear control
>>> on what instruction flow is used.
>>>
>>> This patchset introduces rte_memset to bring more high efficient
>>> implementation, and will bring obvious perf improvement, especially
>>> for small N bytes in the most application scenarios.
>>>
>>> Patch 1 implements rte_memset in the file rte_memset.h on IA platform
>>> The file supports three types of instruction sets including sse & avx
>>> (128bits), avx2(256bits) and avx512(512bits). rte_memset makes use of
>>> vectorization and inline function to improve the perf on IA. In
>>> addition, cache line and memory alignment are fully taken into
>> consideration.
>>>
>>> Patch 2 implements functional autotest to validates the function
>>> whether to work in a right way.
>>>
>>> Patch 3 implements performance autotest separately in cache and memory.
>>>
>>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring
>>> 3%~4% performance improvements on IA platform from virtio/vhost
>>> non-mergeable loopback testing.
>>>
>>> Zhiyong Yang (4):
>>>   eal/common: introduce rte_memset on IA platform
>>>   app/test: add functional autotest for rte_memset
>>>   app/test: add performance autotest for rte_memset
>>>   lib/librte_vhost: improve vhost perf using rte_memset
>>>
>>>  app/test/Makefile                                  |   3 +
>>>  app/test/test_memset.c                             | 158 +++++++++
>>>  app/test/test_memset_perf.c                        | 348 +++++++++++++++++++
>>>  doc/guides/rel_notes/release_17_02.rst             |  11 +
>>>  .../common/include/arch/x86/rte_memset.h           | 376
>> +++++++++++++++++++++
>>>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++
>>>  lib/librte_vhost/virtio_net.c                      |  18 +-
>>>  7 files changed, 958 insertions(+), 7 deletions(-)  create mode
>>> 100644 app/test/test_memset.c  create mode 100644
>>> app/test/test_memset_perf.c  create mode 100644
>>> lib/librte_eal/common/include/arch/x86/rte_memset.h
>>>  create mode 100644
>> lib/librte_eal/common/include/generic/rte_memset.h
>>>
>>
>> Thanks for the series, idea looks good to me.
>>
>> Wouldn't be worth to also use rte_memset in Virtio PMD (not
>> compiled/tested)? :
>>
>
> I think  rte_memset  maybe can bring some benefit here,  but , I'm not clear how to
> enter the branch and test it. :)

Indeed, you will need Pierre's patch:
[dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set

Thanks,
Maxime
>
> thanks
> Zhiyong
>
>> diff --git a/drivers/net/virtio/virtio_rxtx.c
>> b/drivers/net/virtio/virtio_rxtx.c
>> index 22d97a4..a5f70c4 100644
>> --- a/drivers/net/virtio/virtio_rxtx.c
>> +++ b/drivers/net/virtio/virtio_rxtx.c
>> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,
>> struct rte_mbuf *cookie,
>>                          rte_pktmbuf_prepend(cookie, head_size);
>>                  /* if offload disabled, it is not zeroed below, do it now */
>>                  if (offload == 0)
>> -                       memset(hdr, 0, head_size);
>> +                       rte_memset(hdr, 0, head_size);
>>          } else if (use_indirect) {
>>                  /* setup tx ring slot to point to indirect
>>                   * descriptor list stored in reserved region.
>>
>> Cheers,
>> Maxime
  
Yang, Zhiyong Dec. 7, 2016, 9:28 a.m. UTC | #3
Hi, Maxime:

> -----Original Message-----

> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]

> Sent: Tuesday, December 6, 2016 4:30 PM

> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org

> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce

> <bruce.richardson@intel.com>; Ananyev, Konstantin

> <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister)

> <ppfister@cisco.com>

> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset

> and related test

> 

> 

> 

> On 12/06/2016 07:33 AM, Yang, Zhiyong wrote:

> > Hi, Maxime:

> >

> >> -----Original Message-----

> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]

> >> Sent: Friday, December 2, 2016 6:01 PM

> >> To: Yang, Zhiyong <zhiyong.yang@intel.com>; dev@dpdk.org

> >> Cc: yuanhan.liu@linux.intel.com; Richardson, Bruce

> >> <bruce.richardson@intel.com>; Ananyev, Konstantin

> >> <konstantin.ananyev@intel.com>

> >> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset

> >> and related test

> >>

> >> Hi Zhiyong,

> >>

> >> On 12/05/2016 09:26 AM, Zhiyong Yang wrote:

> >>> DPDK code has met performance drop badly in some case when calling

> >>> glibc function memset. Reference to discussions about memset in

> >>> http://dpdk.org/ml/archives/dev/2016-October/048628.html

> >>> It is necessary to introduce more high efficient function to fix it.

> >>> One important thing about rte_memset is that we can get clear

> >>> control on what instruction flow is used.

> >>>

> >>> This patchset introduces rte_memset to bring more high efficient

> >>> implementation, and will bring obvious perf improvement, especially

> >>> for small N bytes in the most application scenarios.

> >>>

> >>> Patch 1 implements rte_memset in the file rte_memset.h on IA

> >>> platform The file supports three types of instruction sets including

> >>> sse & avx (128bits), avx2(256bits) and avx512(512bits). rte_memset

> >>> makes use of vectorization and inline function to improve the perf

> >>> on IA. In addition, cache line and memory alignment are fully taken

> >>> into

> >> consideration.

> >>>

> >>> Patch 2 implements functional autotest to validates the function

> >>> whether to work in a right way.

> >>>

> >>> Patch 3 implements performance autotest separately in cache and

> memory.

> >>>

> >>> Patch 4 Using rte_memset instead of copy_virtio_net_hdr can bring

> >>> 3%~4% performance improvements on IA platform from virtio/vhost

> >>> non-mergeable loopback testing.

> >>>

> >>> Zhiyong Yang (4):

> >>>   eal/common: introduce rte_memset on IA platform

> >>>   app/test: add functional autotest for rte_memset

> >>>   app/test: add performance autotest for rte_memset

> >>>   lib/librte_vhost: improve vhost perf using rte_memset

> >>>

> >>>  app/test/Makefile                                  |   3 +

> >>>  app/test/test_memset.c                             | 158 +++++++++

> >>>  app/test/test_memset_perf.c                        | 348

> +++++++++++++++++++

> >>>  doc/guides/rel_notes/release_17_02.rst             |  11 +

> >>>  .../common/include/arch/x86/rte_memset.h           | 376

> >> +++++++++++++++++++++

> >>>  lib/librte_eal/common/include/generic/rte_memset.h |  51 +++

> >>>  lib/librte_vhost/virtio_net.c                      |  18 +-

> >>>  7 files changed, 958 insertions(+), 7 deletions(-)  create mode

> >>> 100644 app/test/test_memset.c  create mode 100644

> >>> app/test/test_memset_perf.c  create mode 100644

> >>> lib/librte_eal/common/include/arch/x86/rte_memset.h

> >>>  create mode 100644

> >> lib/librte_eal/common/include/generic/rte_memset.h

> >>>

> >>

> >> Thanks for the series, idea looks good to me.

> >>

> >> Wouldn't be worth to also use rte_memset in Virtio PMD (not

> >> compiled/tested)? :

> >>

> >

> > I think  rte_memset  maybe can bring some benefit here,  but , I'm not

> > clear how to enter the branch and test it. :)

> 

> Indeed, you will need Pierre's patch:

> [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set

> 

> Thanks,

> Maxime

> >

Thank you Maxime.
I can see a little, but not obviously  performance improvement here.  
You know, memset(hdr, 0, head_size); only consumes  fewer cycles for virtio pmd. 
head_size only  10 or 12 bytes.
I optimize rte_memset perf further for N=8~15 bytes.
The main purpose of Introducing rte_memset is that we can use it
to avoid perf drop issue instead of glibc memset on some platform, I think. 

> >

> >> diff --git a/drivers/net/virtio/virtio_rxtx.c

> >> b/drivers/net/virtio/virtio_rxtx.c

> >> index 22d97a4..a5f70c4 100644

> >> --- a/drivers/net/virtio/virtio_rxtx.c

> >> +++ b/drivers/net/virtio/virtio_rxtx.c

> >> @@ -287,7 +287,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq,

> >> struct rte_mbuf *cookie,

> >>                          rte_pktmbuf_prepend(cookie, head_size);

> >>                  /* if offload disabled, it is not zeroed below, do it now */

> >>                  if (offload == 0)

> >> -                       memset(hdr, 0, head_size);

> >> +                       rte_memset(hdr, 0, head_size);

> >>          } else if (use_indirect) {

> >>                  /* setup tx ring slot to point to indirect

> >>                   * descriptor list stored in reserved region.

> >>

> >> Cheers,

> >> Maxime
  
Yuanhan Liu Dec. 7, 2016, 9:37 a.m. UTC | #4
On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > >> compiled/tested)? :
> > >>
> > >
> > > I think  rte_memset  maybe can bring some benefit here,  but , I'm not
> > > clear how to enter the branch and test it. :)
> > 
> > Indeed, you will need Pierre's patch:
> > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set

I will apply it shortly.

> > Thanks,
> > Maxime
> > >
> Thank you Maxime.
> I can see a little, but not obviously  performance improvement here.  

Are you you have run into that code piece? FYI, you have to enable
virtio 1.0 explicitly, which is disabled by deafault.

> You know, memset(hdr, 0, head_size); only consumes  fewer cycles for virtio pmd. 
> head_size only  10 or 12 bytes.
> I optimize rte_memset perf further for N=8~15 bytes.
> The main purpose of Introducing rte_memset is that we can use it
> to avoid perf drop issue instead of glibc memset on some platform, I think. 

For this case (as well as the 4th patch), it's more about making sure
rte_memset is inlined.

	--yliu
  
Yang, Zhiyong Dec. 7, 2016, 9:43 a.m. UTC | #5
Hi, yuanhan:

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Wednesday, December 7, 2016 5:38 PM
> To: Yang, Zhiyong <zhiyong.yang@intel.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; dev@dpdk.org;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Pierre Pfister (ppfister)
> <ppfister@cisco.com>
> Subject: Re: [dpdk-dev] [PATCH 0/4] eal/common: introduce rte_memset
> and related test
> 
> On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > > >> compiled/tested)? :
> > > >>
> > > >
> > > > I think  rte_memset  maybe can bring some benefit here,  but , I'm
> > > > not clear how to enter the branch and test it. :)
> > >
> > > Indeed, you will need Pierre's patch:
> > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set
> 
> I will apply it shortly.
> 
> > > Thanks,
> > > Maxime
> > > >
> > Thank you Maxime.
> > I can see a little, but not obviously  performance improvement here.
> 
> Are you you have run into that code piece? FYI, you have to enable virtio 1.0
> explicitly, which is disabled by deafault.

Yes. I use the patch from Pierre and set offload  = 0 ; 
Thanks
Zhiyong

> 
> > You know, memset(hdr, 0, head_size); only consumes  fewer cycles for
> virtio pmd.
> > head_size only  10 or 12 bytes.
> > I optimize rte_memset perf further for N=8~15 bytes.
> > The main purpose of Introducing rte_memset is that we can use it to
> > avoid perf drop issue instead of glibc memset on some platform, I think.
> 
> For this case (as well as the 4th patch), it's more about making sure
> rte_memset is inlined.
> 
> 	--yliu
  
Yuanhan Liu Dec. 7, 2016, 9:48 a.m. UTC | #6
On Wed, Dec 07, 2016 at 09:43:06AM +0000, Yang, Zhiyong wrote:
> > On Wed, Dec 07, 2016 at 09:28:17AM +0000, Yang, Zhiyong wrote:
> > > > >> Wouldn't be worth to also use rte_memset in Virtio PMD (not
> > > > >> compiled/tested)? :
> > > > >>
> > > > >
> > > > > I think  rte_memset  maybe can bring some benefit here,  but , I'm
> > > > > not clear how to enter the branch and test it. :)
> > > >
> > > > Indeed, you will need Pierre's patch:
> > > > [dpdk-dev] [PATCH] virtio: tx with can_push when VERSION_1 is set
> > 
> > I will apply it shortly.
> > 
> > > > Thanks,
> > > > Maxime
> > > > >
> > > Thank you Maxime.
> > > I can see a little, but not obviously  performance improvement here.
> > 
> > Are you you have run into that code piece? FYI, you have to enable virtio 1.0
> > explicitly, which is disabled by deafault.
> 
> Yes. I use the patch from Pierre and set offload  = 0 ; 

I meant virtio 1.0. Have you added following options for the QEMU virtio-net
device?

    disable-modern=false

	--yliu
  

Patch

diff --git a/drivers/net/virtio/virtio_rxtx.c 
b/drivers/net/virtio/virtio_rxtx.c
index 22d97a4..a5f70c4 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -287,7 +287,7 @@  virtqueue_enqueue_xmit(struct virtnet_tx *txvq, 
struct rte_mbuf *cookie,
                         rte_pktmbuf_prepend(cookie, head_size);
                 /* if offload disabled, it is not zeroed below, do it 
now */
                 if (offload == 0)
-                       memset(hdr, 0, head_size);
+                       rte_memset(hdr, 0, head_size);
         } else if (use_indirect) {