[dpdk-dev,2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()

Message ID 1512126771-27503-2-git-send-email-konstantin.ananyev@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Ananyev, Konstantin Dec. 1, 2017, 11:12 a.m. UTC
  On x86 it is possible to use lock-prefixed instructions to get
a similar effect to mfence.
As pointed out by the Java folks, on most modern HW that gives
better performance than mfence:
https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
This patch adopts that technique for the rte_smp_mb() implementation.
On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
i.e. from ~110 to ~55 cycles per operation.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
 1 file changed, 43 insertions(+), 2 deletions(-)
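
For illustration, a minimal standalone sketch of the idea (the function names here are invented for this example and are not part of the patch): a lock-prefixed read-modify-write on a dummy stack location acts as a full memory barrier on x86, much like mfence, but is usually cheaper on modern CPUs.

#include <emmintrin.h>	/* _mm_mfence() */

/* what rte_smp_mb() expands to today */
static inline void smp_mb_mfence(void)
{
	_mm_mfence();
}

/* the proposed replacement, x86-64 form */
static inline void smp_mb_lock_add(void)
{
	/*
	 * "lock addl $0" atomically adds zero, so the memory contents are
	 * unchanged; -128(%rsp) lies in the red zone, so no stack adjustment
	 * is needed.
	 */
	asm volatile("lock addl $0, -128(%%rsp)" ::: "memory");
}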
  

Comments

Stephen Hemminger Dec. 1, 2017, 6:04 p.m. UTC | #1
On Fri,  1 Dec 2017 11:12:51 +0000
Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:

> On x86 it is possible to use lock-prefixed instructions to get
> a similar effect to mfence.
> As pointed out by the Java folks, on most modern HW that gives
> better performance than mfence:
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> This patch adopts that technique for the rte_smp_mb() implementation.
> On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
> i.e. from ~110 to ~55 cycles per operation.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> index 4eac66631..07b7fa7f7 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> @@ -55,12 +55,53 @@ extern "C" {
>  
>  #define	rte_rmb() _mm_lfence()
>  
> -#define rte_smp_mb() rte_mb()
> -
>  #define rte_smp_wmb() rte_compiler_barrier()
>  
>  #define rte_smp_rmb() rte_compiler_barrier()
>  
> +/*
> + * From Intel Software Development Manual; Vol 3;
> + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> + * ...
> + * . Reads are not reordered with other reads.
> + * . Writes are not reordered with older reads.
> + * . Writes to memory are not reordered with other writes,
> + *   with the following exceptions:
> + *   . streaming stores (writes) executed with the non-temporal move
> + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> + *   . string operations (see Section 8.2.4.1).
> + *  ...
> + * . Reads may be reordered with older writes to different locations but not
> + * with older writes to the same location.
> + * . Reads or writes cannot be reordered with I/O instructions,
> + * locked instructions, or serializing instructions.
> + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> + * . LFENCE instructions cannot pass earlier reads.
> + * . SFENCE instructions cannot pass earlier writes ...
> + * . MFENCE instructions cannot pass earlier reads, writes ...
> + *
> + * As pointed out by the Java folks, that makes it possible to use
> + * lock-prefixed instructions to get the same effect as mfence, and on
> + * most modern HW that gives better performance than mfence:
> + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> + * So below we use that technique for the rte_smp_mb() implementation.
> + */
> +
> +#ifdef RTE_ARCH_I686
> +#define	RTE_SP	RTE_STR(esp)
> +#else
> +#define	RTE_SP	RTE_STR(rsp)
> +#endif
> +
> +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> +
> +static __rte_always_inline void
> +rte_smp_mb(void)
> +{
> +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> +}
> +
>  #define rte_io_mb() rte_mb()
>  
>  #define rte_io_wmb() rte_compiler_barrier()

The lock instruction is a stronger barrier than the compiler barrier
and has a worse performance impact. Are you sure it is necessary to use it in DPDK?
The Linux kernel has successfully used a simple compiler reordering barrier for years.

Don't confuse rte_smp_mb with the required barrier for talking to I/O devices.
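
For reference, a compiler barrier only constrains compiler reordering and emits no instruction, while a lock-prefixed add also orders the CPU's own loads and stores. A quick sketch (rte_compiler_barrier() follows the generic DPDK definition; lock_add_mb is an illustrative name, not a DPDK macro):

/* compiler barrier: blocks compiler reordering only, generates no code */
#define rte_compiler_barrier() do {		\
	asm volatile ("" : : : "memory");	\
} while(0)

/* lock-prefixed add: also a full CPU memory barrier (x86-64 form) */
#define lock_add_mb() asm volatile("lock addl $0, -128(%%rsp)" ::: "memory")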
  
Ananyev, Konstantin Dec. 1, 2017, 11:08 p.m. UTC | #2
Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, December 1, 2017 6:04 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
> 
> On Fri,  1 Dec 2017 11:12:51 +0000
> Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> 
> > On x86 it is possible to use lock-prefixed instructions to get
> > a similar effect to mfence.
> > As pointed out by the Java folks, on most modern HW that gives
> > better performance than mfence:
> > https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > This patch adopts that technique for the rte_smp_mb() implementation.
> > On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
> > i.e. from ~110 to ~55 cycles per operation.
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
> >  1 file changed, 43 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > index 4eac66631..07b7fa7f7 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > @@ -55,12 +55,53 @@ extern "C" {
> >
> >  #define	rte_rmb() _mm_lfence()
> >
> > -#define rte_smp_mb() rte_mb()
> > -
> >  #define rte_smp_wmb() rte_compiler_barrier()
> >
> >  #define rte_smp_rmb() rte_compiler_barrier()
> >
> > +/*
> > + * From Intel Software Development Manual; Vol 3;
> > + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> > + * ...
> > + * . Reads are not reordered with other reads.
> > + * . Writes are not reordered with older reads.
> > + * . Writes to memory are not reordered with other writes,
> > + *   with the following exceptions:
> > + *   . streaming stores (writes) executed with the non-temporal move
> > + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> > + *   . string operations (see Section 8.2.4.1).
> > + *  ...
> > + * . Reads may be reordered with older writes to different locations but not
> > + * with older writes to the same location.
> > + * . Reads or writes cannot be reordered with I/O instructions,
> > + * locked instructions, or serializing instructions.
> > + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> > + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> > + * . LFENCE instructions cannot pass earlier reads.
> > + * . SFENCE instructions cannot pass earlier writes ...
> > + * . MFENCE instructions cannot pass earlier reads, writes ...
> > + *
> > + * As pointed out by the Java folks, that makes it possible to use
> > + * lock-prefixed instructions to get the same effect as mfence, and on
> > + * most modern HW that gives better performance than mfence:
> > + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > + * So below we use that technique for the rte_smp_mb() implementation.
> > + */
> > +
> > +#ifdef RTE_ARCH_I686
> > +#define	RTE_SP	RTE_STR(esp)
> > +#else
> > +#define	RTE_SP	RTE_STR(rsp)
> > +#endif
> > +
> > +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> > +
> > +static __rte_always_inline void
> > +rte_smp_mb(void)
> > +{
> > +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> > +}
> > +
> >  #define rte_io_mb() rte_mb()
> >
> >  #define rte_io_wmb() rte_compiler_barrier()
> 
> The lock instruction is a stronger barrier than the compiler barrier
> and has a worse performance impact. Are you sure it is necessary to use it in DPDK?
> The Linux kernel has successfully used a simple compiler reordering barrier for years.

Where do you see a compiler barrier?
Right now for x86, rte_smp_mb()==rte_mb()==mfence.
So I am replacing mfence with 'lock add'.
As the comment above says, on most modern x86 systems it is faster
while still preserving memory ordering.
Konstantin 

> 
> Don't confuse rte_smp_mb with the required barrier for talking to I/O devices.
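
For context, in the current x86 header rte_mb() is _mm_mfence() and, before this patch, rte_smp_mb() simply maps to it:

#define rte_mb()	_mm_mfence()
#define rte_smp_mb()	rte_mb()	/* i.e. mfence today; replaced by 'lock add' in this patch */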
  
Bruce Richardson Dec. 11, 2017, 5:11 p.m. UTC | #3
On Fri, Dec 01, 2017 at 11:12:51AM +0000, Konstantin Ananyev wrote:
> On x86 it is possible to use lock-prefixed instructions to get
> a similar effect to mfence.
> As pointed out by the Java folks, on most modern HW that gives
> better performance than mfence:
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> This patch adopts that technique for the rte_smp_mb() implementation.
> On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
> i.e. from ~110 to ~55 cycles per operation.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 2 deletions(-)
> 
<snip>
> + * As pointed out by the Java folks, that makes it possible to use
> + * lock-prefixed instructions to get the same effect as mfence, and on
> + * most modern HW that gives better performance than mfence:
> + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> + * So below we use that technique for the rte_smp_mb() implementation.
> + */
> +
> +#ifdef RTE_ARCH_I686
> +#define	RTE_SP	RTE_STR(esp)
> +#else
> +#define	RTE_SP	RTE_STR(rsp)
> +#endif
> +
> +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> +
> +static __rte_always_inline void
> +rte_smp_mb(void)
> +{
> +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> +}

Rather than #defining RTE_SP and RTE_MB_DUMMY_MEMP, why not just put the
#ifdef into rte_smp_mb() itself and have two asm volatile lines with
hard-coded register names in them? It would be shorter and I think a lot
easier to read.

/Bruce
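
A sketch of the suggested shape (an assumption about what the simplified code could look like, keeping the same -128(%esp)/-128(%rsp) dummy location; the actual v2 code may differ):

static __rte_always_inline void
rte_smp_mb(void)
{
#ifdef RTE_ARCH_I686
	asm volatile("lock addl $0, -128(%%esp); " ::: "memory");
#else
	asm volatile("lock addl $0, -128(%%rsp); " ::: "memory");
#endif
}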
  
Ananyev, Konstantin Dec. 11, 2017, 5:30 p.m. UTC | #4
Hi Bruce,

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, December 11, 2017 5:11 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
> 
> On Fri, Dec 01, 2017 at 11:12:51AM +0000, Konstantin Ananyev wrote:
> > On x86 it is possible to use lock-prefixed instructions to get
> > a similar effect to mfence.
> > As pointed out by the Java folks, on most modern HW that gives
> > better performance than mfence:
> > https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > This patch adopts that technique for the rte_smp_mb() implementation.
> > On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
> > i.e. from ~110 to ~55 cycles per operation.
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
> >  1 file changed, 43 insertions(+), 2 deletions(-)
> >
> <snip>
> > + * As pointed out by the Java folks, that makes it possible to use
> > + * lock-prefixed instructions to get the same effect as mfence, and on
> > + * most modern HW that gives better performance than mfence:
> > + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > + * So below we use that technique for the rte_smp_mb() implementation.
> > + */
> > +
> > +#ifdef RTE_ARCH_I686
> > +#define	RTE_SP	RTE_STR(esp)
> > +#else
> > +#define	RTE_SP	RTE_STR(rsp)
> > +#endif
> > +
> > +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> > +
> > +static __rte_always_inline void
> > +rte_smp_mb(void)
> > +{
> > +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> > +}
> 
> Rather than #defining RTE_SP and RTE_MB_DUMMY_MEMP, why not just put the
> #ifdef into the rte_smp_mb itself and have two asm volatile lines with
> hard-coded register names in them? It would be shorter and I think a lot
> easier to read.

Fine by me.
Any other thoughts from anyone before I submit v2?
Konstantin
  
Ananyev, Konstantin Dec. 18, 2017, 3:34 p.m. UTC | #5
v2 changes:
  Address Bruce's code-review comments:
  - get rid of macros to simplify the code
  - add more comments to the test case

Konstantin Ananyev (2):
  test/test: introduce new test-case for rte_smp_mb()
  eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()

 .../common/include/arch/x86/rte_atomic.h           |  44 ++-
 test/test/Makefile                                 |   1 +
 test/test/test_mb.c                                | 315 +++++++++++++++++++++
 3 files changed, 358 insertions(+), 2 deletions(-)
 create mode 100644 test/test/test_mb.c
  
Stephen Hemminger March 8, 2018, 9:15 p.m. UTC | #6
On Fri, 1 Dec 2017 23:08:39 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> Hi Stephen,
> 
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, December 1, 2017 6:04 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
> > 
> > On Fri,  1 Dec 2017 11:12:51 +0000
> > Konstantin Ananyev <konstantin.ananyev@intel.com> wrote:
> >   
> > > On x86 it is possible to use lock-prefixed instructions to get
> > > a similar effect to mfence.
> > > As pointed out by the Java folks, on most modern HW that gives
> > > better performance than mfence:
> > > https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > > This patch adopts that technique for the rte_smp_mb() implementation.
> > > On BDW 2.2, mb_autotest on a single lcore reports a 2x cycle reduction,
> > > i.e. from ~110 to ~55 cycles per operation.
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
> > >  1 file changed, 43 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > index 4eac66631..07b7fa7f7 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > > @@ -55,12 +55,53 @@ extern "C" {
> > >
> > >  #define	rte_rmb() _mm_lfence()
> > >
> > > -#define rte_smp_mb() rte_mb()
> > > -
> > >  #define rte_smp_wmb() rte_compiler_barrier()
> > >
> > >  #define rte_smp_rmb() rte_compiler_barrier()
> > >
> > > +/*
> > > + * From Intel Software Development Manual; Vol 3;
> > > + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> > > + * ...
> > > + * . Reads are not reordered with other reads.
> > > + * . Writes are not reordered with older reads.
> > > + * . Writes to memory are not reordered with other writes,
> > > + *   with the following exceptions:
> > > + *   . streaming stores (writes) executed with the non-temporal move
> > > + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> > > + *   . string operations (see Section 8.2.4.1).
> > > + *  ...
> > > + * . Reads may be reordered with older writes to different locations but not
> > > + * with older writes to the same location.
> > > + * . Reads or writes cannot be reordered with I/O instructions,
> > > + * locked instructions, or serializing instructions.
> > > + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> > > + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> > > + * . LFENCE instructions cannot pass earlier reads.
> > > + * . SFENCE instructions cannot pass earlier writes ...
> > > + * . MFENCE instructions cannot pass earlier reads, writes ...
> > > + *
> > > + * As pointed out by the Java folks, that makes it possible to use
> > > + * lock-prefixed instructions to get the same effect as mfence, and on
> > > + * most modern HW that gives better performance than mfence:
> > > + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > > + * So below we use that technique for the rte_smp_mb() implementation.
> > > + */
> > > +
> > > +#ifdef RTE_ARCH_I686
> > > +#define	RTE_SP	RTE_STR(esp)
> > > +#else
> > > +#define	RTE_SP	RTE_STR(rsp)
> > > +#endif
> > > +
> > > +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> > > +
> > > +static __rte_always_inline void
> > > +rte_smp_mb(void)
> > > +{
> > > +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> > > +}
> > > +
> > >  #define rte_io_mb() rte_mb()
> > >
> > >  #define rte_io_wmb() rte_compiler_barrier()  
> > 
> > The lock instruction is a stronger barrier than the compiler barrier
> > and has a worse performance impact. Are you sure it is necessary to use it in DPDK?
> > The Linux kernel has successfully used a simple compiler reordering barrier for years.
> 
> Where do you see a compiler barrier?
> Right now for x86, rte_smp_mb()==rte_mb()==mfence.
> So I am replacing mfence with 'lock add'.
> As the comment above says, on most modern x86 systems it is faster
> while still preserving memory ordering.

There are cases like virtio/vhost where we could be using compiler_barrier.
The mfence-to-lock-add conversion makes sense.
  

Patch

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
index 4eac66631..07b7fa7f7 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
@@ -55,12 +55,53 @@  extern "C" {
 
 #define	rte_rmb() _mm_lfence()
 
-#define rte_smp_mb() rte_mb()
-
 #define rte_smp_wmb() rte_compiler_barrier()
 
 #define rte_smp_rmb() rte_compiler_barrier()
 
+/*
+ * From Intel Software Development Manual; Vol 3;
+ * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
+ * ...
+ * . Reads are not reordered with other reads.
+ * . Writes are not reordered with older reads.
+ * . Writes to memory are not reordered with other writes,
+ *   with the following exceptions:
+ *   . streaming stores (writes) executed with the non-temporal move
+ *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
+ *   . string operations (see Section 8.2.4.1).
+ *  ...
+ * . Reads may be reordered with older writes to different locations but not
+ * with older writes to the same location.
+ * . Reads or writes cannot be reordered with I/O instructions,
+ * locked instructions, or serializing instructions.
+ * . Reads cannot pass earlier LFENCE and MFENCE instructions.
+ * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
+ * . LFENCE instructions cannot pass earlier reads.
+ * . SFENCE instructions cannot pass earlier writes ...
+ * . MFENCE instructions cannot pass earlier reads, writes ...
+ *
+ * As pointed out by the Java folks, that makes it possible to use
+ * lock-prefixed instructions to get the same effect as mfence, and on
+ * most modern HW that gives better performance than mfence:
+ * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
+ * So below we use that technique for the rte_smp_mb() implementation.
+ */
+
+#ifdef RTE_ARCH_I686
+#define	RTE_SP	RTE_STR(esp)
+#else
+#define	RTE_SP	RTE_STR(rsp)
+#endif
+
+#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
+
+static __rte_always_inline void
+rte_smp_mb(void)
+{
+	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
+}
+
 #define rte_io_mb() rte_mb()
 
 #define rte_io_wmb() rte_compiler_barrier()