[v5] eal: add cache-line demote support

Message ID 1601512112-12577-2-git-send-email-omkar.maslekar@intel.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand

Checks

Context                     Check    Description
ci/checkpatch               success  coding style OK
ci/Intel-compilation        fail     apply issues
ci/iol-broadcom-Functional  success  Functional Testing PASS
ci/iol-intel-Functional     success  Functional Testing PASS
ci/iol-testing              success  Testing PASS

Commit Message

Omkar Maslekar Oct. 1, 2020, 12:28 a.m. UTC
rte_cldemote is similar to a prefetch hint, but in reverse. cldemote(addr)
enables software to hint to hardware that a cache line is likely to be shared.
It is useful in core-to-core communication where a cache line is likely to be
shared. The ARM and PPC implementations are provided as NOPs; real
implementations can be added if equivalent instructions exist on those
architectures.

Signed-off-by: Omkar Maslekar <omkar.maslekar@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>

---
v5: documentation updated
    fixed formatting issue in release notes
    added Acked-by: Bruce Richardson <bruce.richardson@intel.com>
*
v4: updated bold text for title and fixed margin in release notes
*
v3: fixed warning regarding whitespace
*
v2: documentation updated
---
---
 doc/guides/rel_notes/release_20_11.rst        |  7 +++++++
 lib/librte_eal/arm/include/rte_prefetch_32.h  |  5 +++++
 lib/librte_eal/arm/include/rte_prefetch_64.h  |  5 +++++
 lib/librte_eal/include/generic/rte_prefetch.h | 14 ++++++++++++++
 lib/librte_eal/ppc/include/rte_prefetch.h     |  5 +++++
 lib/librte_eal/x86/include/rte_prefetch.h     |  9 +++++++++
 6 files changed, 45 insertions(+)
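
As a rough illustration of the core-to-core case described in the commit message, the sketch below has a producer fill a one-cache-line message slot, publish it, and then issue the demote hint. The names `demote_hint`, `struct msg_slot`, `msg_publish` and `msg_poll` are hypothetical and not part of DPDK. The 64-byte line size and the added "memory" clobber are assumptions of this sketch, not something the patch specifies.

```c
#include <stdint.h>

#define MSG_LINE_SIZE 64 /* assumed cache-line size */

/* Hypothetical stand-in for the patch's rte_cldemote(): same bytes on x86,
 * no-op elsewhere. The "memory" clobber is an addition of this sketch so the
 * compiler does not move the hint ahead of the stores it follows. */
static inline void demote_hint(const volatile void *p)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile(".byte 0x0f, 0x1c, 0x06" :: "S" (p) : "memory");
#else
	(void)p;
#endif
}

/* Hypothetical message slot that occupies exactly one cache line. */
struct msg_slot {
	uint64_t payload[7];
	uint64_t ready; /* written last by the producer */
} __attribute__((aligned(MSG_LINE_SIZE)));

/* Producer: fill the slot, publish it, then hint that another core reads it. */
static inline void msg_publish(struct msg_slot *slot, const uint64_t *data)
{
	for (int i = 0; i < 7; i++)
		slot->payload[i] = data[i];
	__atomic_store_n(&slot->ready, 1, __ATOMIC_RELEASE);
	demote_hint(slot);
}

/* Consumer: returns 1 and copies the payload out if a message is ready. */
static inline int msg_poll(struct msg_slot *slot, uint64_t *out)
{
	if (!__atomic_load_n(&slot->ready, __ATOMIC_ACQUIRE))
		return 0;
	for (int i = 0; i < 7; i++)
		out[i] = slot->payload[i];
	return 1;
}
```

Because the instruction is only a hint, the handoff stays correct whether or not the hardware acts on it; only the latency seen by the consumer may differ.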
  

Comments

David Marchand Oct. 8, 2020, 7:09 a.m. UTC | #1
On Thu, Oct 1, 2020 at 2:30 AM Omkar Maslekar <omkar.maslekar@intel.com> wrote:
>
> rte_cldemote is similar to a prefetch hint - in reverse. cldemote(addr)
> enables software to hint to hardware that line is likely to be shared.
> Useful in core-to-core communications where cache-line is likely to be
> shared. ARM and PPC implementation is provided with NOP and can be added
> if any equivalent instructions could be used for implementation on those
> architectures.
>
> Signed-off-by: Omkar Maslekar <omkar.maslekar@intel.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>

I find this "rte_cldemote" name too close to the Intel instruction,
but I can see no complaint from other arch maintainers, so I guess
everyone is happy with it.
In any case, this is a new API, so it should be marked experimental.

As for unit tests, not sure there is much to do, maybe rename
test_prefetch.c and call this new API too, wdyt?
  
Bruce Richardson Oct. 8, 2020, 9:02 a.m. UTC | #2
On Thu, Oct 08, 2020 at 09:09:52AM +0200, David Marchand wrote:
> On Thu, Oct 1, 2020 at 2:30 AM Omkar Maslekar <omkar.maslekar@intel.com> wrote:
> >
> > rte_cldemote is similar to a prefetch hint - in reverse. cldemote(addr)
> > enables software to hint to hardware that line is likely to be shared.
> > Useful in core-to-core communications where cache-line is likely to be
> > shared. ARM and PPC implementation is provided with NOP and can be added
> > if any equivalent instructions could be used for implementation on those
> > architectures.
> >
> > Signed-off-by: Omkar Maslekar <omkar.maslekar@intel.com>
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> I find this "rte_cldemote" name too close to the Intel instruction,
> but I can see no complaint from other arch maintainers, so I guess
> everyone is happy with it.

It is very close, alright - though the name also conveys fairly well the
likely action performed by the instruction. Is there a suggestion for a
better, more generic name?

> In any case, this is a new API, so it should be marked experimental.
> 
Agreed.

> As for unit tests, not sure there is much to do, maybe rename
> test_prefetch.c and call this new API too, wdyt?
> 
I'm not sure how much value this would provide, but it can be done.
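
A smoke test along the lines discussed here could be as small as the sketch below, which only checks that the call is safe on a valid address and leaves the data untouched. `demote_hint` is a hypothetical stand-in for `rte_cldemote` so the example builds without DPDK, and `test_cldemote` is an invented name; how the real test file is organised is not shown in this thread.

```c
/* Hypothetical stand-in for rte_cldemote(): raw bytes on x86, no-op elsewhere. */
static inline void demote_hint(const volatile void *p)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile(".byte 0x0f, 0x1c, 0x06" :: "S" (p));
#else
	(void)p;
#endif
}

/* Returns 0 on success, following the usual unit-test convention. */
static int test_cldemote(void)
{
	int a = 42;

	/* The hint must not fault and must not change the value. */
	demote_hint(&a);

	return a == 42 ? 0 : -1;
}
```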
  
Jerin Jacob Oct. 8, 2020, 1:12 p.m. UTC | #3
On Thu, Oct 1, 2020 at 6:00 AM Omkar Maslekar <omkar.maslekar@intel.com> wrote:
>
> rte_cldemote is similar to a prefetch hint - in reverse. cldemote(addr)
> enables software to hint to hardware that line is likely to be shared.
> Useful in core-to-core communications where cache-line is likely to be
> shared. ARM and PPC implementation is provided with NOP and can be added
> if any equivalent instructions could be used for implementation on those
> architectures.
>
> Signed-off-by: Omkar Maslekar <omkar.maslekar@intel.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v5: documentation updated
>     fixed formatting issue in release notes
>     added Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> *
> v4: updated bold text for title and fixed margin in release notes
> *
> v3: fixed warning regarding whitespace
> *
> v2: documentation updated
> ---
> ---
>  doc/guides/rel_notes/release_20_11.rst        |  7 +++++++
>  lib/librte_eal/arm/include/rte_prefetch_32.h  |  5 +++++
>  lib/librte_eal/arm/include/rte_prefetch_64.h  |  5 +++++
>  lib/librte_eal/include/generic/rte_prefetch.h | 14 ++++++++++++++
>  lib/librte_eal/ppc/include/rte_prefetch.h     |  5 +++++
>  lib/librte_eal/x86/include/rte_prefetch.h     |  9 +++++++++
>  6 files changed, 45 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index df227a1..dc402ab 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -55,6 +55,13 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
>
> +* **Added new function rte_cldemote in rte_prefetch.h.**
> +
> +  Added a hardware hint CLDEMOTE, which is similar to prefetch in reverse.
> +  CLDEMOTE moves the cache line to the more remote cache, where it expects
> +  sharing to be efficient. Moving the cache line to a level more distant from
> +  the processor helps to accelerate core-to-core communication.
> +
>
>  Removed Items
>  -------------
> diff --git a/lib/librte_eal/arm/include/rte_prefetch_32.h b/lib/librte_eal/arm/include/rte_prefetch_32.h
> index e53420a..ad91edd 100644
> --- a/lib/librte_eal/arm/include/rte_prefetch_32.h
> +++ b/lib/librte_eal/arm/include/rte_prefetch_32.h
> @@ -33,6 +33,11 @@ static inline void rte_prefetch_non_temporal(const volatile void *p)
>         rte_prefetch0(p);
>  }
>
> +static inline void rte_cldemote(const volatile void *p)
> +{
> +       RTE_SET_USED(p);
> +}
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/arm/include/rte_prefetch_64.h b/lib/librte_eal/arm/include/rte_prefetch_64.h
> index fc2b391..35d278a 100644
> --- a/lib/librte_eal/arm/include/rte_prefetch_64.h
> +++ b/lib/librte_eal/arm/include/rte_prefetch_64.h
> @@ -32,6 +32,11 @@ static inline void rte_prefetch_non_temporal(const volatile void *p)
>         asm volatile ("PRFM PLDL1STRM, [%0]" : : "r" (p));
>  }
>
> +static inline void rte_cldemote(const volatile void *p)
> +{
> +       RTE_SET_USED(p);
> +}

ARM64 does not have this support so NOP is fine for this.

Acked-by: Jerin Jacob <jerinj@marvell.com>





> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/include/generic/rte_prefetch.h b/lib/librte_eal/include/generic/rte_prefetch.h
> index 6e47bdf..5500cd5 100644
> --- a/lib/librte_eal/include/generic/rte_prefetch.h
> +++ b/lib/librte_eal/include/generic/rte_prefetch.h
> @@ -51,4 +51,18 @@
>   */
>  static inline void rte_prefetch_non_temporal(const volatile void *p);
>
> +/**
> + * Demote a cache line to a more distant level of cache from the processor.
> + *
> + * CLDEMOTE hints to hardware to move (demote) a cache line from the closest to
> + * the processor to a level more distant from the processor. It is a hint and
> + * not guarantee. rte_cldemote is intended to move the cache line to the more
> + * remote cache, where it expects sharing to be efficient and to indicate that a
> + * line may be accessed by a different core in the future.
> + *
> + * @param p
> + *   Address to demote
> + */
> +static inline void rte_cldemote(const volatile void *p);
> +
>  #endif /* _RTE_PREFETCH_H_ */
> diff --git a/lib/librte_eal/ppc/include/rte_prefetch.h b/lib/librte_eal/ppc/include/rte_prefetch.h
> index 9ba07c8..3fe9655 100644
> --- a/lib/librte_eal/ppc/include/rte_prefetch.h
> +++ b/lib/librte_eal/ppc/include/rte_prefetch.h
> @@ -34,6 +34,11 @@ static inline void rte_prefetch_non_temporal(const volatile void *p)
>         rte_prefetch0(p);
>  }
>
> +static inline void rte_cldemote(const volatile void *p)
> +{
> +       RTE_SET_USED(p);
> +}
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/librte_eal/x86/include/rte_prefetch.h b/lib/librte_eal/x86/include/rte_prefetch.h
> index 384c6b3..029d06e 100644
> --- a/lib/librte_eal/x86/include/rte_prefetch.h
> +++ b/lib/librte_eal/x86/include/rte_prefetch.h
> @@ -32,6 +32,15 @@ static inline void rte_prefetch_non_temporal(const volatile void *p)
>         asm volatile ("prefetchnta %[p]" : : [p] "m" (*(const volatile char *)p));
>  }
>
> +/*
> + * we're using raw byte codes for now as only the newest compiler
> + * versions support this instruction natively.
> + */
> +static inline void rte_cldemote(const volatile void *p)
> +{
> +       asm volatile(".byte 0x0f, 0x1c, 0x06" :: "S" (p));
> +}
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 1.8.3.1
>
  
David Marchand Oct. 12, 2020, 9:41 a.m. UTC | #4
On Thu, Oct 8, 2020 at 11:02 AM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Oct 08, 2020 at 09:09:52AM +0200, David Marchand wrote:
> > On Thu, Oct 1, 2020 at 2:30 AM Omkar Maslekar <omkar.maslekar@intel.com> wrote:
> > >
> > > rte_cldemote is similar to a prefetch hint - in reverse. cldemote(addr)
> > > enables software to hint to hardware that line is likely to be shared.
> > > Useful in core-to-core communications where cache-line is likely to be
> > > shared. ARM and PPC implementation is provided with NOP and can be added
> > > if any equivalent instructions could be used for implementation on those
> > > architectures.
> > >
> > > Signed-off-by: Omkar Maslekar <omkar.maslekar@intel.com>
> > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> > I find this "rte_cldemote" name too close to the Intel instruction,
> > but I can see no complaint from other arch maintainers, so I guess
> > everyone is happy with it.
>
> > It is very close, alright - though the name also conveys fairly well the
> > likely action performed by the instruction. Is there a suggestion for a
> > better, more generic name?

I don't have a better suggestion.


The prefetch API has some hints on the level of cache to put data in.
For this new API, we have no indication, would it make sense?


Is this available on all Intel CPUs supported with DPDK?
No cpuflag check needed?
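
For reference, a runtime check for the feature could look like the sketch below. It assumes a GCC- or clang-style `<cpuid.h>` providing `__get_cpuid_count`, and reads the CLDEMOTE feature bit (CPUID leaf 7, subleaf 0, ECX bit 25). `cpu_has_cldemote` is a hypothetical helper, not an existing DPDK function. Intel documents the encoding as being treated as a NOP on processors that lack the instruction, so such a check may matter more for reporting than for safety; treat that as something to confirm against the instruction reference.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports CLDEMOTE support, 0 otherwise
 * (including on non-x86 builds). */
static int cpu_has_cldemote(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* Leaf 7, subleaf 0: structured extended feature flags. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;

	return (ecx >> 25) & 1; /* ECX bit 25 = CLDEMOTE */
#else
	return 0;
#endif
}
```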


>
> > In any case, this is a new API, so it should be marked experimental.
> >
> Agreed.
>
> > As for unit tests, not sure there is much to do, maybe rename
> > test_prefetch.c and call this new API too, wdyt?
> >
> I'm not sure how much value this would provide, but it can be done.

As much as the existing test, checking we can call this API.
If you think it is not worth it, we can drop the prefetch ut code.
  

Patch

diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index df227a1..dc402ab 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,13 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added new function rte_cldemote in rte_prefetch.h.**
+
+  Added a hardware hint CLDEMOTE, which is similar to prefetch in reverse.
+  CLDEMOTE moves the cache line to the more remote cache, where it expects
+  sharing to be efficient. Moving the cache line to a level more distant from
+  the processor helps to accelerate core-to-core communication.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_eal/arm/include/rte_prefetch_32.h b/lib/librte_eal/arm/include/rte_prefetch_32.h
index e53420a..ad91edd 100644
--- a/lib/librte_eal/arm/include/rte_prefetch_32.h
+++ b/lib/librte_eal/arm/include/rte_prefetch_32.h
@@ -33,6 +33,11 @@  static inline void rte_prefetch_non_temporal(const volatile void *p)
 	rte_prefetch0(p);
 }
 
+static inline void rte_cldemote(const volatile void *p)
+{
+	RTE_SET_USED(p);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/include/rte_prefetch_64.h b/lib/librte_eal/arm/include/rte_prefetch_64.h
index fc2b391..35d278a 100644
--- a/lib/librte_eal/arm/include/rte_prefetch_64.h
+++ b/lib/librte_eal/arm/include/rte_prefetch_64.h
@@ -32,6 +32,11 @@  static inline void rte_prefetch_non_temporal(const volatile void *p)
 	asm volatile ("PRFM PLDL1STRM, [%0]" : : "r" (p));
 }
 
+static inline void rte_cldemote(const volatile void *p)
+{
+	RTE_SET_USED(p);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/include/generic/rte_prefetch.h b/lib/librte_eal/include/generic/rte_prefetch.h
index 6e47bdf..5500cd5 100644
--- a/lib/librte_eal/include/generic/rte_prefetch.h
+++ b/lib/librte_eal/include/generic/rte_prefetch.h
@@ -51,4 +51,18 @@ 
  */
 static inline void rte_prefetch_non_temporal(const volatile void *p);
 
+/**
+ * Demote a cache line to a more distant level of cache from the processor.
+ *
+ * CLDEMOTE hints to hardware to move (demote) a cache line from the cache
+ * level closest to the processor to a level more distant from it. It is a
+ * hint and not a guarantee. rte_cldemote is intended to move the cache line
+ * to a more remote cache, where sharing is expected to be efficient, and to
+ * indicate that the line may be accessed by a different core in the future.
+ *
+ * @param p
+ *   Address to demote
+ */
+static inline void rte_cldemote(const volatile void *p);
+
 #endif /* _RTE_PREFETCH_H_ */
diff --git a/lib/librte_eal/ppc/include/rte_prefetch.h b/lib/librte_eal/ppc/include/rte_prefetch.h
index 9ba07c8..3fe9655 100644
--- a/lib/librte_eal/ppc/include/rte_prefetch.h
+++ b/lib/librte_eal/ppc/include/rte_prefetch.h
@@ -34,6 +34,11 @@  static inline void rte_prefetch_non_temporal(const volatile void *p)
 	rte_prefetch0(p);
 }
 
+static inline void rte_cldemote(const volatile void *p)
+{
+	RTE_SET_USED(p);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/x86/include/rte_prefetch.h b/lib/librte_eal/x86/include/rte_prefetch.h
index 384c6b3..029d06e 100644
--- a/lib/librte_eal/x86/include/rte_prefetch.h
+++ b/lib/librte_eal/x86/include/rte_prefetch.h
@@ -32,6 +32,15 @@  static inline void rte_prefetch_non_temporal(const volatile void *p)
 	asm volatile ("prefetchnta %[p]" : : [p] "m" (*(const volatile char *)p));
 }
 
+/*
+ * we're using raw byte codes for now as only the newest compiler
+ * versions support this instruction natively.
+ */
+static inline void rte_cldemote(const volatile void *p)
+{
+	asm volatile(".byte 0x0f, 0x1c, 0x06" :: "S" (p));
+}
+
 #ifdef __cplusplus
 }
 #endif
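
The raw bytes in the x86 hunk can be read as the two-byte opcode `0f 1c` followed by a ModRM byte of `0x06`. The sketch below splits a ModRM byte into its three fields to show why the operand is pinned to the `"S"` (RSI/ESI) constraint: for `0x06`, mod is 0 and rm is 6, which selects `[rsi]` in 64-bit addressing, while reg is 0, the `/0` opcode extension. `struct modrm` and `modrm_decode` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* The three bit fields of an x86 ModRM byte. */
struct modrm {
	unsigned int mod; /* bits 7..6: addressing mode */
	unsigned int reg; /* bits 5..3: register or opcode extension */
	unsigned int rm;  /* bits 2..0: base register or memory operand */
};

static struct modrm modrm_decode(uint8_t byte)
{
	struct modrm m = {
		.mod = (byte >> 6) & 0x3,
		.reg = (byte >> 3) & 0x7,
		.rm  = byte & 0x7,
	};
	return m;
}
```

Decoding `0x06` this way gives mod = 0, reg = 0, rm = 6, matching a register-indirect access through RSI with no displacement.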