[RFC,1/6] eal: add power management intrinsics

Message ID 2772eb151ccba5cc17186e6161d8834176924753.1590598121.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series Power-optimized RX for Ethernet devices |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Burakov, Anatoly May 27, 2020, 5:02 p.m. UTC
  Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
 .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
 lib/librte_eal/x86/rte_cpuflags.c             |   2 +
 6 files changed, 203 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
  

Comments

Ananyev, Konstantin May 28, 2020, 11:39 a.m. UTC | #1
Hi Anatoly,

> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

Recently ARM guys introduced new generic API
for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
Probably would make sense to unite both APIs into something common
and HW transparent. 
Konstantin

> 
> Signed-off-by: Liang J. Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 +++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  lib/librte_eal/x86/include/rte_cpuflags.h     |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 134 ++++++++++++++++++
>  lib/librte_eal/x86/rte_cpuflags.c             |   2 +
>  6 files changed, 203 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..8646c4ac16
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index bc73ec2c5c..b54a2be4f6 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -59,6 +59,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
> index c1d20364d1..94d6a43763 100644
> --- a/lib/librte_eal/x86/include/rte_cpuflags.h
> +++ b/lib/librte_eal/x86/include/rte_cpuflags.h
> @@ -110,6 +110,7 @@ enum rte_cpu_flag_t {
>  	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
>  	RTE_CPUFLAG_EM64T,                  /**< EM64T */
> 
> +	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
>  	/* (EAX 80000007h) EDX features */
>  	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
> 
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..a0522400fb
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +	rte_mb();
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	uint64_t rflags;
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
> index 30439e7951..0325c4b93b 100644
> --- a/lib/librte_eal/x86/rte_cpuflags.c
> +++ b/lib/librte_eal/x86/rte_cpuflags.c
> @@ -110,6 +110,8 @@ const struct feature_entry rte_cpu_feature_table[] = {
>  	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
>  	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
> 
> +	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
> +
>  	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
>  	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)
> 
> --
> 2.17.1
  
Burakov, Anatoly May 28, 2020, 2:40 p.m. UTC | #2
On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> Hi Anatoly,
> 
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> Recently ARM guys introduced new generic API
> for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> Probably would make sense to unite both APIs into something common
> and HW transparent.
> Konstantin

Hi Konstantin,

That's not really similar purpose. This is monitoring a cacheline for 
writes, not waiting on a specific value. The "expected" value is there 
as basically a hack to get around the race condition due to the fact 
that by the time you enter monitoring state, the write you're waiting 
for may have already happened.
  
Bruce Richardson May 28, 2020, 2:58 p.m. UTC | #3
On Thu, May 28, 2020 at 03:40:18PM +0100, Burakov, Anatoly wrote:
> On 28-May-20 12:39 PM, Ananyev, Konstantin wrote:
> > Hi Anatoly,
> > 
> > > 
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > > 
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > 
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. The "expected" value is there as
> basically a hack to get around the race condition due to the fact that by
> the time you enter monitoring state, the write you're waiting for may have
> already happened.
> 
Rather than the "expected" value, is it not more useful for a general API
to check for the existing value? Since we are awaiting writes, we may not
know what new value will be written, but we do know what the value is now,
and can just check that it's not been changed.

Regards,
/Bruce
  
Ananyev, Konstantin May 28, 2020, 3:38 p.m. UTC | #4
> > Hi Anatoly,
> >
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > Recently ARM guys introduced new generic API
> > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > Probably would make sense to unite both APIs into something common
> > and HW transparent.
> > Konstantin
> 
> Hi Konstantin,
> 
> That's not really similar purpose. This is monitoring a cacheline for
> writes, not waiting on a specific value. 

I understand that.

> The "expected" value is there
> as basically a hack to get around the race condition due to the fact
> that by the time you enter monitoring state, the write you're waiting
> for may have already happened.

AFAIK, current rte_wait_until_equal_* does pretty much the same thing:

LDXR memaddr, $reg  // an address to monitor for
if ($reg != expected_value)
   SEVL      //     arm monitor
   do {
       WFE     //      waits for write to that memory address  
       LDXR memaddr, $reg
   } while ($reg != expected_value);   
 
Looks pretty similar to what rte_power_monitor() does,
except you don't have a loop for checking the new value.
Plus rte_power_monitor() provides extra options to the user - 
timestamp and power save mode to enter.
Also I don't know what is the granularity of such events on ARM,
is it a cache-line or more/less.
Might be ARM people can comment/correct me here. 
Konstantin
  
Jerin Jacob May 29, 2020, 6:56 a.m. UTC | #5
On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
> > > Hi Anatoly,
> > >
> > >>
> > >> Add two new power management intrinsics, and provide an implementation
> > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >> are implemented as raw byte opcodes because there is not yet widespread
> > >> compiler support for these instructions.
> > >>
> > >> The power management instructions provide an architecture-specific
> > >> function to either wait until a specified TSC timestamp is reached, or
> > >> optionally wait until either a TSC timestamp is reached or a memory
> > >> location is written to. The monitor function also provides an optional
> > >> comparison, to avoid sleeping when the expected write has already
> > >> happened, and no more writes are expected.
> > >
> > > Recently ARM guys introduced new generic API
> > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > Probably would make sense to unite both APIs into something common
> > > and HW transparent.
> > > Konstantin
> >
> > Hi Konstantin,
> >
> > That's not really similar purpose. This is monitoring a cacheline for
> > writes, not waiting on a specific value.
>
> I understand that.
>
> > The "expected" value is there
> > as basically a hack to get around the race condition due to the fact
> > that by the time you enter monitoring state, the write you're waiting
> > for may have already happened.
>
> AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
>
> LDXR memaddr, $reg  // an address to monitor for
> if ($reg != expected_value)
>    SEVL      //     arm monitor
>    do {
>        WFE     //      waits for write to that memory address
>        LDXR memaddr, $reg
>    } while ($reg != expected_value);
>
> Looks pretty similar to what rte_power_monitor() does,
> except you don't have a loop for checking the new value.
> Plus rte_power_monitor() provides extra options to the user -
> timestamp and power save mode to enter.
> Also I don't know what is the granularity of such events on ARM,
> is it a cache-line or more/less.

As I understand it, Granularity is per the cache-line.
ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
state until the cache line is written.

But I see UMONITOR bit different, Where _without_ other core signaling
to wakeup from wait state,
it can wake on TSC expiry. I think, that's is the main primitive on
this feature. Right?

WFE can also wake based on Timer stream events(kind of TSC in x86
analogy) but it has a configuration
bit that needs to allow for this scheme in userspace(EL0) or not?
defined by EL1(Linux kernel).
I am planning to spend time on this after understanding the value
addition of the feature/usecase[1]
[1]
http://mails.dpdk.org/archives/dev/2020-May/168888.html





> Might be ARM people can comment/correct me here.
> Konstantin
  
Ananyev, Konstantin June 2, 2020, 10:15 a.m. UTC | #6
> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an implementation
> > > >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >> are implemented as raw byte opcodes because there is not yet widespread
> > > >> compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an architecture-specific
> > > >> function to either wait until a specified TSC timestamp is reached, or
> > > >> optionally wait until either a TSC timestamp is reached or a memory
> > > >> location is written to. The monitor function also provides an optional
> > > >> comparison, to avoid sleeping when the expected write has already
> > > >> happened, and no more writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API
> > > > for similar (as I understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline for
> > > writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're waiting
> > > for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for
> > if ($reg != expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does,
> > except you don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM,
> > is it a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power
> state until the cache line is written.
> 
> But I see UMONITOR bit different, Where _without_ other core signaling
> to wakeup from wait state,
> it can wake on TSC expiry. I think, that's is the main primitive on
> this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
> I am planning to spend time on this after understanding the value
> addition of the feature/usecase[1]
> [1]
> http://mails.dpdk.org/archives/dev/2020-May/168888.html
> 

Ok, if there is a consensus to keep these two APIs disjoint for now -
I wouldn't insist.
Konstantin
  
Honnappa Nagarahalli June 3, 2020, 6:22 a.m. UTC | #7
<snip>

> 
> On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> >
> >
> > > > Hi Anatoly,
> > > >
> > > >>
> > > >> Add two new power management intrinsics, and provide an
> > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > >> The instructions are implemented as raw byte opcodes because
> > > >> there is not yet widespread compiler support for these instructions.
> > > >>
> > > >> The power management instructions provide an
> > > >> architecture-specific function to either wait until a specified
> > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > >> timestamp is reached or a memory location is written to. The
> > > >> monitor function also provides an optional comparison, to avoid
> > > >> sleeping when the expected write has already happened, and no more
> writes are expected.
> > > >
> > > > Recently ARM guys introduced new generic API for similar (as I
> > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > Probably would make sense to unite both APIs into something common
> > > > and HW transparent.
> > > > Konstantin
> > >
> > > Hi Konstantin,
> > >
> > > That's not really similar purpose. This is monitoring a cacheline
> > > for writes, not waiting on a specific value.
> >
> > I understand that.
> >
> > > The "expected" value is there
> > > as basically a hack to get around the race condition due to the fact
> > > that by the time you enter monitoring state, the write you're
> > > waiting for may have already happened.
> >
> > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> >
> > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > expected_value)
> >    SEVL      //     arm monitor
> >    do {
> >        WFE     //      waits for write to that memory address
> >        LDXR memaddr, $reg
> >    } while ($reg != expected_value);
> >
> > Looks pretty similar to what rte_power_monitor() does, except you
> > don't have a loop for checking the new value.
> > Plus rte_power_monitor() provides extra options to the user -
> > timestamp and power save mode to enter.
> > Also I don't know what is the granularity of such events on ARM, is it
> > a cache-line or more/less.
> 
> As I understand it, Granularity is per the cache-line.
> ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> the cache line is written.
Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.

> 
> But I see UMONITOR bit different, Where _without_ other core signaling to
> wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> primitive on this feature. Right?
> 
> WFE can also wake based on Timer stream events(kind of TSC in x86
> analogy) but it has a configuration
> bit that needs to allow for this scheme in userspace(EL0) or not?
> defined by EL1(Linux kernel).
Timer stream events are not per CPU core. They are system wide streams.

> I am planning to spend time on this after understanding the value addition of
> the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> May/168888.html
> 
> 
> 
> 
> 
> > Might be ARM people can comment/correct me here.
> > Konstantin
  
Jerin Jacob June 3, 2020, 6:31 a.m. UTC | #8
On Wed, Jun 3, 2020 at 11:53 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> >
> > On Thu, May 28, 2020 at 9:08 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > >
> > > > > Hi Anatoly,
> > > > >
> > > > >>
> > > > >> Add two new power management intrinsics, and provide an
> > > > >> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
> > > > >> The instructions are implemented as raw byte opcodes because
> > > > >> there is not yet widespread compiler support for these instructions.
> > > > >>
> > > > >> The power management instructions provide an
> > > > >> architecture-specific function to either wait until a specified
> > > > >> TSC timestamp is reached, or optionally wait until either a TSC
> > > > >> timestamp is reached or a memory location is written to. The
> > > > >> monitor function also provides an optional comparison, to avoid
> > > > >> sleeping when the expected write has already happened, and no more
> > writes are expected.
> > > > >
> > > > > Recently ARM guys introduced new generic API for similar (as I
> > > > > understand) purposes: rte_wait_until_equal_(16|32|64).
> > > > > Probably would make sense to unite both APIs into something common
> > > > > and HW transparent.
> > > > > Konstantin
> > > >
> > > > Hi Konstantin,
> > > >
> > > > That's not really similar purpose. This is monitoring a cacheline
> > > > for writes, not waiting on a specific value.
> > >
> > > I understand that.
> > >
> > > > The "expected" value is there
> > > > as basically a hack to get around the race condition due to the fact
> > > > that by the time you enter monitoring state, the write you're
> > > > waiting for may have already happened.
> > >
> > > AFAIK, current rte_wait_until_equal_* does pretty much the same thing:
> > >
> > > LDXR memaddr, $reg  // an address to monitor for if ($reg !=
> > > expected_value)
> > >    SEVL      //     arm monitor
> > >    do {
> > >        WFE     //      waits for write to that memory address
> > >        LDXR memaddr, $reg
> > >    } while ($reg != expected_value);
> > >
> > > Looks pretty similar to what rte_power_monitor() does, except you
> > > don't have a loop for checking the new value.
> > > Plus rte_power_monitor() provides extra options to the user -
> > > timestamp and power save mode to enter.
> > > Also I don't know what is the granularity of such events on ARM, is it
> > > a cache-line or more/less.
> >
> > As I understand it, Granularity is per the cache-line.
> > ie. Load-exclusive(LDXR) followed by WFE will wait in a low-power state until
> > the cache line is written.
> Architecture allows for 16B to 2048B space. Typically, implementations use cache-line granularity.
>
> >
> > But I see UMONITOR bit different, Where _without_ other core signaling to
> > wakeup from wait state, it can wake on TSC expiry. I think, that's is the main
> > primitive on this feature. Right?
> >
> > WFE can also wake based on Timer stream events(kind of TSC in x86
> > analogy) but it has a configuration
> > bit that needs to allow for this scheme in userspace(EL0) or not?
> > defined by EL1(Linux kernel).
> Timer stream events are not per CPU core. They are system wide streams.

We may not need per core support to implement this use case.

I think, currently, kernel configured to have a WFE signal on every
100us.(System-wide).

do while{} loop can check if it is passing the requested timestamp after WFE.
But minimum granularity will be 100us.(i.e 100us worth of ticks)


>
> > I am planning to spend time on this after understanding the value addition of
> > the feature/usecase[1] [1] http://mails.dpdk.org/archives/dev/2020-
> > May/168888.html
> >
> >
> >
> >
> >
> > > Might be ARM people can comment/correct me here.
> > > Konstantin
  

Patch

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..8646c4ac16
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2c5c..b54a2be4f6 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -59,6 +59,7 @@  generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@  arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_cpuflags.h b/lib/librte_eal/x86/include/rte_cpuflags.h
index c1d20364d1..94d6a43763 100644
--- a/lib/librte_eal/x86/include/rte_cpuflags.h
+++ b/lib/librte_eal/x86/include/rte_cpuflags.h
@@ -110,6 +110,7 @@  enum rte_cpu_flag_t {
 	RTE_CPUFLAG_RDTSCP,                 /**< RDTSCP */
 	RTE_CPUFLAG_EM64T,                  /**< EM64T */
 
+	RTE_CPUFLAG_WAITPKG,                 /**< UMINITOR/UMWAIT/TPAUSE */
 	/* (EAX 80000007h) EDX features */
 	RTE_CPUFLAG_INVTSC,                 /**< INVTSC */
 
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..a0522400fb
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,134 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 8-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+	rte_mb();
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	uint64_t rflags;
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
diff --git a/lib/librte_eal/x86/rte_cpuflags.c b/lib/librte_eal/x86/rte_cpuflags.c
index 30439e7951..0325c4b93b 100644
--- a/lib/librte_eal/x86/rte_cpuflags.c
+++ b/lib/librte_eal/x86/rte_cpuflags.c
@@ -110,6 +110,8 @@  const struct feature_entry rte_cpu_feature_table[] = {
 	FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
 	FEAT_DEF(RDSEED, 0x00000007, 0, RTE_REG_EBX, 18)
 
+	FEAT_DEF(WAITPKG, 0x00000007, 0, RTE_REG_ECX, 5)
+
 	FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX,  0)
 	FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX,  4)