[v4,02/10] eal: add power management intrinsics

Message ID 1601647919-25312-2-git-send-email-liang.j.ma@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v4,01/10] eal: add new x86 cpuid support for WAITPKG |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Liang, Ma Oct. 2, 2020, 2:11 p.m. UTC
  Add two new power management intrinsics, and provide an implementation
in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
are implemented as raw byte opcodes because there is not yet widespread
compiler support for these instructions.

The power management instructions provide an architecture-specific
function to either wait until a specified TSC timestamp is reached, or
optionally wait until either a TSC timestamp is reached or a memory
location is written to. The monitor function also provides an optional
comparison, to avoid sleeping when the expected write has already
happened, and no more writes are expected.

For more details, Please reference Intel SDM Volume 2.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
 lib/librte_eal/include/meson.build            |   1 +
 lib/librte_eal/x86/include/meson.build        |   1 +
 .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
 4 files changed, 209 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
 create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
  

Comments

Thomas Monjalon Oct. 8, 2020, 8:33 a.m. UTC | #1
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.
> 
> For more details, Please reference Intel SDM Volume 2.

I really would like to see feedbacks from other arch maintainers.
Unfortunately they were not Cc'ed.

Also please mark the new functions as experimental.
  
Jerin Jacob Oct. 8, 2020, 8:44 a.m. UTC | #2
On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> > Add two new power management intrinsics, and provide an implementation
> > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > are implemented as raw byte opcodes because there is not yet widespread
> > compiler support for these instructions.
> >
> > The power management instructions provide an architecture-specific
> > function to either wait until a specified TSC timestamp is reached, or
> > optionally wait until either a TSC timestamp is reached or a memory
> > location is written to. The monitor function also provides an optional
> > comparison, to avoid sleeping when the expected write has already
> > happened, and no more writes are expected.
> >
> > For more details, Please reference Intel SDM Volume 2.
>
> I really would like to see feedbacks from other arch maintainers.
> Unfortunately they were not Cc'ed.

Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
http://mails.dpdk.org/archives/dev/2020-September/181646.html

> Also please mark the new functions as experimental.
>
>
  
Thomas Monjalon Oct. 8, 2020, 9:41 a.m. UTC | #3
08/10/2020 10:44, Jerin Jacob:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >
> > > Add two new power management intrinsics, and provide an implementation
> > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > are implemented as raw byte opcodes because there is not yet widespread
> > > compiler support for these instructions.
> > >
> > > The power management instructions provide an architecture-specific
> > > function to either wait until a specified TSC timestamp is reached, or
> > > optionally wait until either a TSC timestamp is reached or a memory
> > > location is written to. The monitor function also provides an optional
> > > comparison, to avoid sleeping when the expected write has already
> > > happened, and no more writes are expected.
> > >
> > > For more details, Please reference Intel SDM Volume 2.
> >
> > I really would like to see feedbacks from other arch maintainers.
> > Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html

This comment was sent on September 18.
Later this v4 was sent without replying to the comments.
This is blocking the series.
I am considering this feature as low priority.

> > Also please mark the new functions as experimental.
  
Burakov, Anatoly Oct. 8, 2020, 1:26 p.m. UTC | #4
On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>>> Add two new power management intrinsics, and provide an implementation
>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>> are implemented as raw byte opcodes because there is not yet widespread
>>> compiler support for these instructions.
>>>
>>> The power management instructions provide an architecture-specific
>>> function to either wait until a specified TSC timestamp is reached, or
>>> optionally wait until either a TSC timestamp is reached or a memory
>>> location is written to. The monitor function also provides an optional
>>> comparison, to avoid sleeping when the expected write has already
>>> happened, and no more writes are expected.
>>>
>>> For more details, Please reference Intel SDM Volume 2.
>>
>> I really would like to see feedbacks from other arch maintainers.
>> Unfortunately they were not Cc'ed.
> 
> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> 
>> Also please mark the new functions as experimental.
>>
>>

Hi Jerin,

 > IMO, We must introduce some arch feature-capability _get_ scheme to tell
 > the consumer of this API is only supported on x86. Probably as 
functions[1]
 > or macro flags scheme and have a stub for the other architectures as the
 > API marked as generic ie rte_power_* not rte_x86_..
 >
 > This will help the consumer to create workers based on the 
instruction features
 > which can NOT be abstracted as a generic feature across the 
architectures.

I'm not entirely sure what you mean by that.

I mean, yes, we should have added stubs for other architectures, and we 
will add those in future revisions, but what does your proposed runtime 
check accomplish that cannot currently be done with CPUID flags?

If you look at patch 1 [1], we added CPUID flags that the user can 
check, and in fact this is precisely what we do in patch 4 [2] before 
enabling the UMWAIT path. We could perhaps document this better and 
outline the dependency on the WAITPKG CPUID flag more explicitly, but 
otherwise i don't see how what you're proposing isn't already possible 
to do.

[1] http://patches.dpdk.org/patch/79539/
[2] http://patches.dpdk.org/patch/79540/ , function 
rte_power_pmd_mgmt_queue_enable()
  
Jerin Jacob Oct. 8, 2020, 3:13 p.m. UTC | #5
On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>
> >>> Add two new power management intrinsics, and provide an implementation
> >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>> are implemented as raw byte opcodes because there is not yet widespread
> >>> compiler support for these instructions.
> >>>
> >>> The power management instructions provide an architecture-specific
> >>> function to either wait until a specified TSC timestamp is reached, or
> >>> optionally wait until either a TSC timestamp is reached or a memory
> >>> location is written to. The monitor function also provides an optional
> >>> comparison, to avoid sleeping when the expected write has already
> >>> happened, and no more writes are expected.
> >>>
> >>> For more details, Please reference Intel SDM Volume 2.
> >>
> >> I really would like to see feedbacks from other arch maintainers.
> >> Unfortunately they were not Cc'ed.
> >
> > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >
> >> Also please mark the new functions as experimental.
> >>
> >>
>
> Hi Jerin,

Hi Anatoly,

>
>  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>  > the consumer of this API is only supported on x86. Probably as
> functions[1]
>  > or macro flags scheme and have a stub for the other architectures as the
>  > API marked as generic ie rte_power_* not rte_x86_..
>  >
>  > This will help the consumer to create workers based on the
> instruction features
>  > which can NOT be abstracted as a generic feature across the
> architectures.
>
> I'm not entirely sure what you mean by that.
>
> I mean, yes, we should have added stubs for other architectures, and we
> will add those in future revisions, but what does your proposed runtime
> check accomplish that cannot currently be done with CPUID flags?


RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.



>
> If you look at patch 1 [1], we added CPUID flags that the user can
> check, and in fact this is precisely what we do in patch 4 [2] before
> enabling the UMWAIT path. We could perhaps document this better and
> outline the dependency on the WAITPKG CPUID flag more explicitly, but
> otherwise i don't see how what you're proposing isn't already possible
> to do.
>
> [1] http://patches.dpdk.org/patch/79539/
> [2] http://patches.dpdk.org/patch/79540/ , function
> rte_power_pmd_mgmt_queue_enable()
>
> --
> Thanks,
> Anatoly
  
Ananyev, Konstantin Oct. 8, 2020, 5:07 p.m. UTC | #6
> 
> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> >
> > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > >>
> > >>> Add two new power management intrinsics, and provide an implementation
> > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > >>> are implemented as raw byte opcodes because there is not yet widespread
> > >>> compiler support for these instructions.
> > >>>
> > >>> The power management instructions provide an architecture-specific
> > >>> function to either wait until a specified TSC timestamp is reached, or
> > >>> optionally wait until either a TSC timestamp is reached or a memory
> > >>> location is written to. The monitor function also provides an optional
> > >>> comparison, to avoid sleeping when the expected write has already
> > >>> happened, and no more writes are expected.
> > >>>
> > >>> For more details, Please reference Intel SDM Volume 2.
> > >>
> > >> I really would like to see feedbacks from other arch maintainers.
> > >> Unfortunately they were not Cc'ed.
> > >
> > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > >
> > >> Also please mark the new functions as experimental.
> > >>
> > >>
> >
> > Hi Jerin,
> 
> Hi Anatoly,
> 
> >
> >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >  > the consumer of this API is only supported on x86. Probably as
> > functions[1]
> >  > or macro flags scheme and have a stub for the other architectures as the
> >  > API marked as generic ie rte_power_* not rte_x86_..
> >  >
> >  > This will help the consumer to create workers based on the
> > instruction features
> >  > which can NOT be abstracted as a generic feature across the
> > architectures.
> >
> > I'm not entirely sure what you mean by that.
> >
> > I mean, yes, we should have added stubs for other architectures, and we
> > will add those in future revisions, but what does your proposed runtime
> > check accomplish that cannot currently be done with CPUID flags?
> 
> 
> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.


I am agree with Jerin, that we need some generic way to
figure-out does platform supports power_monitor() or not.
Though not sure do we need to create a new feature-get framework here...
Might be just something like:
 rte_power_monitor(...) == -ENOTSUP
be enough indication for that?
So user can just do:
if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
	/* not supported  path */
}

To check is that feature supported or not.

> >
> > If you look at patch 1 [1], we added CPUID flags that the user can
> > check, and in fact this is precisely what we do in patch 4 [2] before
> > enabling the UMWAIT path. We could perhaps document this better and
> > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > otherwise i don't see how what you're proposing isn't already possible
> > to do.
> >
> > [1] http://patches.dpdk.org/patch/79539/
> > [2] http://patches.dpdk.org/patch/79540/ , function
> > rte_power_pmd_mgmt_queue_enable()
> >
> > --
> > Thanks,
> > Anatoly
  
Ananyev, Konstantin Oct. 8, 2020, 5:15 p.m. UTC | #7
> 
> Add two new power management intrinsics, and provide an implementation
> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> are implemented as raw byte opcodes because there is not yet widespread
> compiler support for these instructions.
> 
> The power management instructions provide an architecture-specific
> function to either wait until a specified TSC timestamp is reached, or
> optionally wait until either a TSC timestamp is reached or a memory
> location is written to. The monitor function also provides an optional
> comparison, to avoid sleeping when the expected write has already
> happened, and no more writes are expected.

I think what this API is missing - a function to wakeup sleeping core.
If user can/should use some system call to achieve that, then at least
it has to be clearly documented, even better some wrapper provided.

> 
> For more details, Please reference Intel SDM Volume 2.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  .../include/generic/rte_power_intrinsics.h    |  64 ++++++++
>  lib/librte_eal/include/meson.build            |   1 +
>  lib/librte_eal/x86/include/meson.build        |   1 +
>  .../x86/include/rte_power_intrinsics.h        | 143 ++++++++++++++++++
>  4 files changed, 209 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_power_intrinsics.h
>  create mode 100644 lib/librte_eal/x86/include/rte_power_intrinsics.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..cd7f8070ac
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_H_
> +#define _RTE_POWER_INTRINSIC_H_
> +
> +#include <inttypes.h>
> +
> +/**
> + * @file
> + * Advanced power management operations.
> + *
> + * This file define APIs for advanced power management,
> + * which are architecture-dependent.
> + */
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp);
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for. Note that the wait behavior is
> + *   architecture-dependent.
> + *
> + * @return
> + *   Architecture-dependent return value.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp);
> +
> +#endif /* _RTE_POWER_INTRINSIC_H_ */
> diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> index cd09027958..3a12e87e19 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -60,6 +60,7 @@ generic_headers = files(
>  	'generic/rte_memcpy.h',
>  	'generic/rte_pause.h',
>  	'generic/rte_prefetch.h',
> +	'generic/rte_power_intrinsics.h',
>  	'generic/rte_rwlock.h',
>  	'generic/rte_spinlock.h',
>  	'generic/rte_ticketlock.h',
> diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
> index f0e998c2fe..494a8142a2 100644
> --- a/lib/librte_eal/x86/include/meson.build
> +++ b/lib/librte_eal/x86/include/meson.build
> @@ -13,6 +13,7 @@ arch_headers = files(
>  	'rte_io.h',
>  	'rte_memcpy.h',
>  	'rte_prefetch.h',
> +	'rte_power_intrinsics.h',
>  	'rte_pause.h',
>  	'rte_rtm.h',
>  	'rte_rwlock.h',
> diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> new file mode 100644
> index 0000000000..6dd1cdc939
> --- /dev/null
> +++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Intel Corporation
> + */
> +
> +#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
> +#define _RTE_POWER_INTRINSIC_X86_64_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_atomic.h>
> +#include <rte_common.h>
> +
> +#include "generic/rte_power_intrinsics.h"
> +
> +/**
> + * Monitor specific address for changes. This will cause the CPU to enter an
> + * architecture-defined optimized power state until either the specified
> + * memory address is written to, or a certain TSC timestamp is reached.
> + *
> + * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
> + * mask is non-zero, the current value pointed to by the `p` pointer will be
> + * checked against the expected value, and if they match, the entering of
> + * optimized power state may be aborted.
> + *
> + * This function uses UMONITOR/UMWAIT instructions. For more information about
> + * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
> + * Developer's Manual.
> + *
> + * @param p
> + *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
> + * @param expected_value
> + *   Before attempting the monitoring, the `p` address may be read and compared
> + *   against this value. If `value_mask` is zero, this step will be skipped.
> + * @param value_mask
> + *   The 64-bit mask to use to extract current value from `p`.
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> +static inline int rte_power_monitor(const volatile void *p,
> +		const uint64_t expected_value, const uint64_t value_mask,
> +		const uint32_t state, const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +	/*
> +	 * we're using raw byte codes for now as only the newest compiler
> +	 * versions support this instruction natively.
> +	 */
> +
> +	/* set address for UMONITOR */
> +	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
> +			:
> +			: "D"(p));
> +
> +	if (value_mask) {
> +		const uint64_t cur_value = *(const volatile uint64_t *)p;
> +		const uint64_t masked = cur_value & value_mask;
> +		/* if the masked value is already matching, abort */
> +		if (masked == expected_value)
> +			return 0;
> +	}
> +	/* execute UMWAIT */
> +	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
> +		/*
> +		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
> +		 * onto the stack, then pop them back into `rflags` so that
> +		 * we can read it.
> +		 */
> +		"pushf;\n"
> +		"pop %0;\n"
> +		: "=r"(rflags)
> +		: "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +/**
> + * Enter an architecture-defined optimized power state until a certain TSC
> + * timestamp is reached.
> + *
> + * This function uses TPAUSE instruction. For more information about its usage,
> + * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
> + * Manual.
> + *
> + * @param state
> + *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
> + *   1 (C0.1).
> + * @param tsc_timestamp
> + *   Maximum TSC timestamp to wait for.
> + *
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to other reasons.
> + */
> +static inline int rte_power_pause(const uint32_t state,
> +		const uint64_t tsc_timestamp)
> +{
> +	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
> +	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
> +	/* the rflags need match native register size */
> +#ifdef RTE_ARCH_I686
> +	uint32_t rflags;
> +#else
> +	uint64_t rflags;
> +#endif
> +
> +	/* execute TPAUSE */
> +	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
> +		     /*
> +		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
> +		      * onto the stack, then pop them back into `rflags` so that
> +		      * we can read it.
> +		      */
> +		     "pushf;\n"
> +		     "pop %0;\n"
> +		     : "=r"(rflags)
> +		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
> +
> +	/* we're interested in the first bit (the carry flag) */
> +	return rflags & 0x1;
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */
> --
> 2.17.1
  
Jerin Jacob Oct. 9, 2020, 5:42 a.m. UTC | #8
On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> >
> > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> > >
> > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > >>
> > > >>> Add two new power management intrinsics, and provide an implementation
> > > >>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > >>> are implemented as raw byte opcodes because there is not yet widespread
> > > >>> compiler support for these instructions.
> > > >>>
> > > >>> The power management instructions provide an architecture-specific
> > > >>> function to either wait until a specified TSC timestamp is reached, or
> > > >>> optionally wait until either a TSC timestamp is reached or a memory
> > > >>> location is written to. The monitor function also provides an optional
> > > >>> comparison, to avoid sleeping when the expected write has already
> > > >>> happened, and no more writes are expected.
> > > >>>
> > > >>> For more details, Please reference Intel SDM Volume 2.
> > > >>
> > > >> I really would like to see feedbacks from other arch maintainers.
> > > >> Unfortunately they were not Cc'ed.
> > > >
> > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > >
> > > >> Also please mark the new functions as experimental.
> > > >>
> > > >>
> > >
> > > Hi Jerin,
> >
> > Hi Anatoly,
> >
> > >
> > >  > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > >  > the consumer of this API is only supported on x86. Probably as
> > > functions[1]
> > >  > or macro flags scheme and have a stub for the other architectures as the
> > >  > API marked as generic ie rte_power_* not rte_x86_..
> > >  >
> > >  > This will help the consumer to create workers based on the
> > > instruction features
> > >  > which can NOT be abstracted as a generic feature across the
> > > architectures.
> > >
> > > I'm not entirely sure what you mean by that.
> > >
> > > I mean, yes, we should have added stubs for other architectures, and we
> > > will add those in future revisions, but what does your proposed runtime
> > > check accomplish that cannot currently be done with CPUID flags?
> >
> >
> > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>
>
> I am agree with Jerin, that we need some generic way to
> figure-out does platform supports power_monitor() or not.
> Though not sure do we need to create a new feature-get framework here...

That's works too. Some means of generic probing is fine. Following
schemed needs
more documentation on that usage, as, it is not straight forward compare to
feature-get framework. Also, on the other thread, we are adding the
new instructions like
demote cacheline etc, maybe if the user wants to KNOW if the arch
supports it then
the feature-get framework is good.
If we think, there is no other usecase for generic arch feature-get
framework then
we can keep the below scheme else generic arch feature is better for
more forward
looking use cases.

> Might be just something like:
>  rte_power_monitor(...) == -ENOTSUP
> be enough indication for that?
> So user can just do:
> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>         /* not supported  path */
> }
>
> To check is that feature supported or not.


>
> > >
> > > If you look at patch 1 [1], we added CPUID flags that the user can
> > > check, and in fact this is precisely what we do in patch 4 [2] before
> > > enabling the UMWAIT path. We could perhaps document this better and
> > > outline the dependency on the WAITPKG CPUID flag more explicitly, but
> > > otherwise i don't see how what you're proposing isn't already possible
> > > to do.
> > >
> > > [1] http://patches.dpdk.org/patch/79539/
> > > [2] http://patches.dpdk.org/patch/79540/ , function
> > > rte_power_pmd_mgmt_queue_enable()
> > >
> > > --
> > > Thanks,
> > > Anatoly
  
Burakov, Anatoly Oct. 9, 2020, 9:11 a.m. UTC | #9
On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>
>> Add two new power management intrinsics, and provide an implementation
>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>> are implemented as raw byte opcodes because there is not yet widespread
>> compiler support for these instructions.
>>
>> The power management instructions provide an architecture-specific
>> function to either wait until a specified TSC timestamp is reached, or
>> optionally wait until either a TSC timestamp is reached or a memory
>> location is written to. The monitor function also provides an optional
>> comparison, to avoid sleeping when the expected write has already
>> happened, and no more writes are expected.
> 
> I think what this API is missing - a function to wakeup sleeping core.
> If user can/should use some system call to achieve that, then at least
> it has to be clearly documented, even better some wrapper provided.

I don't think it's possible to do that without severely overcomplicating 
the intrinsic and its usage, because AFAIK the only way to wake up a 
sleeping core would be to send some kind of interrupt to the core, or 
trigger a write to the cache-line in question.
  
Burakov, Anatoly Oct. 9, 2020, 9:25 a.m. UTC | #10
On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
>>
>>>
>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>
>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>> compiler support for these instructions.
>>>>>>>
>>>>>>> The power management instructions provide an architecture-specific
>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>
>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>> Unfortunately they were not Cc'ed.
>>>>>
>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>
>>>>>> Also please mark the new functions as experimental.
>>>>>>
>>>>>>
>>>>
>>>> Hi Jerin,
>>>
>>> Hi Anatoly,
>>>
>>>>
>>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>   > the consumer of this API is only supported on x86. Probably as
>>>> functions[1]
>>>>   > or macro flags scheme and have a stub for the other architectures as the
>>>>   > API marked as generic ie rte_power_* not rte_x86_..
>>>>   >
>>>>   > This will help the consumer to create workers based on the
>>>> instruction features
>>>>   > which can NOT be abstracted as a generic feature across the
>>>> architectures.
>>>>
>>>> I'm not entirely sure what you mean by that.
>>>>
>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>> will add those in future revisions, but what does your proposed runtime
>>>> check accomplish that cannot currently be done with CPUID flags?
>>>
>>>
>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>
>>
>> I am agree with Jerin, that we need some generic way to
>> figure-out does platform supports power_monitor() or not.
>> Though not sure do we need to create a new feature-get framework here...
> 
> That's works too. Some means of generic probing is fine. Following
> schemed needs
> more documentation on that usage, as, it is not straight forward compare to
> feature-get framework. Also, on the other thread, we are adding the
> new instructions like
> demote cacheline etc, maybe if the user wants to KNOW if the arch
> supports it then
> the feature-get framework is good.
> If we think, there is no other usecase for generic arch feature-get
> framework then
> we can keep the below scheme else generic arch feature is better for
> more forward
> looking use cases.
> 
>> Might be just something like:
>>   rte_power_monitor(...) == -ENOTSUP
>> be enough indication for that?
>> So user can just do:
>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>          /* not supported  path */
>> }
>>
>> To check is that feature supported or not.
> 
> 

Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think 
we can safely make this intrinsic as a noop on other archs as well, as 
it's functionally identical to waking up immediately.

If we're not creating this for CLDEMOTE, we don't need it here as well. 
If we do need it for this, then we arguably need it for CLDEMOTE too.

>>
>>>>
>>>> If you look at patch 1 [1], we added CPUID flags that the user can
>>>> check, and in fact this is precisely what we do in patch 4 [2] before
>>>> enabling the UMWAIT path. We could perhaps document this better and
>>>> outline the dependency on the WAITPKG CPUID flag more explicitly, but
>>>> otherwise i don't see how what you're proposing isn't already possible
>>>> to do.
>>>>
>>>> [1] http://patches.dpdk.org/patch/79539/
>>>> [2] http://patches.dpdk.org/patch/79540/ , function
>>>> rte_power_pmd_mgmt_queue_enable()
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
  
Thomas Monjalon Oct. 9, 2020, 9:29 a.m. UTC | #11
09/10/2020 11:25, Burakov, Anatoly:
> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > <konstantin.ananyev@intel.com> wrote:
> >>
> >>>
> >>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>>
> >>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>
> >>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>> compiler support for these instructions.
> >>>>>>>
> >>>>>>> The power management instructions provide an architecture-specific
> >>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>> happened, and no more writes are expected.
> >>>>>>>
> >>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>
> >>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>> Unfortunately they were not Cc'ed.
> >>>>>
> >>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>
> >>>>>> Also please mark the new functions as experimental.
> >>>>>>
> >>>>>>
> >>>>
> >>>> Hi Jerin,
> >>>
> >>> Hi Anatoly,
> >>>
> >>>>
> >>>>   > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>   > the consumer of this API is only supported on x86. Probably as
> >>>> functions[1]
> >>>>   > or macro flags scheme and have a stub for the other architectures as the
> >>>>   > API marked as generic ie rte_power_* not rte_x86_..
> >>>>   >
> >>>>   > This will help the consumer to create workers based on the
> >>>> instruction features
> >>>>   > which can NOT be abstracted as a generic feature across the
> >>>> architectures.
> >>>>
> >>>> I'm not entirely sure what you mean by that.
> >>>>
> >>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>> will add those in future revisions, but what does your proposed runtime
> >>>> check accomplish that cannot currently be done with CPUID flags?
> >>>
> >>>
> >>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>
> >>
> >> I am agree with Jerin, that we need some generic way to
> >> figure-out does platform supports power_monitor() or not.
> >> Though not sure do we need to create a new feature-get framework here...
> > 
> > That's works too. Some means of generic probing is fine. Following
> > schemed needs
> > more documentation on that usage, as, it is not straight forward compare to
> > feature-get framework. Also, on the other thread, we are adding the
> > new instructions like
> > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > supports it then
> > the feature-get framework is good.
> > If we think, there is no other usecase for generic arch feature-get
> > framework then
> > we can keep the below scheme else generic arch feature is better for
> > more forward
> > looking use cases.
> > 
> >> Might be just something like:
> >>   rte_power_monitor(...) == -ENOTSUP
> >> be enough indication for that?
> >> So user can just do:
> >> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>          /* not supported  path */
> >> }
> >>
> >> To check is that feature supported or not.
> > 
> > 
> 
> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think 
> we can safely make this intrinsic as a noop on other archs as well, as 
> it's functionally identical to waking up immediately.
> 
> If we're not creating this for CLDEMOTE, we don't need it here as well. 
> If we do need it for this, then we arguably need it for CLDEMOTE too.

Sorry I don't understand what you mean, too many "it" and "this" :)
  
Burakov, Anatoly Oct. 9, 2020, 9:40 a.m. UTC | #12
On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> 09/10/2020 11:25, Burakov, Anatoly:
>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>> <konstantin.ananyev@intel.com> wrote:
>>>>
>>>>>
>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>
>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>
>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>
>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>
>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>
>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>
>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> Hi Jerin,
>>>>>
>>>>> Hi Anatoly,
>>>>>
>>>>>>
>>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>    > the consumer of this API is only supported on x86. Probably as
>>>>>> functions[1]
>>>>>>    > or macro flags scheme and have a stub for the other architectures as the
>>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>    >
>>>>>>    > This will help the consumer to create workers based on the
>>>>>> instruction features
>>>>>>    > which can NOT be abstracted as a generic feature across the
>>>>>> architectures.
>>>>>>
>>>>>> I'm not entirely sure what you mean by that.
>>>>>>
>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>
>>>>>
>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>
>>>>
>>>> I am agree with Jerin, that we need some generic way to
>>>> figure-out does platform supports power_monitor() or not.
>>>> Though not sure do we need to create a new feature-get framework here...
>>>
>>> That's works too. Some means of generic probing is fine. Following
>>> schemed needs
>>> more documentation on that usage, as, it is not straight forward compare to
>>> feature-get framework. Also, on the other thread, we are adding the
>>> new instructions like
>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>> supports it then
>>> the feature-get framework is good.
>>> If we think, there is no other usecase for generic arch feature-get
>>> framework then
>>> we can keep the below scheme else generic arch feature is better for
>>> more forward
>>> looking use cases.
>>>
>>>> Might be just something like:
>>>>    rte_power_monitor(...) == -ENOTSUP
>>>> be enough indication for that?
>>>> So user can just do:
>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>           /* not supported  path */
>>>> }
>>>>
>>>> To check is that feature supported or not.
>>>
>>>
>>
>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>> we can safely make this intrinsic as a noop on other archs as well, as
>> it's functionally identical to waking up immediately.
>>
>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> 
> Sorry I don't understand what you mean, too many "it" and "this" :)
> 

Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't 
exist on other archs, this doesn't too, so it's a fairly similar 
situation. Stubbing UMWAIT with a noop is a valid approach because it's 
equivalent to sleeping and then immediately waking up (which can happen 
for a host of reasons unrelated to the code itself).

I'm not against a generic feature-get framework, i'm just pointing out 
that if this is what's preventing the merge, it should prevent the merge 
of CLDEMOTE as well, yet Jerin has acked that one and has explicitly 
stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
  
Jerin Jacob Oct. 9, 2020, 9:54 a.m. UTC | #13
On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > 09/10/2020 11:25, Burakov, Anatoly:
> >> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>> <konstantin.ananyev@intel.com> wrote:
> >>>>
> >>>>>
> >>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>
> >>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>
> >>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>
> >>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>
> >>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>
> >>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>
> >>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>> Hi Jerin,
> >>>>>
> >>>>> Hi Anatoly,
> >>>>>
> >>>>>>
> >>>>>>    > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>    > the consumer of this API is only supported on x86. Probably as
> >>>>>> functions[1]
> >>>>>>    > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>    > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>    >
> >>>>>>    > This will help the consumer to create workers based on the
> >>>>>> instruction features
> >>>>>>    > which can NOT be abstracted as a generic feature across the
> >>>>>> architectures.
> >>>>>>
> >>>>>> I'm not entirely sure what you mean by that.
> >>>>>>
> >>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>
> >>>>>
> >>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>
> >>>>
> >>>> I am agree with Jerin, that we need some generic way to
> >>>> figure-out does platform supports power_monitor() or not.
> >>>> Though not sure do we need to create a new feature-get framework here...
> >>>
> >>> That's works too. Some means of generic probing is fine. Following
> >>> schemed needs
> >>> more documentation on that usage, as, it is not straight forward compare to
> >>> feature-get framework. Also, on the other thread, we are adding the
> >>> new instructions like
> >>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>> supports it then
> >>> the feature-get framework is good.
> >>> If we think, there is no other usecase for generic arch feature-get
> >>> framework then
> >>> we can keep the below scheme else generic arch feature is better for
> >>> more forward
> >>> looking use cases.
> >>>
> >>>> Might be just something like:
> >>>>    rte_power_monitor(...) == -ENOTSUP
> >>>> be enough indication for that?
> >>>> So user can just do:
> >>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>           /* not supported  path */
> >>>> }
> >>>>
> >>>> To check is that feature supported or not.
> >>>
> >>>
> >>
> >> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >> we can safely make this intrinsic as a noop on other archs as well, as
> >> it's functionally identical to waking up immediately.
> >>
> >> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >
> > Sorry I don't understand what you mean, too many "it" and "this" :)
> >
>
> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> exist on other archs, this doesn't too, so it's a fairly similar
> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> equivalent to sleeping and then immediately waking up (which can happen
> for a host of reasons unrelated to the code itself).

If we are keeping the following return in the public API then it can not be NOP
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */

Also, we need to fix compilation issue if any with
http://patches.dpdk.org/patch/79540/
as it has direct reference to if
(!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
Either we need to add -ENOTSUP return or generic feature-get framework.


>
> I'm not against a generic feature-get framework, i'm just pointing out
> that if this is what's preventing the merge, it should prevent the merge
> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 9, 2020, 10:03 a.m. UTC | #14
On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>>
>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>
>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>
>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>
>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>
>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jerin,
>>>>>>>
>>>>>>> Hi Anatoly,
>>>>>>>
>>>>>>>>
>>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>     > the consumer of this API is only supported on x86. Probably as
>>>>>>>> functions[1]
>>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>     >
>>>>>>>>     > This will help the consumer to create workers based on the
>>>>>>>> instruction features
>>>>>>>>     > which can NOT be abstracted as a generic feature across the
>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>
>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>
>>>>>>>
>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>
>>>>>>
>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>
>>>>> That's works too. Some means of generic probing is fine. Following
>>>>> schemed needs
>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>> new instructions like
>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>> supports it then
>>>>> the feature-get framework is good.
>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>> framework then
>>>>> we can keep the below scheme else generic arch feature is better for
>>>>> more forward
>>>>> looking use cases.
>>>>>
>>>>>> Might be just something like:
>>>>>>     rte_power_monitor(...) == -ENOTSUP
>>>>>> be enough indication for that?
>>>>>> So user can just do:
>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>            /* not supported  path */
>>>>>> }
>>>>>>
>>>>>> To check is that feature supported or not.
>>>>>
>>>>>
>>>>
>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>> it's functionally identical to waking up immediately.
>>>>
>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>
>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>
>>
>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>> exist on other archs, this doesn't too, so it's a fairly similar
>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>> equivalent to sleeping and then immediately waking up (which can happen
>> for a host of reasons unrelated to the code itself).
> 
> If we are keeping the following return in the public API then it can not be NOP
> + * @return
> + *   - 1 if wakeup was due to TSC timeout expiration.
> + *   - 0 if wakeup was due to memory write or other reasons.
> + */
> 

In the generic header, it is specified that return value is 
implementation-defined (i.e. arch-specific). I guess we could remove 
that and set return value to either 0 or -ENOTSUP if that would resolve 
the issue?

> Also, we need to fix compilation issue if any with
> http://patches.dpdk.org/patch/79540/
> as it has direct reference to if
> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> Either we need to add -ENOTSUP return or generic feature-get framework.

IIRC power library isn't compiled on anything other than x86, so this 
code wouldn't get compiled.

> 
> 
>>
>> I'm not against a generic feature-get framework, i'm just pointing out
>> that if this is what's preventing the merge, it should prevent the merge
>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>
>> --
>> Thanks,
>> Anatoly
  
Thomas Monjalon Oct. 9, 2020, 10:17 a.m. UTC | #15
09/10/2020 12:03, Burakov, Anatoly:
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> > 
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> > 
> 
> In the generic header, it is specified that return value is 
> implementation-defined (i.e. arch-specific).

Obviously an API definition should *never* be "implementation-defined".


> I guess we could remove 
> that and set return value to either 0 or -ENOTSUP if that would resolve 
> the issue?
> 
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
> 
> IIRC power library isn't compiled on anything other than x86, so this 
> code wouldn't get compiled.

It is not call "power-x86", so we must assume it could work
on any architecture.


> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.

CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
that's why the expectations may be different.
  
Jerin Jacob Oct. 9, 2020, 10:19 a.m. UTC | #16
On Fri, Oct 9, 2020 at 3:33 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>
> >>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>
> >>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>
> >>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>
> >>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>
> >>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jerin,
> >>>>>>>
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>>>
> >>>>>>>>     > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>     > the consumer of this API is only supported on x86. Probably as
> >>>>>>>> functions[1]
> >>>>>>>>     > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>     > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>     >
> >>>>>>>>     > This will help the consumer to create workers based on the
> >>>>>>>> instruction features
> >>>>>>>>     > which can NOT be abstracted as a generic feature across the
> >>>>>>>> architectures.
> >>>>>>>>
> >>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>
> >>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>
> >>>>>>>
> >>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>
> >>>>>>
> >>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>
> >>>>> That's works too. Some means of generic probing is fine. Following
> >>>>> schemed needs
> >>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>> new instructions like
> >>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>> supports it then
> >>>>> the feature-get framework is good.
> >>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>> framework then
> >>>>> we can keep the below scheme else generic arch feature is better for
> >>>>> more forward
> >>>>> looking use cases.
> >>>>>
> >>>>>> Might be just something like:
> >>>>>>     rte_power_monitor(...) == -ENOTSUP
> >>>>>> be enough indication for that?
> >>>>>> So user can just do:
> >>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>            /* not supported  path */
> >>>>>> }
> >>>>>>
> >>>>>> To check is that feature supported or not.
> >>>>>
> >>>>>
> >>>>
> >>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>> it's functionally identical to waking up immediately.
> >>>>
> >>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>
> >>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>
> >>
> >> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >> exist on other archs, this doesn't too, so it's a fairly similar
> >> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >> equivalent to sleeping and then immediately waking up (which can happen
> >> for a host of reasons unrelated to the code itself).
> >
> > If we are keeping the following return in the public API then it can not be NOP
> > + * @return
> > + *   - 1 if wakeup was due to TSC timeout expiration.
> > + *   - 0 if wakeup was due to memory write or other reasons.
> > + */
> >
>
> In the generic header, it is specified that return value is
> implementation-defined (i.e. arch-specific). I guess we could remove
> that and set return value to either 0 or -ENOTSUP if that would resolve
> the issue?

returning -ENOTSUP should be OK if we don't  to take generic
feature-get framework  path.

>
> > Also, we need to fix compilation issue if any with
> > http://patches.dpdk.org/patch/79540/
> > as it has direct reference to if
> > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > Either we need to add -ENOTSUP return or generic feature-get framework.
>
> IIRC power library isn't compiled on anything other than x86, so this
> code wouldn't get compiled.

Just checked now, librte_power compiles at least for arm64.


>
> >
> >
> >>
> >> I'm not against a generic feature-get framework, i'm just pointing out
> >> that if this is what's preventing the merge, it should prevent the merge
> >> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >>
> >> --
> >> Thanks,
> >> Anatoly
>
>
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 9, 2020, 10:22 a.m. UTC | #17
On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> 09/10/2020 12:03, Burakov, Anatoly:
>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>
>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>
>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>
>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Jerin,
>>>>>>>>>
>>>>>>>>> Hi Anatoly,
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>> functions[1]
>>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>      >
>>>>>>>>>>      > This will help the consumer to create workers based on the
>>>>>>>>>> instruction features
>>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
>>>>>>>>>> architectures.
>>>>>>>>>>
>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>
>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>
>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>> schemed needs
>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>> new instructions like
>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>> supports it then
>>>>>>> the feature-get framework is good.
>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>> framework then
>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>> more forward
>>>>>>> looking use cases.
>>>>>>>
>>>>>>>> Might be just something like:
>>>>>>>>      rte_power_monitor(...) == -ENOTSUP
>>>>>>>> be enough indication for that?
>>>>>>>> So user can just do:
>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>             /* not supported  path */
>>>>>>>> }
>>>>>>>>
>>>>>>>> To check is that feature supported or not.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>> it's functionally identical to waking up immediately.
>>>>>>
>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>
>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>
>>>>
>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>> for a host of reasons unrelated to the code itself).
>>>
>>> If we are keeping the following return in the public API then it can not be NOP
>>> + * @return
>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>> + */
>>>
>>
>> In the generic header, it is specified that return value is
>> implementation-defined (i.e. arch-specific).
> 
> Obviously an API definition should *never* be "implementation-defined".

If there isn't a meaningful return value, we could either make it a 
void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid 
result for a UMWAIT, and there are no side-effects to the intrinsic 
itself (it's basically a fancy rte_pause).

> 
> 
>> I guess we could remove
>> that and set return value to either 0 or -ENOTSUP if that would resolve
>> the issue?
>>
>>> Also, we need to fix compilation issue if any with
>>> http://patches.dpdk.org/patch/79540/
>>> as it has direct reference to if
>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>
>> IIRC power library isn't compiled on anything other than x86, so this
>> code wouldn't get compiled.
> 
> It is not call "power-x86", so we must assume it could work
> on any architecture.

#ifdef it is!

> 
> 
>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>> that if this is what's preventing the merge, it should prevent the merge
>>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> 
> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> that's why the expectations may be different.
> 

UMWAIT is a best-effort mechanism with no side-effects. It's perfectly 
legal for a UMWAIT to not sleep at all, thus rendering it effectively a 
noop. So i don't think it's all that different.
  
Jerin Jacob Oct. 9, 2020, 10:45 a.m. UTC | #18
On Fri, Oct 9, 2020 at 3:53 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
>
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
>
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
>
> #ifdef it is!
>
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
>
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.

If a platform does not support UMWAIT in ALL case IMO, no consumer takes this
the path for power saving. So IMO, t is different than CLDEMOTE

>
> --
> Thanks,
> Anatoly
  
Ananyev, Konstantin Oct. 9, 2020, 10:48 a.m. UTC | #19
> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > 09/10/2020 12:03, Burakov, Anatoly:
> >> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> >>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> >>> <anatoly.burakov@intel.com> wrote:
> >>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> >>>>> 09/10/2020 11:25, Burakov, Anatoly:
> >>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> >>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> >>>>>>> <konstantin.ananyev@intel.com> wrote:
> >>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> >>>>>>>>> <anatoly.burakov@intel.com> wrote:
> >>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> >>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>>>>>>>>> compiler support for these instructions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>>>>>>>>> location is written to. The monitor function also provides an optional
> >>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
> >>>>>>>>>>>> Unfortunately they were not Cc'ed.
> >>>>>>>>>>>
> >>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> >>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
> >>>>>>>>>>>
> >>>>>>>>>>>> Also please mark the new functions as experimental.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Jerin,
> >>>>>>>>>
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>      > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> >>>>>>>>>>      > the consumer of this API is only supported on x86. Probably as
> >>>>>>>>>> functions[1]
> >>>>>>>>>>      > or macro flags scheme and have a stub for the other architectures as the
> >>>>>>>>>>      > API marked as generic ie rte_power_* not rte_x86_..
> >>>>>>>>>>      >
> >>>>>>>>>>      > This will help the consumer to create workers based on the
> >>>>>>>>>> instruction features
> >>>>>>>>>>      > which can NOT be abstracted as a generic feature across the
> >>>>>>>>>> architectures.
> >>>>>>>>>>
> >>>>>>>>>> I'm not entirely sure what you mean by that.
> >>>>>>>>>>
> >>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
> >>>>>>>>>> will add those in future revisions, but what does your proposed runtime
> >>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> >>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> >>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> >>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am agree with Jerin, that we need some generic way to
> >>>>>>>> figure-out does platform supports power_monitor() or not.
> >>>>>>>> Though not sure do we need to create a new feature-get framework here...
> >>>>>>>
> >>>>>>> That's works too. Some means of generic probing is fine. Following
> >>>>>>> schemed needs
> >>>>>>> more documentation on that usage, as, it is not straight forward compare to
> >>>>>>> feature-get framework. Also, on the other thread, we are adding the
> >>>>>>> new instructions like
> >>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
> >>>>>>> supports it then
> >>>>>>> the feature-get framework is good.
> >>>>>>> If we think, there is no other usecase for generic arch feature-get
> >>>>>>> framework then
> >>>>>>> we can keep the below scheme else generic arch feature is better for
> >>>>>>> more forward
> >>>>>>> looking use cases.
> >>>>>>>
> >>>>>>>> Might be just something like:
> >>>>>>>>      rte_power_monitor(...) == -ENOTSUP
> >>>>>>>> be enough indication for that?
> >>>>>>>> So user can just do:
> >>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> >>>>>>>>             /* not supported  path */
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> To check is that feature supported or not.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> >>>>>> we can safely make this intrinsic as a noop on other archs as well, as
> >>>>>> it's functionally identical to waking up immediately.
> >>>>>>
> >>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
> >>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
> >>>>>
> >>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
> >>>>>
> >>>>
> >>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> >>>> exist on other archs, this doesn't too, so it's a fairly similar
> >>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
> >>>> equivalent to sleeping and then immediately waking up (which can happen
> >>>> for a host of reasons unrelated to the code itself).
> >>>
> >>> If we are keeping the following return in the public API then it can not be NOP
> >>> + * @return
> >>> + *   - 1 if wakeup was due to TSC timeout expiration.
> >>> + *   - 0 if wakeup was due to memory write or other reasons.
> >>> + */
> >>>
> >>
> >> In the generic header, it is specified that return value is
> >> implementation-defined (i.e. arch-specific).
> >
> > Obviously an API definition should *never* be "implementation-defined".
> 
> If there isn't a meaningful return value, we could either make it a
> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> result for a UMWAIT, and there are no side-effects to the intrinsic
> itself (it's basically a fancy rte_pause).
> 
> >
> >
> >> I guess we could remove
> >> that and set return value to either 0 or -ENOTSUP if that would resolve
> >> the issue?
> >>
> >>> Also, we need to fix compilation issue if any with
> >>> http://patches.dpdk.org/patch/79540/
> >>> as it has direct reference to if
> >>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> >>> Either we need to add -ENOTSUP return or generic feature-get framework.
> >>
> >> IIRC power library isn't compiled on anything other than x86, so this
> >> code wouldn't get compiled.
> >
> > It is not call "power-x86", so we must assume it could work
> > on any architecture.
> 
> #ifdef it is!
> 
> >
> >
> >>>> I'm not against a generic feature-get framework, i'm just pointing out
> >>>> that if this is what's preventing the merge, it should prevent the merge
> >>>> of CLDEMOTE as well

I wouldn't consider these two as totally equal.
Yes, both are just hints to CPU, that can be ignored,
but if not, then consequences of executing are quite different.
If UMWAIT is not supported by cpu at all, then user might want to use some  
different power saving mechanism (pause, frequence scaling, etc.).
Without information is UMWAIT supported or not, user can't make
such choice.  

>, yet Jerin has acked that one and has explicitly
> >>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
> >
> > CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
> > that's why the expectations may be different.
> >
> 
> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
> noop. So i don't think it's all that different.
  
Burakov, Anatoly Oct. 9, 2020, 11:12 a.m. UTC | #20
On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>       > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>       > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>       > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>       >
>>>>>>>>>>>>       > This will help the consumer to create workers based on the
>>>>>>>>>>>> instruction features
>>>>>>>>>>>>       > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>> architectures.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>
>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>
>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>> schemed needs
>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>> new instructions like
>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>> supports it then
>>>>>>>>> the feature-get framework is good.
>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>> framework then
>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>> more forward
>>>>>>>>> looking use cases.
>>>>>>>>>
>>>>>>>>>> Might be just something like:
>>>>>>>>>>       rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>> be enough indication for that?
>>>>>>>>>> So user can just do:
>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>              /* not supported  path */
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>
>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>
>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>
>>>>>>
>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>> for a host of reasons unrelated to the code itself).
>>>>>
>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>> + * @return
>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>> + */
>>>>>
>>>>
>>>> In the generic header, it is specified that return value is
>>>> implementation-defined (i.e. arch-specific).
>>>
>>> Obviously an API definition should *never* be "implementation-defined".
>>
>> If there isn't a meaningful return value, we could either make it a
>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>> result for a UMWAIT, and there are no side-effects to the intrinsic
>> itself (it's basically a fancy rte_pause).
>>
>>>
>>>
>>>> I guess we could remove
>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>> the issue?
>>>>
>>>>> Also, we need to fix compilation issue if any with
>>>>> http://patches.dpdk.org/patch/79540/
>>>>> as it has direct reference to if
>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>
>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>> code wouldn't get compiled.
>>>
>>> It is not call "power-x86", so we must assume it could work
>>> on any architecture.
>>
>> #ifdef it is!
>>
>>>
>>>
>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>> of CLDEMOTE as well
> 
> I wouldn't consider these two as totally equal.
> Yes, both are just hints to CPU, that can be ignored,
> but if not, then consequences of executing are quite different.
> If UMWAIT is not supported by cpu at all, then user might want to use some
> different power saving mechanism (pause, frequence scaling, etc.).
> Without information is UMWAIT supported or not, user can't make
> such choice.

After some attempts at implementing this, i actually came to the 
conclusion that some generic way to check support for this feature is 
necessary, because we end up with a usability inconsistency:

1) on non-x86, if you call the function, it returns -ENOTSUP
2) on x86, since we're not checking CPUID flags on every single call, 
it'll either succeed, or crash the process - the burden is on the user 
to check for CPUID flags, but it can't be done in an arch agnostic way 
because the CPUID flags are only defined for x86, thus requiring a 
special code path for x86

Where would be the best place to add such an infrastructure? I'm 
thinking rte_cpuflags.h?

> 
>> , yet Jerin has acked that one and has explicitly
>>>>>> stated that he's OK with leaving CLDEMOTE as a noop on other architectures.
>>>
>>> CLDEMOTE is used for optimization, while UMWAIT can be used in a logic,
>>> that's why the expectations may be different.
>>>
>>
>> UMWAIT is a best-effort mechanism with no side-effects. It's perfectly
>> legal for a UMWAIT to not sleep at all, thus rendering it effectively a
>> noop. So i don't think it's all that different.
> 
> 
> 
>
  
Bruce Richardson Oct. 9, 2020, 11:36 a.m. UTC | #21
On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
> > > On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
> > > > 09/10/2020 12:03, Burakov, Anatoly:
> > > > > On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
> > > > > > On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
> > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
> > > > > > > > 09/10/2020 11:25, Burakov, Anatoly:
> > > > > > > > > On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
> > > > > > > > > > On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
> > > > > > > > > > <konstantin.ananyev@intel.com> wrote:
> > > > > > > > > > > > On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
> > > > > > > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > > > > > > On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
> > > > > > > > > > > > > > On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Add two new power management intrinsics, and provide an implementation
> > > > > > > > > > > > > > > > in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> > > > > > > > > > > > > > > > are implemented as raw byte opcodes because there is not yet widespread
> > > > > > > > > > > > > > > > compiler support for these instructions.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The power management instructions provide an architecture-specific
> > > > > > > > > > > > > > > > function to either wait until a specified TSC timestamp is reached, or
> > > > > > > > > > > > > > > > optionally wait until either a TSC timestamp is reached or a memory
> > > > > > > > > > > > > > > > location is written to. The monitor function also provides an optional
> > > > > > > > > > > > > > > > comparison, to avoid sleeping when the expected write has already
> > > > > > > > > > > > > > > > happened, and no more writes are expected.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > For more details, Please reference Intel SDM Volume 2.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I really would like to see feedbacks from other arch maintainers.
> > > > > > > > > > > > > > > Unfortunately they were not Cc'ed.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
> > > > > > > > > > > > > > http://mails.dpdk.org/archives/dev/2020-September/181646.html
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Also please mark the new functions as experimental.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hi Jerin,
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Anatoly,
> > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >       > IMO, We must introduce some arch feature-capability _get_ scheme to tell
> > > > > > > > > > > > >       > the consumer of this API is only supported on x86. Probably as
> > > > > > > > > > > > > functions[1]
> > > > > > > > > > > > >       > or macro flags scheme and have a stub for the other architectures as the
> > > > > > > > > > > > >       > API marked as generic ie rte_power_* not rte_x86_..
> > > > > > > > > > > > >       >
> > > > > > > > > > > > >       > This will help the consumer to create workers based on the
> > > > > > > > > > > > > instruction features
> > > > > > > > > > > > >       > which can NOT be abstracted as a generic feature across the
> > > > > > > > > > > > > architectures.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'm not entirely sure what you mean by that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I mean, yes, we should have added stubs for other architectures, and we
> > > > > > > > > > > > > will add those in future revisions, but what does your proposed runtime
> > > > > > > > > > > > > check accomplish that cannot currently be done with CPUID flags?
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
> > > > > > > > > > > > i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
> > > > > > > > > > > > and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
> > > > > > > > > > > > I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I am agree with Jerin, that we need some generic way to
> > > > > > > > > > > figure-out does platform supports power_monitor() or not.
> > > > > > > > > > > Though not sure do we need to create a new feature-get framework here...
> > > > > > > > > > 
> > > > > > > > > > That's works too. Some means of generic probing is fine. Following
> > > > > > > > > > schemed needs
> > > > > > > > > > more documentation on that usage, as, it is not straight forward compare to
> > > > > > > > > > feature-get framework. Also, on the other thread, we are adding the
> > > > > > > > > > new instructions like
> > > > > > > > > > demote cacheline etc, maybe if the user wants to KNOW if the arch
> > > > > > > > > > supports it then
> > > > > > > > > > the feature-get framework is good.
> > > > > > > > > > If we think, there is no other usecase for generic arch feature-get
> > > > > > > > > > framework then
> > > > > > > > > > we can keep the below scheme else generic arch feature is better for
> > > > > > > > > > more forward
> > > > > > > > > > looking use cases.
> > > > > > > > > > 
> > > > > > > > > > > Might be just something like:
> > > > > > > > > > >       rte_power_monitor(...) == -ENOTSUP
> > > > > > > > > > > be enough indication for that?
> > > > > > > > > > > So user can just do:
> > > > > > > > > > > if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
> > > > > > > > > > >              /* not supported  path */
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > To check is that feature supported or not.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
> > > > > > > > > we can safely make this intrinsic as a noop on other archs as well, as
> > > > > > > > > it's functionally identical to waking up immediately.
> > > > > > > > > 
> > > > > > > > > If we're not creating this for CLDEMOTE, we don't need it here as well.
> > > > > > > > > If we do need it for this, then we arguably need it for CLDEMOTE too.
> > > > > > > > 
> > > > > > > > Sorry I don't understand what you mean, too many "it" and "this" :)
> > > > > > > > 
> > > > > > > 
> > > > > > > Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
> > > > > > > exist on other archs, this doesn't too, so it's a fairly similar
> > > > > > > situation. Stubbing UMWAIT with a noop is a valid approach because it's
> > > > > > > equivalent to sleeping and then immediately waking up (which can happen
> > > > > > > for a host of reasons unrelated to the code itself).
> > > > > > 
> > > > > > If we are keeping the following return in the public API then it can not be NOP
> > > > > > + * @return
> > > > > > + *   - 1 if wakeup was due to TSC timeout expiration.
> > > > > > + *   - 0 if wakeup was due to memory write or other reasons.
> > > > > > + */
> > > > > > 
> > > > > 
> > > > > In the generic header, it is specified that return value is
> > > > > implementation-defined (i.e. arch-specific).
> > > > 
> > > > Obviously an API definition should *never* be "implementation-defined".
> > > 
> > > If there isn't a meaningful return value, we could either make it a
> > > void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
> > > result for a UMWAIT, and there are no side-effects to the intrinsic
> > > itself (it's basically a fancy rte_pause).
> > > 
> > > > 
> > > > 
> > > > > I guess we could remove
> > > > > that and set return value to either 0 or -ENOTSUP if that would resolve
> > > > > the issue?
> > > > > 
> > > > > > Also, we need to fix compilation issue if any with
> > > > > > http://patches.dpdk.org/patch/79540/
> > > > > > as it has direct reference to if
> > > > > > (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
> > > > > > Either we need to add -ENOTSUP return or generic feature-get framework.
> > > > > 
> > > > > IIRC power library isn't compiled on anything other than x86, so this
> > > > > code wouldn't get compiled.
> > > > 
> > > > It is not call "power-x86", so we must assume it could work
> > > > on any architecture.
> > > 
> > > #ifdef it is!
> > > 
> > > > 
> > > > 
> > > > > > > I'm not against a generic feature-get framework, i'm just pointing out
> > > > > > > that if this is what's preventing the merge, it should prevent the merge
> > > > > > > of CLDEMOTE as well
> > 
> > I wouldn't consider these two as totally equal.
> > Yes, both are just hints to CPU, that can be ignored,
> > but if not, then consequences of executing are quite different.
> > If UMWAIT is not supported by cpu at all, then user might want to use some
> > different power saving mechanism (pause, frequence scaling, etc.).
> > Without information is UMWAIT supported or not, user can't make
> > such choice.
> 
> After some attempts at implementing this, i actually came to the conclusion
> that some generic way to check support for this feature is necessary,
> because we end up with a usability inconsistency:
> 
> 1) on non-x86, if you call the function, it returns -ENOTSUP
> 2) on x86, since we're not checking CPUID flags on every single call, it'll
> either succeed, or crash the process - the burden is on the user to check
> for CPUID flags, but it can't be done in an arch agnostic way because the
> CPUID flags are only defined for x86, thus requiring a special code path for
> x86
> 
> Where would be the best place to add such an infrastructure? I'm thinking
> rte_cpuflags.h?
> 
Time to relook at some of the contents of patchset:
http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*

The idea of that set (IIRC) was to replace the per-architecture enums with
just strings to avoid situations like this - or at least make them less
awkward.

/Bruce
  
Burakov, Anatoly Oct. 9, 2020, 11:42 a.m. UTC | #22
On 09-Oct-20 12:36 PM, Bruce Richardson wrote:
> On Fri, Oct 09, 2020 at 12:12:56PM +0100, Burakov, Anatoly wrote:
>> On 09-Oct-20 11:48 AM, Ananyev, Konstantin wrote:
>>>> On 09-Oct-20 11:17 AM, Thomas Monjalon wrote:
>>>>> 09/10/2020 12:03, Burakov, Anatoly:
>>>>>> On 09-Oct-20 10:54 AM, Jerin Jacob wrote:
>>>>>>> On Fri, Oct 9, 2020 at 3:10 PM Burakov, Anatoly
>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>> On 09-Oct-20 10:29 AM, Thomas Monjalon wrote:
>>>>>>>>> 09/10/2020 11:25, Burakov, Anatoly:
>>>>>>>>>> On 09-Oct-20 6:42 AM, Jerin Jacob wrote:
>>>>>>>>>>> On Thu, Oct 8, 2020 at 10:38 PM Ananyev, Konstantin
>>>>>>>>>>> <konstantin.ananyev@intel.com> wrote:
>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 6:57 PM Burakov, Anatoly
>>>>>>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>>>>>> On 08-Oct-20 9:44 AM, Jerin Jacob wrote:
>>>>>>>>>>>>>>> On Thu, Oct 8, 2020 at 2:04 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>>>>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For more details, Please reference Intel SDM Volume 2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I really would like to see feedbacks from other arch maintainers.
>>>>>>>>>>>>>>>> Unfortunately they were not Cc'ed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Shared the feedback from the arm64 perspective here. Yet to get a reply on this.
>>>>>>>>>>>>>>> http://mails.dpdk.org/archives/dev/2020-September/181646.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also please mark the new functions as experimental.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Jerin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        > IMO, We must introduce some arch feature-capability _get_ scheme to tell
>>>>>>>>>>>>>>        > the consumer of this API is only supported on x86. Probably as
>>>>>>>>>>>>>> functions[1]
>>>>>>>>>>>>>>        > or macro flags scheme and have a stub for the other architectures as the
>>>>>>>>>>>>>>        > API marked as generic ie rte_power_* not rte_x86_..
>>>>>>>>>>>>>>        >
>>>>>>>>>>>>>>        > This will help the consumer to create workers based on the
>>>>>>>>>>>>>> instruction features
>>>>>>>>>>>>>>        > which can NOT be abstracted as a generic feature across the
>>>>>>>>>>>>>> architectures.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not entirely sure what you mean by that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I mean, yes, we should have added stubs for other architectures, and we
>>>>>>>>>>>>>> will add those in future revisions, but what does your proposed runtime
>>>>>>>>>>>>>> check accomplish that cannot currently be done with CPUID flags?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> RTE_CPUFLAG_WAITPKG  flag definition is not available in other architectures.
>>>>>>>>>>>>> i.e RTE_CPUFLAG_WAITPKG defined in lib/librte_eal/x86/include/rte_cpuflags.h
>>>>>>>>>>>>> and it is used in http://patches.dpdk.org/patch/79540/ as generic API.
>>>>>>>>>>>>> I doubt http://patches.dpdk.org/patch/79540/  would compile on non-x86.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am agree with Jerin, that we need some generic way to
>>>>>>>>>>>> figure-out does platform supports power_monitor() or not.
>>>>>>>>>>>> Though not sure do we need to create a new feature-get framework here...
>>>>>>>>>>>
>>>>>>>>>>> That's works too. Some means of generic probing is fine. Following
>>>>>>>>>>> schemed needs
>>>>>>>>>>> more documentation on that usage, as, it is not straight forward compare to
>>>>>>>>>>> feature-get framework. Also, on the other thread, we are adding the
>>>>>>>>>>> new instructions like
>>>>>>>>>>> demote cacheline etc, maybe if the user wants to KNOW if the arch
>>>>>>>>>>> supports it then
>>>>>>>>>>> the feature-get framework is good.
>>>>>>>>>>> If we think, there is no other usecase for generic arch feature-get
>>>>>>>>>>> framework then
>>>>>>>>>>> we can keep the below scheme else generic arch feature is better for
>>>>>>>>>>> more forward
>>>>>>>>>>> looking use cases.
>>>>>>>>>>>
>>>>>>>>>>>> Might be just something like:
>>>>>>>>>>>>        rte_power_monitor(...) == -ENOTSUP
>>>>>>>>>>>> be enough indication for that?
>>>>>>>>>>>> So user can just do:
>>>>>>>>>>>> if (rte_power_monitor(NULL, 0, 0, 0, 0) == -ENOTSUP) {
>>>>>>>>>>>>               /* not supported  path */
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> To check is that feature supported or not.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Looking at CLDEMOTE patches, CLDEMOTE is a noop on other archs. I think
>>>>>>>>>> we can safely make this intrinsic as a noop on other archs as well, as
>>>>>>>>>> it's functionally identical to waking up immediately.
>>>>>>>>>>
>>>>>>>>>> If we're not creating this for CLDEMOTE, we don't need it here as well.
>>>>>>>>>> If we do need it for this, then we arguably need it for CLDEMOTE too.
>>>>>>>>>
>>>>>>>>> Sorry I don't understand what you mean, too many "it" and "this" :)
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, i meant "the generic feature-get framework". CLDEMOTE doesn't
>>>>>>>> exist on other archs, this doesn't too, so it's a fairly similar
>>>>>>>> situation. Stubbing UMWAIT with a noop is a valid approach because it's
>>>>>>>> equivalent to sleeping and then immediately waking up (which can happen
>>>>>>>> for a host of reasons unrelated to the code itself).
>>>>>>>
>>>>>>> If we are keeping the following return in the public API then it can not be NOP
>>>>>>> + * @return
>>>>>>> + *   - 1 if wakeup was due to TSC timeout expiration.
>>>>>>> + *   - 0 if wakeup was due to memory write or other reasons.
>>>>>>> + */
>>>>>>>
>>>>>>
>>>>>> In the generic header, it is specified that return value is
>>>>>> implementation-defined (i.e. arch-specific).
>>>>>
>>>>> Obviously an API definition should *never* be "implementation-defined".
>>>>
>>>> If there isn't a meaningful return value, we could either make it a
>>>> void, or return 0/-ENOTSUP so. I'm OK with either as nop is a valid
>>>> result for a UMWAIT, and there are no side-effects to the intrinsic
>>>> itself (it's basically a fancy rte_pause).
>>>>
>>>>>
>>>>>
>>>>>> I guess we could remove
>>>>>> that and set return value to either 0 or -ENOTSUP if that would resolve
>>>>>> the issue?
>>>>>>
>>>>>>> Also, we need to fix compilation issue if any with
>>>>>>> http://patches.dpdk.org/patch/79540/
>>>>>>> as it has direct reference to if
>>>>>>> (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_WAITPKG)) {
>>>>>>> Either we need to add -ENOTSUP return or generic feature-get framework.
>>>>>>
>>>>>> IIRC power library isn't compiled on anything other than x86, so this
>>>>>> code wouldn't get compiled.
>>>>>
>>>>> It is not call "power-x86", so we must assume it could work
>>>>> on any architecture.
>>>>
>>>> #ifdef it is!
>>>>
>>>>>
>>>>>
>>>>>>>> I'm not against a generic feature-get framework, i'm just pointing out
>>>>>>>> that if this is what's preventing the merge, it should prevent the merge
>>>>>>>> of CLDEMOTE as well
>>>
>>> I wouldn't consider these two as totally equal.
>>> Yes, both are just hints to CPU, that can be ignored,
>>> but if not, then consequences of executing are quite different.
>>> If UMWAIT is not supported by cpu at all, then user might want to use some
>>> different power saving mechanism (pause, frequence scaling, etc.).
>>> Without information is UMWAIT supported or not, user can't make
>>> such choice.
>>
>> After some attempts at implementing this, i actually came to the conclusion
>> that some generic way to check support for this feature is necessary,
>> because we end up with a usability inconsistency:
>>
>> 1) on non-x86, if you call the function, it returns -ENOTSUP
>> 2) on x86, since we're not checking CPUID flags on every single call, it'll
>> either succeed, or crash the process - the burden is on the user to check
>> for CPUID flags, but it can't be done in an arch agnostic way because the
>> CPUID flags are only defined for x86, thus requiring a special code path for
>> x86
>>
>> Where would be the best place to add such an infrastructure? I'm thinking
>> rte_cpuflags.h?
>>
> Time to relook at some of the contents of patchset:
> http://patches.dpdk.org/project/dpdk/list/?series=4811&archive=both&state=*
> 
> The idea of that set (IIRC) was to replace the per-architecture enums with
> just strings to avoid situations like this - or at least make them less
> awkward.
> 
> /Bruce
> 

Yes, that patchset looks like it would work nicely in this case. If 
there is consensus to resurrect it and make it work, i'll drop the 
"generic feature get" thing, but for now i'll continue working on it - 
easier to remove code than to add it :D
  
Ananyev, Konstantin Oct. 9, 2020, 3:39 p.m. UTC | #23
> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>
> >> Add two new power management intrinsics, and provide an implementation
> >> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >> are implemented as raw byte opcodes because there is not yet widespread
> >> compiler support for these instructions.
> >>
> >> The power management instructions provide an architecture-specific
> >> function to either wait until a specified TSC timestamp is reached, or
> >> optionally wait until either a TSC timestamp is reached or a memory
> >> location is written to. The monitor function also provides an optional
> >> comparison, to avoid sleeping when the expected write has already
> >> happened, and no more writes are expected.
> >
> > I think what this API is missing - a function to wakeup sleeping core.
> > If user can/should use some system call to achieve that, then at least
> > it has to be clearly documented, even better some wrapper provided.
> 
> I don't think it's possible to do that without severely overcomplicating
> the intrinsic and its usage, because AFAIK the only way to wake up a
> sleeping core would be to send some kind of interrupt to the core, or
> trigger a write to the cache-line in question.
> 

Yes, I think we either need a syscall that would do an IPI for us
(on top of my head - membarrier() does that, might be there are some other syscalls too),
or something hand-made. For hand-made, I wonder would something like that
be safe and sufficient:
uint64_t val = atomic_load(addr);
CAS(addr, val, &val);
?
Anyway, one way or another - I think ability to wakeup core we put to sleep
have to be an essential part of this feature. 
As I understand linux kernel will limit max amount of sleep time for these instructions:
https://lwn.net/Articles/790920/
But relying just on that, seems too vague for me:
- user can adjust that value
- wouldn't apply to older kernels and non-linux cases
Konstantin
  
Burakov, Anatoly Oct. 9, 2020, 4:10 p.m. UTC | #24
On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> 
>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>
>>>> Add two new power management intrinsics, and provide an implementation
>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>> compiler support for these instructions.
>>>>
>>>> The power management instructions provide an architecture-specific
>>>> function to either wait until a specified TSC timestamp is reached, or
>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>> location is written to. The monitor function also provides an optional
>>>> comparison, to avoid sleeping when the expected write has already
>>>> happened, and no more writes are expected.
>>>
>>> I think what this API is missing - a function to wakeup sleeping core.
>>> If user can/should use some system call to achieve that, then at least
>>> it has to be clearly documented, even better some wrapper provided.
>>
>> I don't think it's possible to do that without severely overcomplicating
>> the intrinsic and its usage, because AFAIK the only way to wake up a
>> sleeping core would be to send some kind of interrupt to the core, or
>> trigger a write to the cache-line in question.
>>
> 
> Yes, I think we either need a syscall that would do an IPI for us
> (on top of my head - membarrier() does that, might be there are some other syscalls too),
> or something hand-made. For hand-made, I wonder would something like that
> be safe and sufficient:
> uint64_t val = atomic_load(addr);
> CAS(addr, val, &val);
> ?
> Anyway, one way or another - I think ability to wakeup core we put to sleep
> have to be an essential part of this feature.
> As I understand linux kernel will limit max amount of sleep time for these instructions:
> https://lwn.net/Articles/790920/
> But relying just on that, seems too vague for me:
> - user can adjust that value
> - wouldn't apply to older kernels and non-linux cases
> Konstantin
> 

This implies knowing the value the core is sleeping on. That's not 
always the case - with this particular PMD power management scheme, we 
get the address from the PMD and it stays inside the callback.
  
Ananyev, Konstantin Oct. 9, 2020, 4:56 p.m. UTC | #25
> 
> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
> >
> >> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
> >>>>
> >>>> Add two new power management intrinsics, and provide an implementation
> >>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>> compiler support for these instructions.
> >>>>
> >>>> The power management instructions provide an architecture-specific
> >>>> function to either wait until a specified TSC timestamp is reached, or
> >>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>> location is written to. The monitor function also provides an optional
> >>>> comparison, to avoid sleeping when the expected write has already
> >>>> happened, and no more writes are expected.
> >>>
> >>> I think what this API is missing - a function to wakeup sleeping core.
> >>> If user can/should use some system call to achieve that, then at least
> >>> it has to be clearly documented, even better some wrapper provided.
> >>
> >> I don't think it's possible to do that without severely overcomplicating
> >> the intrinsic and its usage, because AFAIK the only way to wake up a
> >> sleeping core would be to send some kind of interrupt to the core, or
> >> trigger a write to the cache-line in question.
> >>
> >
> > Yes, I think we either need a syscall that would do an IPI for us
> > (on top of my head - membarrier() does that, might be there are some other syscalls too),
> > or something hand-made. For hand-made, I wonder would something like that
> > be safe and sufficient:
> > uint64_t val = atomic_load(addr);
> > CAS(addr, val, &val);
> > ?
> > Anyway, one way or another - I think ability to wakeup core we put to sleep
> > have to be an essential part of this feature.
> > As I understand linux kernel will limit max amount of sleep time for these instructions:
> > https://lwn.net/Articles/790920/
> > But relying just on that, seems too vague for me:
> > - user can adjust that value
> > - wouldn't apply to older kernels and non-linux cases
> > Konstantin
> >
> 
> This implies knowing the value the core is sleeping on.

You don't the value to wait for, you just need an address.
And you can make wakeup function to accept address as a parameter,
same as monitor() does. 

> That's not
> always the case - with this particular PMD power management scheme, we
> get the address from the PMD and it stays inside the callback.

That's fine - you can store address inside you callback metadata 
and do wakeup as part of _disable_ function.
  
Burakov, Anatoly Oct. 9, 2020, 4:59 p.m. UTC | #26
On 09-Oct-20 5:56 PM, Ananyev, Konstantin wrote:
> 
>>
>> On 09-Oct-20 4:39 PM, Ananyev, Konstantin wrote:
>>>
>>>> On 08-Oct-20 6:15 PM, Ananyev, Konstantin wrote:
>>>>>>
>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>> compiler support for these instructions.
>>>>>>
>>>>>> The power management instructions provide an architecture-specific
>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>> location is written to. The monitor function also provides an optional
>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>> happened, and no more writes are expected.
>>>>>
>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>> If user can/should use some system call to achieve that, then at least
>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>
>>>> I don't think it's possible to do that without severely overcomplicating
>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>> trigger a write to the cache-line in question.
>>>>
>>>
>>> Yes, I think we either need a syscall that would do an IPI for us
>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>> or something hand-made. For hand-made, I wonder would something like that
>>> be safe and sufficient:
>>> uint64_t val = atomic_load(addr);
>>> CAS(addr, val, &val);
>>> ?
>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>> have to be an essential part of this feature.
>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>> https://lwn.net/Articles/790920/
>>> But relying just on that, seems too vague for me:
>>> - user can adjust that value
>>> - wouldn't apply to older kernels and non-linux cases
>>> Konstantin
>>>
>>
>> This implies knowing the value the core is sleeping on.
> 
> You don't the value to wait for, you just need an address.
> And you can make wakeup function to accept address as a parameter,
> same as monitor() does.

Sorry, i meant the address. We don't know the address we're sleeping on.

> 
>> That's not
>> always the case - with this particular PMD power management scheme, we
>> get the address from the PMD and it stays inside the callback.
> 
> That's fine - you can store address inside you callback metadata
> and do wakeup as part of _disable_ function.
> 

The address may be different, and by the time we access the address it 
may become stale, so i don't see how that would help unless you're 
suggesting to have some kind of synchronization mechanism there.
  
Ananyev, Konstantin Oct. 10, 2020, 1:19 p.m. UTC | #27
> >>>>>> Add two new power management intrinsics, and provide an implementation
> >>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>> are implemented as raw byte opcodes because there is not yet widespread
> >>>>>> compiler support for these instructions.
> >>>>>>
> >>>>>> The power management instructions provide an architecture-specific
> >>>>>> function to either wait until a specified TSC timestamp is reached, or
> >>>>>> optionally wait until either a TSC timestamp is reached or a memory
> >>>>>> location is written to. The monitor function also provides an optional
> >>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>> happened, and no more writes are expected.
> >>>>>
> >>>>> I think what this API is missing - a function to wakeup sleeping core.
> >>>>> If user can/should use some system call to achieve that, then at least
> >>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>
> >>>> I don't think it's possible to do that without severely overcomplicating
> >>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>> sleeping core would be to send some kind of interrupt to the core, or
> >>>> trigger a write to the cache-line in question.
> >>>>
> >>>
> >>> Yes, I think we either need a syscall that would do an IPI for us
> >>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
> >>> or something hand-made. For hand-made, I wonder would something like that
> >>> be safe and sufficient:
> >>> uint64_t val = atomic_load(addr);
> >>> CAS(addr, val, &val);
> >>> ?
> >>> Anyway, one way or another - I think ability to wakeup core we put to sleep
> >>> have to be an essential part of this feature.
> >>> As I understand linux kernel will limit max amount of sleep time for these instructions:
> >>> https://lwn.net/Articles/790920/
> >>> But relying just on that, seems too vague for me:
> >>> - user can adjust that value
> >>> - wouldn't apply to older kernels and non-linux cases
> >>> Konstantin
> >>>
> >>
> >> This implies knowing the value the core is sleeping on.
> >
> > You don't the value to wait for, you just need an address.
> > And you can make wakeup function to accept address as a parameter,
> > same as monitor() does.
> 
> Sorry, i meant the address. We don't know the address we're sleeping on.
> 
> >
> >> That's not
> >> always the case - with this particular PMD power management scheme, we
> >> get the address from the PMD and it stays inside the callback.
> >
> > That's fine - you can store address inside you callback metadata
> > and do wakeup as part of _disable_ function.
> >
> 
> The address may be different, and by the time we access the address it
> may become stale, so i don't see how that would help unless you're
> suggesting to have some kind of synchronization mechanism there.

Yes, we'll need something to sync here for sure.
Sorry, I should say it straightway, to avoid further misunderstanding.
Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().  
Konstantin
  
Burakov, Anatoly Oct. 12, 2020, 10:35 a.m. UTC | #28
On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>> compiler support for these instructions.
>>>>>>>>
>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>>>> If user can/should use some system call to achieve that, then at least
>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>
>>>>>> I don't think it's possible to do that without severely overcomplicating
>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>>>> trigger a write to the cache-line in question.
>>>>>>
>>>>>
>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>>>> or something hand-made. For hand-made, I wonder would something like that
>>>>> be safe and sufficient:
>>>>> uint64_t val = atomic_load(addr);
>>>>> CAS(addr, val, &val);
>>>>> ?
>>>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>>>> have to be an essential part of this feature.
>>>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>>>> https://lwn.net/Articles/790920/
>>>>> But relying just on that, seems too vague for me:
>>>>> - user can adjust that value
>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>> Konstantin
>>>>>
>>>>
>>>> This implies knowing the value the core is sleeping on.
>>>
>>> You don't the value to wait for, you just need an address.
>>> And you can make wakeup function to accept address as a parameter,
>>> same as monitor() does.
>>
>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>
>>>
>>>> That's not
>>>> always the case - with this particular PMD power management scheme, we
>>>> get the address from the PMD and it stays inside the callback.
>>>
>>> That's fine - you can store address inside you callback metadata
>>> and do wakeup as part of _disable_ function.
>>>
>>
>> The address may be different, and by the time we access the address it
>> may become stale, so i don't see how that would help unless you're
>> suggesting to have some kind of synchronization mechanism there.
> 
> Yes, we'll need something to sync here for sure.
> Sorry, I should say it straightway, to avoid further misunderstanding.
> Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().
> Konstantin
> 

The idea was to provide an intrinsic-like function - as in, raw 
instruction call, without anything extra. We even added the masks/values 
etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
without those.

Perhaps we can provide a synchronize-able wrapper around it to avoid 
adding overhead to calls that function but doesn't need the sync mechanism?
  
Burakov, Anatoly Oct. 12, 2020, 10:36 a.m. UTC | #29
On 12-Oct-20 11:35 AM, Burakov, Anatoly wrote:
> On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
>>
>>
>>>>>>>>> Add two new power management intrinsics, and provide an 
>>>>>>>>> implementation
>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>> are implemented as raw byte opcodes because there is not yet 
>>>>>>>>> widespread
>>>>>>>>> compiler support for these instructions.
>>>>>>>>>
>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>> function to either wait until a specified TSC timestamp is 
>>>>>>>>> reached, or
>>>>>>>>> optionally wait until either a TSC timestamp is reached or a 
>>>>>>>>> memory
>>>>>>>>> location is written to. The monitor function also provides an 
>>>>>>>>> optional
>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>
>>>>>>>> I think what this API is missing - a function to wakeup sleeping 
>>>>>>>> core.
>>>>>>>> If user can/should use some system call to achieve that, then at 
>>>>>>>> least
>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>
>>>>>>> I don't think it's possible to do that without severely 
>>>>>>> overcomplicating
>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>> sleeping core would be to send some kind of interrupt to the 
>>>>>>> core, or
>>>>>>> trigger a write to the cache-line in question.
>>>>>>>
>>>>>>
>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>> (on top of my head - membarrier() does that, might be there are 
>>>>>> some other syscalls too),
>>>>>> or something hand-made. For hand-made, I wonder would something 
>>>>>> like that
>>>>>> be safe and sufficient:
>>>>>> uint64_t val = atomic_load(addr);
>>>>>> CAS(addr, val, &val);
>>>>>> ?
>>>>>> Anyway, one way or another - I think ability to wakeup core we put 
>>>>>> to sleep
>>>>>> have to be an essential part of this feature.
>>>>>> As I understand linux kernel will limit max amount of sleep time 
>>>>>> for these instructions:
>>>>>> https://lwn.net/Articles/790920/
>>>>>> But relying just on that, seems too vague for me:
>>>>>> - user can adjust that value
>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>> Konstantin
>>>>>>
>>>>>
>>>>> This implies knowing the value the core is sleeping on.
>>>>
>>>> You don't the value to wait for, you just need an address.
>>>> And you can make wakeup function to accept address as a parameter,
>>>> same as monitor() does.
>>>
>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>
>>>>
>>>>> That's not
>>>>> always the case - with this particular PMD power management scheme, we
>>>>> get the address from the PMD and it stays inside the callback.
>>>>
>>>> That's fine - you can store address inside you callback metadata
>>>> and do wakeup as part of _disable_ function.
>>>>
>>>
>>> The address may be different, and by the time we access the address it
>>> may become stale, so i don't see how that would help unless you're
>>> suggesting to have some kind of synchronization mechanism there.
>>
>> Yes, we'll need something to sync here for sure.
>> Sorry, I should say it straightway, to avoid further misunderstanding.
>> Let say, associate a spin_lock with monitor(), by analogy with 
>> pthread_cond_wait().
>> Konstantin
>>
> 
> The idea was to provide an intrinsic-like function - as in, raw 
> instruction call, without anything extra. We even added the masks/values 
> etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
> without those.
> 
> Perhaps we can provide a synchronize-able wrapper around it to avoid 
> adding overhead to calls that function but doesn't need the sync mechanism?
> 

Also, how would having a spinlock help to synchronize? Are you 
suggesting we do UMWAIT on a spinlock address, or something to that effect?
  
Ananyev, Konstantin Oct. 12, 2020, 12:50 p.m. UTC | #30
> >>
> >>>>>>>>> Add two new power management intrinsics, and provide an
> >>>>>>>>> implementation
> >>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
> >>>>>>>>> are implemented as raw byte opcodes because there is not yet
> >>>>>>>>> widespread
> >>>>>>>>> compiler support for these instructions.
> >>>>>>>>>
> >>>>>>>>> The power management instructions provide an architecture-specific
> >>>>>>>>> function to either wait until a specified TSC timestamp is
> >>>>>>>>> reached, or
> >>>>>>>>> optionally wait until either a TSC timestamp is reached or a
> >>>>>>>>> memory
> >>>>>>>>> location is written to. The monitor function also provides an
> >>>>>>>>> optional
> >>>>>>>>> comparison, to avoid sleeping when the expected write has already
> >>>>>>>>> happened, and no more writes are expected.
> >>>>>>>>
> >>>>>>>> I think what this API is missing - a function to wakeup sleeping
> >>>>>>>> core.
> >>>>>>>> If user can/should use some system call to achieve that, then at
> >>>>>>>> least
> >>>>>>>> it has to be clearly documented, even better some wrapper provided.
> >>>>>>>
> >>>>>>> I don't think it's possible to do that without severely
> >>>>>>> overcomplicating
> >>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
> >>>>>>> sleeping core would be to send some kind of interrupt to the
> >>>>>>> core, or
> >>>>>>> trigger a write to the cache-line in question.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I think we either need a syscall that would do an IPI for us
> >>>>>> (on top of my head - membarrier() does that, might be there are
> >>>>>> some other syscalls too),
> >>>>>> or something hand-made. For hand-made, I wonder would something
> >>>>>> like that
> >>>>>> be safe and sufficient:
> >>>>>> uint64_t val = atomic_load(addr);
> >>>>>> CAS(addr, val, &val);
> >>>>>> ?
> >>>>>> Anyway, one way or another - I think ability to wakeup core we put
> >>>>>> to sleep
> >>>>>> have to be an essential part of this feature.
> >>>>>> As I understand linux kernel will limit max amount of sleep time
> >>>>>> for these instructions:
> >>>>>> https://lwn.net/Articles/790920/
> >>>>>> But relying just on that, seems too vague for me:
> >>>>>> - user can adjust that value
> >>>>>> - wouldn't apply to older kernels and non-linux cases
> >>>>>> Konstantin
> >>>>>>
> >>>>>
> >>>>> This implies knowing the value the core is sleeping on.
> >>>>
> >>>> You don't the value to wait for, you just need an address.
> >>>> And you can make wakeup function to accept address as a parameter,
> >>>> same as monitor() does.
> >>>
> >>> Sorry, i meant the address. We don't know the address we're sleeping on.
> >>>
> >>>>
> >>>>> That's not
> >>>>> always the case - with this particular PMD power management scheme, we
> >>>>> get the address from the PMD and it stays inside the callback.
> >>>>
> >>>> That's fine - you can store address inside you callback metadata
> >>>> and do wakeup as part of _disable_ function.
> >>>>
> >>>
> >>> The address may be different, and by the time we access the address it
> >>> may become stale, so i don't see how that would help unless you're
> >>> suggesting to have some kind of synchronization mechanism there.
> >>
> >> Yes, we'll need something to sync here for sure.
> >> Sorry, I should say it straightway, to avoid further misunderstanding.
> >> Let say, associate a spin_lock with monitor(), by analogy with
> >> pthread_cond_wait().
> >> Konstantin
> >>
> >
> > The idea was to provide an intrinsic-like function - as in, raw
> > instruction call, without anything extra. We even added the masks/values
> > etc. only because there's no race-less way to combine UMONITOR/UMWAIT
> > without those.
>>
> > Perhaps we can provide a synchronize-able wrapper around it to avoid
> > adding overhead to calls that function but doesn't need the sync mechanism?

Yes, might be two flavours, something like
rte_power_monitor() and rte_power_monitor_sync() 
or whatever would be a better name.

> >
> 
> Also, how would having a spinlock help to synchronize? Are you
> suggesting we do UMWAIT on a spinlock address, or something to that effect?
> 

I thought about something very similar to cond_wait() working model:

/*
 * Caller has to obtain lock before calling that function.
 */
static inline int rte_power_monitor_sync(const volatile void *p,
                const uint64_t expected_value, const uint64_t value_mask,
                const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
{
	/* do whatever preparations are needed */
               ....
	umonitor(p);

	if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
		return 0;
 	}
	
	/* release lock and go to sleep */
	rte_spinlock_unlock(lck);
	rflags = umwait();

	/* grab lock back after wakeup */
	rte_spinlock_lock(lck);

	/* do rest of processing */
	....
}

/* similar go cond_signal */
static inline void rte_power_monitor_wakeup(volatile void *p)
{
	uint64_t v;

	v = __atomic_load_n(p, __ATOMIC_RELAXED);
	__atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}               


Now in librte_power:

struct pmd_queue_cfg {
       /* to protect state and wait_addr */
       rte_spinlock_t lck;
       enum pmd_mgmt_state pwr_mgmt_state;
       void *wait_addr;
       /* rest fields */
      ....
} __rte_cache_aligned;


static uint16_t
rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
                struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
                uint16_t max_pkts __rte_unused, void *_  __rte_unused)
{

        struct pmd_queue_cfg *q_conf;
        q_conf = &port_cfg[port_id].queue_cfg[qidx];

        if (unlikely(nb_rx == 0)) {
                q_conf->empty_poll_stats++;
                if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
                        volatile void *target_addr;
                        uint64_t expected, mask;
                        uint16_t ret;
		
	         /* grab the lock and check the state */
                       rte_spinlock_lock(&q_conf->lck);
	         If (q-conf->state == ENABLED) {
	                        ret = rte_eth_get_wake_addr(port_id, qidx,
                                                    &target_addr, &expected, &mask);
		          If (ret == 0) {
			q_conf->wait_addr = target_addr;
			rte_power_monitor(target_addr, ..., &q_conf->lck);
		         }	
		          /* reset the wait_addr */
		          q_conf->wait_addr = NULL;
	         }
	         rte_spinlock_unlock(&q_conf->lck);	
	         ....
}

nt
rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
                                uint16_t port_id,
                                uint16_t queue_id)
{
	...
	/* grab the lock and change the state */
               rte_spinlock_lock(&q_conf->lck);
	queue_cfg->state = DISABLED;

	/* wakeup if necessary */
	If (queue_cfg->wakeup_addr != NULL)
		rte_power_monitor_wakeup(queue_cfg->wakeup_addr);

	rte_spinlock_unlock(&q_conf->lck);
	...
}
  
Burakov, Anatoly Oct. 12, 2020, 1:13 p.m. UTC | #31
On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
> 
>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>> implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>> widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>> reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>> memory
>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>> optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>
>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>> core.
>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>> least
>>>>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>>>>
>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>> overcomplicating
>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>> core, or
>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>> some other syscalls too),
>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>> like that
>>>>>>>> be safe and sufficient:
>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>> CAS(addr, val, &val);
>>>>>>>> ?
>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>> to sleep
>>>>>>>> have to be an essential part of this feature.
>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>> for these instructions:
>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>> - user can adjust that value
>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>
>>>>>> You don't the value to wait for, you just need an address.
>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>> same as monitor() does.
>>>>>
>>>>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>>>>
>>>>>>
>>>>>>> That's not
>>>>>>> always the case - with this particular PMD power management scheme, we
>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>
>>>>>> That's fine - you can store address inside you callback metadata
>>>>>> and do wakeup as part of _disable_ function.
>>>>>>
>>>>>
>>>>> The address may be different, and by the time we access the address it
>>>>> may become stale, so i don't see how that would help unless you're
>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>
>>>> Yes, we'll need something to sync here for sure.
>>>> Sorry, I should say it straightway, to avoid further misunderstanding.
>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>> pthread_cond_wait().
>>>> Konstantin
>>>>
>>>
>>> The idea was to provide an intrinsic-like function - as in, raw
>>> instruction call, without anything extra. We even added the masks/values
>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>> without those.
>>>
>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>> adding overhead to calls that function but doesn't need the sync mechanism?
> 
> Yes, might be two flavours, something like
> rte_power_monitor() and rte_power_monitor_sync()
> or whatever would be a better name.
> 
>>>
>>
>> Also, how would having a spinlock help to synchronize? Are you
>> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>>
> 
> I thought about something very similar to cond_wait() working model:
> 
> /*
>   * Caller has to obtain lock before calling that function.
>   */
> static inline int rte_power_monitor_sync(const volatile void *p,
>                  const uint64_t expected_value, const uint64_t value_mask,
>                  const uint32_t state, const uint64_t tsc_timestamp, rte_spinlock_t *lck)
> {
> /* do whatever preparations are needed */
>                 ....
> umonitor(p);
> 
> if (value_mask != 0 && *((const uint64_t *)p) & value_mask == expected_value) {
> return 0;
>   }
> 
> /* release lock and go to sleep */
> rte_spinlock_unlock(lck);
> rflags = umwait();
> 
> /* grab lock back after wakeup */
> rte_spinlock_lock(lck);
> 
> /* do rest of processing */
> ....
> }
> 
> /* similar go cond_signal */
> static inline void rte_power_monitor_wakeup(volatile void *p)
> {
> uint64_t v;
> 
> v = __atomic_load_n(p, __ATOMIC_RELAXED);
> __atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
> }
> 
> 
> Now in librte_power:
> 
> struct pmd_queue_cfg {
>         /* to protect state and wait_addr */
>         rte_spinlock_t lck;
>         enum pmd_mgmt_state pwr_mgmt_state;
>         void *wait_addr;
>         /* rest fields */
>        ....
> } __rte_cache_aligned;
> 
> 
> static uint16_t
> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
> {
> 
>          struct pmd_queue_cfg *q_conf;
>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
> 
>          if (unlikely(nb_rx == 0)) {
>                  q_conf->empty_poll_stats++;
>                  if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>                          volatile void *target_addr;
>                          uint64_t expected, mask;
>                          uint16_t ret;
> 
>           /* grab the lock and check the state */
>                         rte_spinlock_lock(&q_conf->lck);
>           If (q-conf->state == ENABLED) {
>                          ret = rte_eth_get_wake_addr(port_id, qidx,
>                                                      &target_addr, &expected, &mask);
>            If (ret == 0) {
> q_conf->wait_addr = target_addr;
> rte_power_monitor(target_addr, ..., &q_conf->lck);
>           }
>            /* reset the wait_addr */
>            q_conf->wait_addr = NULL;
>           }
>           rte_spinlock_unlock(&q_conf->lck);
>           ....
> }
> 
> nt
> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>                                  uint16_t port_id,
>                                  uint16_t queue_id)
> {
> ...
> /* grab the lock and change the state */
>                 rte_spinlock_lock(&q_conf->lck);
> queue_cfg->state = DISABLED;
> 
> /* wakeup if necessary */
> If (queue_cfg->wakeup_addr != NULL)
> rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
> 
> rte_spinlock_unlock(&q_conf->lck);
> ...
> }
> 

Yeah, seems that i understood you correctly the first time then. I'm not 
completely convinced that this overhead and complexity is worth the 
trouble, to be honest. I mean, it's not like we're going to sleep 
indefinitely, this isn't like pthread wait - the biggest sleep time i've 
seen was around half a second and i'm not sure there is a use case for 
enabling/disabling this functionality willy nilly ever 5 seconds.
  
Burakov, Anatoly Oct. 13, 2020, 9:45 a.m. UTC | #32
On 12-Oct-20 2:13 PM, Burakov, Anatoly wrote:
> On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
>>
>>>>>
>>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The 
>>>>>>>>>>>> instructions
>>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>>> widespread
>>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>>
>>>>>>>>>>>> The power management instructions provide an 
>>>>>>>>>>>> architecture-specific
>>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>>> reached, or
>>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>>> memory
>>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>>> optional
>>>>>>>>>>>> comparison, to avoid sleeping when the expected write has 
>>>>>>>>>>>> already
>>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>>
>>>>>>>>>>> I think what this API is missing - a function to wakeup sleeping
>>>>>>>>>>> core.
>>>>>>>>>>> If user can/should use some system call to achieve that, then at
>>>>>>>>>>> least
>>>>>>>>>>> it has to be clearly documented, even better some wrapper 
>>>>>>>>>>> provided.
>>>>>>>>>>
>>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>>> overcomplicating
>>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to 
>>>>>>>>>> wake up a
>>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>>> core, or
>>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>>> some other syscalls too),
>>>>>>>>> or something hand-made. For hand-made, I wonder would something
>>>>>>>>> like that
>>>>>>>>> be safe and sufficient:
>>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>>> CAS(addr, val, &val);
>>>>>>>>> ?
>>>>>>>>> Anyway, one way or another - I think ability to wakeup core we put
>>>>>>>>> to sleep
>>>>>>>>> have to be an essential part of this feature.
>>>>>>>>> As I understand linux kernel will limit max amount of sleep time
>>>>>>>>> for these instructions:
>>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>>> But relying just on that, seems too vague for me:
>>>>>>>>> - user can adjust that value
>>>>>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>>>>>> Konstantin
>>>>>>>>>
>>>>>>>>
>>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>>
>>>>>>> You don't the value to wait for, you just need an address.
>>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>>> same as monitor() does.
>>>>>>
>>>>>> Sorry, i meant the address. We don't know the address we're 
>>>>>> sleeping on.
>>>>>>
>>>>>>>
>>>>>>>> That's not
>>>>>>>> always the case - with this particular PMD power management 
>>>>>>>> scheme, we
>>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>>
>>>>>>> That's fine - you can store address inside you callback metadata
>>>>>>> and do wakeup as part of _disable_ function.
>>>>>>>
>>>>>>
>>>>>> The address may be different, and by the time we access the 
>>>>>> address it
>>>>>> may become stale, so i don't see how that would help unless you're
>>>>>> suggesting to have some kind of synchronization mechanism there.
>>>>>
>>>>> Yes, we'll need something to sync here for sure.
>>>>> Sorry, I should say it straightway, to avoid further misunderstanding.
>>>>> Let say, associate a spin_lock with monitor(), by analogy with
>>>>> pthread_cond_wait().
>>>>> Konstantin
>>>>>
>>>>
>>>> The idea was to provide an intrinsic-like function - as in, raw
>>>> instruction call, without anything extra. We even added the 
>>>> masks/values
>>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>>> without those.
>>>>
>>>> Perhaps we can provide a synchronize-able wrapper around it to avoid
>>>> adding overhead to calls that function but doesn't need the sync 
>>>> mechanism?
>>
>> Yes, might be two flavours, something like
>> rte_power_monitor() and rte_power_monitor_sync()
>> or whatever would be a better name.
>>
>>>>
>>>
>>> Also, how would having a spinlock help to synchronize? Are you
>>> suggesting we do UMWAIT on a spinlock address, or something to that 
>>> effect?
>>>
>>
>> I thought about something very similar to cond_wait() working model:
>>
>> /*
>>   * Caller has to obtain lock before calling that function.
>>   */
>> static inline int rte_power_monitor_sync(const volatile void *p,
>>                  const uint64_t expected_value, const uint64_t 
>> value_mask,
>>                  const uint32_t state, const uint64_t tsc_timestamp, 
>> rte_spinlock_t *lck)
>> {
>> /* do whatever preparations are needed */
>>                 ....
>> umonitor(p);
>>
>> if (value_mask != 0 && *((const uint64_t *)p) & value_mask == 
>> expected_value) {
>> return 0;
>>   }
>>
>> /* release lock and go to sleep */
>> rte_spinlock_unlock(lck);
>> rflags = umwait();
>>
>> /* grab lock back after wakeup */
>> rte_spinlock_lock(lck);
>>
>> /* do rest of processing */
>> ....
>> }
>>
>> /* similar go cond_signal */
>> static inline void rte_power_monitor_wakeup(volatile void *p)
>> {
>> uint64_t v;
>>
>> v = __atomic_load_n(p, __ATOMIC_RELAXED);
>> __atomic_compare_exchange_n(p, v, &v, 1, __ATOMIC_RELAXED, 
>> __ATOMIC_RELAXED);
>> }
>>
>>
>> Now in librte_power:
>>
>> struct pmd_queue_cfg {
>>         /* to protect state and wait_addr */
>>         rte_spinlock_t lck;
>>         enum pmd_mgmt_state pwr_mgmt_state;
>>         void *wait_addr;
>>         /* rest fields */
>>        ....
>> } __rte_cache_aligned;
>>
>>
>> static uint16_t
>> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>                  uint16_t max_pkts __rte_unused, void *_  __rte_unused)
>> {
>>
>>          struct pmd_queue_cfg *q_conf;
>>          q_conf = &port_cfg[port_id].queue_cfg[qidx];
>>
>>          if (unlikely(nb_rx == 0)) {
>>                  q_conf->empty_poll_stats++;
>>                  if (unlikely(q_conf->empty_poll_stats > 
>> EMPTYPOLL_MAX)) {
>>                          volatile void *target_addr;
>>                          uint64_t expected, mask;
>>                          uint16_t ret;
>>
>>           /* grab the lock and check the state */
>>                         rte_spinlock_lock(&q_conf->lck);
>>           If (q-conf->state == ENABLED) {
>>                          ret = rte_eth_get_wake_addr(port_id, qidx,
>>                                                      &target_addr, 
>> &expected, &mask);
>>            If (ret == 0) {
>> q_conf->wait_addr = target_addr;
>> rte_power_monitor(target_addr, ..., &q_conf->lck);
>>           }
>>            /* reset the wait_addr */
>>            q_conf->wait_addr = NULL;
>>           }
>>           rte_spinlock_unlock(&q_conf->lck);
>>           ....
>> }
>>
>> nt
>> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>                                  uint16_t port_id,
>>                                  uint16_t queue_id)
>> {
>> ...
>> /* grab the lock and change the state */
>>                 rte_spinlock_lock(&q_conf->lck);
>> queue_cfg->state = DISABLED;
>>
>> /* wakeup if necessary */
>> If (queue_cfg->wakeup_addr != NULL)
>> rte_power_monitor_wakeup(queue_cfg->wakeup_addr);
>>
>> rte_spinlock_unlock(&q_conf->lck);
>> ...
>> }
>>
> 
> Yeah, seems that i understood you correctly the first time then. I'm not 
> completely convinced that this overhead and complexity is worth the 
> trouble, to be honest. I mean, it's not like we're going to sleep 
> indefinitely, this isn't like pthread wait - the biggest sleep time i've 
> seen was around half a second and i'm not sure there is a use case for 
> enabling/disabling this functionality willy nilly ever 5 seconds.
> 

Back story: we've had a little internal chat and basically agreed to 
Konstantin's proposal, with slight modifications. That is, we need to be 
able to wake up the core because otherwise we have no deterministic way 
of stopping the sleeping RX path, however there is no need to expose 
this mechanism in a public API and it can be kept inside the power 
library instead.
  

Patch

diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
new file mode 100644
index 0000000000..cd7f8070ac
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -0,0 +1,64 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_H_
+#define _RTE_POWER_INTRINSIC_H_
+
+#include <inttypes.h>
+
+/**
+ * @file
+ * Advanced power management operations.
+ *
+ * This file define APIs for advanced power management,
+ * which are architecture-dependent.
+ */
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp);
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for. Note that the wait behavior is
+ *   architecture-dependent.
+ *
+ * @return
+ *   Architecture-dependent return value.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp);
+
+#endif /* _RTE_POWER_INTRINSIC_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index cd09027958..3a12e87e19 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -60,6 +60,7 @@  generic_headers = files(
 	'generic/rte_memcpy.h',
 	'generic/rte_pause.h',
 	'generic/rte_prefetch.h',
+	'generic/rte_power_intrinsics.h',
 	'generic/rte_rwlock.h',
 	'generic/rte_spinlock.h',
 	'generic/rte_ticketlock.h',
diff --git a/lib/librte_eal/x86/include/meson.build b/lib/librte_eal/x86/include/meson.build
index f0e998c2fe..494a8142a2 100644
--- a/lib/librte_eal/x86/include/meson.build
+++ b/lib/librte_eal/x86/include/meson.build
@@ -13,6 +13,7 @@  arch_headers = files(
 	'rte_io.h',
 	'rte_memcpy.h',
 	'rte_prefetch.h',
+	'rte_power_intrinsics.h',
 	'rte_pause.h',
 	'rte_rtm.h',
 	'rte_rwlock.h',
diff --git a/lib/librte_eal/x86/include/rte_power_intrinsics.h b/lib/librte_eal/x86/include/rte_power_intrinsics.h
new file mode 100644
index 0000000000..6dd1cdc939
--- /dev/null
+++ b/lib/librte_eal/x86/include/rte_power_intrinsics.h
@@ -0,0 +1,143 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#ifndef _RTE_POWER_INTRINSIC_X86_64_H_
+#define _RTE_POWER_INTRINSIC_X86_64_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_atomic.h>
+#include <rte_common.h>
+
+#include "generic/rte_power_intrinsics.h"
+
+/**
+ * Monitor specific address for changes. This will cause the CPU to enter an
+ * architecture-defined optimized power state until either the specified
+ * memory address is written to, or a certain TSC timestamp is reached.
+ *
+ * Additionally, an `expected` 64-bit value and 64-bit mask are provided. If
+ * mask is non-zero, the current value pointed to by the `p` pointer will be
+ * checked against the expected value, and if they match, the entering of
+ * optimized power state may be aborted.
+ *
+ * This function uses UMONITOR/UMWAIT instructions. For more information about
+ * their usage, please refer to Intel(R) 64 and IA-32 Architectures Software
+ * Developer's Manual.
+ *
+ * @param p
+ *   Address to monitor for changes. Must be aligned on an 64-byte boundary.
+ * @param expected_value
+ *   Before attempting the monitoring, the `p` address may be read and compared
+ *   against this value. If `value_mask` is zero, this step will be skipped.
+ * @param value_mask
+ *   The 64-bit mask to use to extract current value from `p`.
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to memory write or other reasons.
+ */
+static inline int rte_power_monitor(const volatile void *p,
+		const uint64_t expected_value, const uint64_t value_mask,
+		const uint32_t state, const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+	/*
+	 * we're using raw byte codes for now as only the newest compiler
+	 * versions support this instruction natively.
+	 */
+
+	/* set address for UMONITOR */
+	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
+			:
+			: "D"(p));
+
+	if (value_mask) {
+		const uint64_t cur_value = *(const volatile uint64_t *)p;
+		const uint64_t masked = cur_value & value_mask;
+		/* if the masked value is already matching, abort */
+		if (masked == expected_value)
+			return 0;
+	}
+	/* execute UMWAIT */
+	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;\n"
+		/*
+		 * UMWAIT sets CF flag in RFLAGS, so PUSHF to push them
+		 * onto the stack, then pop them back into `rflags` so that
+		 * we can read it.
+		 */
+		"pushf;\n"
+		"pop %0;\n"
+		: "=r"(rflags)
+		: "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+/**
+ * Enter an architecture-defined optimized power state until a certain TSC
+ * timestamp is reached.
+ *
+ * This function uses TPAUSE instruction. For more information about its usage,
+ * please refer to Intel(R) 64 and IA-32 Architectures Software Developer's
+ * Manual.
+ *
+ * @param state
+ *   Architecture-dependent optimized power state number. Can be 0 (C0.2) or
+ *   1 (C0.1).
+ * @param tsc_timestamp
+ *   Maximum TSC timestamp to wait for.
+ *
+ * @return
+ *   - 1 if wakeup was due to TSC timeout expiration.
+ *   - 0 if wakeup was due to other reasons.
+ */
+static inline int rte_power_pause(const uint32_t state,
+		const uint64_t tsc_timestamp)
+{
+	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
+	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	/* the rflags need match native register size */
+#ifdef RTE_ARCH_I686
+	uint32_t rflags;
+#else
+	uint64_t rflags;
+#endif
+
+	/* execute TPAUSE */
+	asm volatile(".byte 0x66, 0x0f, 0xae, 0xf7;\n"
+		     /*
+		      * TPAUSE sets CF flag in RFLAGS, so PUSHF to push them
+		      * onto the stack, then pop them back into `rflags` so that
+		      * we can read it.
+		      */
+		     "pushf;\n"
+		     "pop %0;\n"
+		     : "=r"(rflags)
+		     : "D"(state), "a"(tsc_l), "d"(tsc_h));
+
+	/* we're interested in the first bit (the carry flag) */
+	return rflags & 0x1;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_POWER_INTRINSIC_X86_64_H_ */