[v1,1/7] power_intrinsics: allow monitor checks inversion

Message ID 8007029ea9e5129ea43f0c11708169406a16727f.1622548381.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series Enhancements for PMD power management |

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Burakov, Anatoly June 1, 2021, noon UTC
  Previously, the semantics of power monitor were such that we were
checking current value against the expected value, and if they matched,
then the sleep was aborted. This is somewhat inflexible, because it only
allowed us to check for a specific value.

This commit adds an option to reverse the check, so that we can have
monitor sleep aborted if the expected value *doesn't* match what's in
memory. This allows us to both implement all currently implemented
driver code, as well as support more use cases which don't easily map to
previous semantics (such as waiting on writes to AF_XDP counter value).

Since the old behavior is the default, no need to adjust existing
implementations.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
 lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
 2 files changed, 8 insertions(+), 1 deletion(-)
  

Comments

Ananyev, Konstantin June 21, 2021, 12:56 p.m. UTC | #1
Hi Anatoly,

> Previously, the semantics of power monitor were such that we were
> checking current value against the expected value, and if they matched,
> then the sleep was aborted. This is somewhat inflexible, because it only
> allowed us to check for a specific value.
> 
> This commit adds an option to reverse the check, so that we can have
> monitor sleep aborted if the expected value *doesn't* match what's in
> memory. This allows us to both implement all currently implemented
> driver code, as well as support more use cases which don't easily map to
> previous semantics (such as waiting on writes to AF_XDP counter value).
> 
> Since the old behavior is the default, no need to adjust existing
> implementations.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>  lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> index dddca3d41c..1006c2edfc 100644
> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>  	                  *   4, or 8. Supplying any other value will result in
>  	                  *   an error.
>  	                  */
> +	uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> +	                  *   checking if `val` matches something, check if
> +	                  *   `val` *doesn't* match a particular value)
> +	                  */
>  };
> 
>  /**
> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> index 39ea9fdecd..5d944e9aa4 100644
> --- a/lib/eal/x86/rte_power_intrinsics.c
> +++ b/lib/eal/x86/rte_power_intrinsics.c
> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>  		const uint64_t masked = cur_value & pmc->mask;
> 
>  		/* if the masked value is already matching, abort */
> -		if (masked == pmc->val)
> +		if (!pmc->invert && masked == pmc->val)
> +			goto end;
> +		/* same, but for inverse check */
> +		if (pmc->invert && masked != pmc->val)
>  			goto end;
>  	}
> 

Hmm..., such approach looks too 'patchy'...
Can we at least replace 'inver' with something like:
enum rte_power_monitor_cond_op {
	_EQ, NEQ,...
};
Then at least new comparions ops can be added in future.
Even better I think would be to just leave to PMD to provide a comparison callback.
Will make things really simple and generic:
struct rte_power_monitor_cond {
     volatile void *addr;
     int (*cmp)(uint64_t val);
     uint8_t size;
};
And then in rte_power_monitor(...):
....
const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
if (pmc->cmp(cur_value) != 0)
	goto end;
....
  
Burakov, Anatoly June 23, 2021, 9:43 a.m. UTC | #2
On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
> 
> Hi Anatoly,
> 
>> Previously, the semantics of power monitor were such that we were
>> checking current value against the expected value, and if they matched,
>> then the sleep was aborted. This is somewhat inflexible, because it only
>> allowed us to check for a specific value.
>>
>> This commit adds an option to reverse the check, so that we can have
>> monitor sleep aborted if the expected value *doesn't* match what's in
>> memory. This allows us to both implement all currently implemented
>> driver code, as well as support more use cases which don't easily map to
>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>
>> Since the old behavior is the default, no need to adjust existing
>> implementations.
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>   lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>> index dddca3d41c..1006c2edfc 100644
>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>                          *   4, or 8. Supplying any other value will result in
>>                          *   an error.
>>                          */
>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>> +                       *   checking if `val` matches something, check if
>> +                       *   `val` *doesn't* match a particular value)
>> +                       */
>>   };
>>
>>   /**
>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>> index 39ea9fdecd..5d944e9aa4 100644
>> --- a/lib/eal/x86/rte_power_intrinsics.c
>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>                const uint64_t masked = cur_value & pmc->mask;
>>
>>                /* if the masked value is already matching, abort */
>> -             if (masked == pmc->val)
>> +             if (!pmc->invert && masked == pmc->val)
>> +                     goto end;
>> +             /* same, but for inverse check */
>> +             if (pmc->invert && masked != pmc->val)
>>                        goto end;
>>        }
>>
> 
> Hmm..., such approach looks too 'patchy'...
> Can we at least replace 'inver' with something like:
> enum rte_power_monitor_cond_op {
>          _EQ, NEQ,...
> };
> Then at least new comparions ops can be added in future.
> Even better I think would be to just leave to PMD to provide a comparison callback.
> Will make things really simple and generic:
> struct rte_power_monitor_cond {
>       volatile void *addr;
>       int (*cmp)(uint64_t val);
>       uint8_t size;
> };
> And then in rte_power_monitor(...):
> ....
> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> if (pmc->cmp(cur_value) != 0)
>          goto end;
> ....
> 

I like the idea of a callback, but these are supposed to be 
intrinsic-like functions, so putting too much into them is contrary to 
their goal, and it's going to make the API hard to use in simpler cases 
(e.g. when we're explicitly calling rte_power_monitor as opposed to 
letting the RX callback do it for us). For example, event/dlb code calls 
rte_power_monitor explicitly.

It's going to be especially "fun" to do these indirect function calls 
from inside transactional region on call to multi-monitor. I'm not 
opposed to having a callback here, but maybe others have more thoughts 
on this?
  
Ananyev, Konstantin June 23, 2021, 9:55 a.m. UTC | #3
> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, June 23, 2021 10:43 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Hunt, David <david.hunt@intel.com>
> Subject: Re: [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
> 
> On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
> >
> > Hi Anatoly,
> >
> >> Previously, the semantics of power monitor were such that we were
> >> checking current value against the expected value, and if they matched,
> >> then the sleep was aborted. This is somewhat inflexible, because it only
> >> allowed us to check for a specific value.
> >>
> >> This commit adds an option to reverse the check, so that we can have
> >> monitor sleep aborted if the expected value *doesn't* match what's in
> >> memory. This allows us to both implement all currently implemented
> >> driver code, as well as support more use cases which don't easily map to
> >> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>
> >> Since the old behavior is the default, no need to adjust existing
> >> implementations.
> >>
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
> >>   lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>   lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>   2 files changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >> index dddca3d41c..1006c2edfc 100644
> >> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>                          *   4, or 8. Supplying any other value will result in
> >>                          *   an error.
> >>                          */
> >> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >> +                       *   checking if `val` matches something, check if
> >> +                       *   `val` *doesn't* match a particular value)
> >> +                       */
> >>   };
> >>
> >>   /**
> >> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >> index 39ea9fdecd..5d944e9aa4 100644
> >> --- a/lib/eal/x86/rte_power_intrinsics.c
> >> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>                const uint64_t masked = cur_value & pmc->mask;
> >>
> >>                /* if the masked value is already matching, abort */
> >> -             if (masked == pmc->val)
> >> +             if (!pmc->invert && masked == pmc->val)
> >> +                     goto end;
> >> +             /* same, but for inverse check */
> >> +             if (pmc->invert && masked != pmc->val)
> >>                        goto end;
> >>        }
> >>
> >
> > Hmm..., such approach looks too 'patchy'...
> > Can we at least replace 'inver' with something like:
> > enum rte_power_monitor_cond_op {
> >          _EQ, NEQ,...
> > };
> > Then at least new comparions ops can be added in future.
> > Even better I think would be to just leave to PMD to provide a comparison callback.
> > Will make things really simple and generic:
> > struct rte_power_monitor_cond {
> >       volatile void *addr;
> >       int (*cmp)(uint64_t val);
> >       uint8_t size;
> > };
> > And then in rte_power_monitor(...):
> > ....
> > const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> > if (pmc->cmp(cur_value) != 0)
> >          goto end;
> > ....
> >
> 
> I like the idea of a callback, but these are supposed to be
> intrinsic-like functions, so putting too much into them is contrary to
> their goal, and it's going to make the API hard to use in simpler cases
> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> letting the RX callback do it for us). For example, event/dlb code calls
> rte_power_monitor explicitly.

Good point, I didn't know that.
Would be interesting to see how do they use it.

> 
> It's going to be especially "fun" to do these indirect function calls
> from inside transactional region on call to multi-monitor.

But the callback is not supposed to do any memory reads/writes.
Just mask/compare of the provided value with some constant.

> I'm not
> opposed to having a callback here, but maybe others have more thoughts
> on this?
> 
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly June 23, 2021, 10 a.m. UTC | #4
On 23-Jun-21 10:55 AM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Wednesday, June 23, 2021 10:43 AM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>
>> Cc: Loftus, Ciara <ciara.loftus@intel.com>; Hunt, David <david.hunt@intel.com>
>> Subject: Re: [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion
>>
>> On 21-Jun-21 1:56 PM, Ananyev, Konstantin wrote:
>>>
>>> Hi Anatoly,
>>>
>>>> Previously, the semantics of power monitor were such that we were
>>>> checking current value against the expected value, and if they matched,
>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>> allowed us to check for a specific value.
>>>>
>>>> This commit adds an option to reverse the check, so that we can have
>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>> memory. This allows us to both implement all currently implemented
>>>> driver code, as well as support more use cases which don't easily map to
>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>
>>>> Since the old behavior is the default, no need to adjust existing
>>>> implementations.
>>>>
>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>> ---
>>>>    lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>    lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>    2 files changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>> index dddca3d41c..1006c2edfc 100644
>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>                           *   4, or 8. Supplying any other value will result in
>>>>                           *   an error.
>>>>                           */
>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>> +                       *   checking if `val` matches something, check if
>>>> +                       *   `val` *doesn't* match a particular value)
>>>> +                       */
>>>>    };
>>>>
>>>>    /**
>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>                 const uint64_t masked = cur_value & pmc->mask;
>>>>
>>>>                 /* if the masked value is already matching, abort */
>>>> -             if (masked == pmc->val)
>>>> +             if (!pmc->invert && masked == pmc->val)
>>>> +                     goto end;
>>>> +             /* same, but for inverse check */
>>>> +             if (pmc->invert && masked != pmc->val)
>>>>                         goto end;
>>>>         }
>>>>
>>>
>>> Hmm..., such approach looks too 'patchy'...
>>> Can we at least replace 'inver' with something like:
>>> enum rte_power_monitor_cond_op {
>>>           _EQ, NEQ,...
>>> };
>>> Then at least new comparions ops can be added in future.
>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>> Will make things really simple and generic:
>>> struct rte_power_monitor_cond {
>>>        volatile void *addr;
>>>        int (*cmp)(uint64_t val);
>>>        uint8_t size;
>>> };
>>> And then in rte_power_monitor(...):
>>> ....
>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>> if (pmc->cmp(cur_value) != 0)
>>>           goto end;
>>> ....
>>>
>>
>> I like the idea of a callback, but these are supposed to be
>> intrinsic-like functions, so putting too much into them is contrary to
>> their goal, and it's going to make the API hard to use in simpler cases
>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>> letting the RX callback do it for us). For example, event/dlb code calls
>> rte_power_monitor explicitly.
> 
> Good point, I didn't know that.
> Would be interesting to see how do they use it.

To be fair, it should be possible to rewrite their code using a 
callback. Perhaps adding a (void *) parameter for any custom data 
related to the callback (because C doesn't have closures...), but 
otherwise it should be doable, so the question isn't that it's 
impossible to rewrite event/dlb code to use callbacks, it's more of an 
issue with complicating usage of already-not-quite-straightforward API 
even more.

> 
>>
>> It's going to be especially "fun" to do these indirect function calls
>> from inside transactional region on call to multi-monitor.
> 
> But the callback is not supposed to do any memory reads/writes.
> Just mask/compare of the provided value with some constant.

Yeah, but with callbacks we can't really control that, can we? I mean i 
guess a *sane* implementation wouldn't do that, but still, it's 
theoretically possible to perform more complex checks and even touch 
some unrelated data in the process.

> 
>> I'm not
>> opposed to having a callback here, but maybe others have more thoughts
>> on this?
>>
>> --
>> Thanks,
>> Anatoly
  
Ananyev, Konstantin June 23, 2021, 11 a.m. UTC | #5
> >>>
> >>>> Previously, the semantics of power monitor were such that we were
> >>>> checking current value against the expected value, and if they matched,
> >>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>> allowed us to check for a specific value.
> >>>>
> >>>> This commit adds an option to reverse the check, so that we can have
> >>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>> memory. This allows us to both implement all currently implemented
> >>>> driver code, as well as support more use cases which don't easily map to
> >>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>
> >>>> Since the old behavior is the default, no need to adjust existing
> >>>> implementations.
> >>>>
> >>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>> ---
> >>>>    lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>    lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>    2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> index dddca3d41c..1006c2edfc 100644
> >>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>                           *   4, or 8. Supplying any other value will result in
> >>>>                           *   an error.
> >>>>                           */
> >>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>> +                       *   checking if `val` matches something, check if
> >>>> +                       *   `val` *doesn't* match a particular value)
> >>>> +                       */
> >>>>    };
> >>>>
> >>>>    /**
> >>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>                 const uint64_t masked = cur_value & pmc->mask;
> >>>>
> >>>>                 /* if the masked value is already matching, abort */
> >>>> -             if (masked == pmc->val)
> >>>> +             if (!pmc->invert && masked == pmc->val)
> >>>> +                     goto end;
> >>>> +             /* same, but for inverse check */
> >>>> +             if (pmc->invert && masked != pmc->val)
> >>>>                         goto end;
> >>>>         }
> >>>>
> >>>
> >>> Hmm..., such approach looks too 'patchy'...
> >>> Can we at least replace 'inver' with something like:
> >>> enum rte_power_monitor_cond_op {
> >>>           _EQ, NEQ,...
> >>> };
> >>> Then at least new comparions ops can be added in future.
> >>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>> Will make things really simple and generic:
> >>> struct rte_power_monitor_cond {
> >>>        volatile void *addr;
> >>>        int (*cmp)(uint64_t val);
> >>>        uint8_t size;
> >>> };
> >>> And then in rte_power_monitor(...):
> >>> ....
> >>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>> if (pmc->cmp(cur_value) != 0)
> >>>           goto end;
> >>> ....
> >>>
> >>
> >> I like the idea of a callback, but these are supposed to be
> >> intrinsic-like functions, so putting too much into them is contrary to
> >> their goal, and it's going to make the API hard to use in simpler cases
> >> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >> letting the RX callback do it for us). For example, event/dlb code calls
> >> rte_power_monitor explicitly.
> >
> > Good point, I didn't know that.
> > Would be interesting to see how do they use it.
> 
> To be fair, it should be possible to rewrite their code using a
> callback. Perhaps adding a (void *) parameter for any custom data
> related to the callback (because C doesn't have closures...), but
> otherwise it should be doable, so the question isn't that it's
> impossible to rewrite event/dlb code to use callbacks, it's more of an
> issue with complicating usage of already-not-quite-straightforward API
> even more.
> 
> >
> >>
> >> It's going to be especially "fun" to do these indirect function calls
> >> from inside transactional region on call to multi-monitor.
> >
> > But the callback is not supposed to do any memory reads/writes.
> > Just mask/compare of the provided value with some constant.
> 
> Yeah, but with callbacks we can't really control that, can we? I mean i
> guess a *sane* implementation wouldn't do that, but still, it's
> theoretically possible to perform more complex checks and even touch
> some unrelated data in the process.

Yep, PMD developer can ignore recommendations and do whatever
he wants in the call-back. We can't control it.
If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
In principle it is the same with all other PMD dev-ops - we have to trust that they are
doing what they have to.  

> 
> >
> >> I'm not
> >> opposed to having a callback here, but maybe others have more thoughts
> >> on this?
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly June 23, 2021, 12:12 p.m. UTC | #6
On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
> 
>>>>>
>>>>>> Previously, the semantics of power monitor were such that we were
>>>>>> checking current value against the expected value, and if they matched,
>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>>>> allowed us to check for a specific value.
>>>>>>
>>>>>> This commit adds an option to reverse the check, so that we can have
>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>>>> memory. This allows us to both implement all currently implemented
>>>>>> driver code, as well as support more use cases which don't easily map to
>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>>>
>>>>>> Since the old behavior is the default, no need to adjust existing
>>>>>> implementations.
>>>>>>
>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> ---
>>>>>>     lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>>>     lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>>>     2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> index dddca3d41c..1006c2edfc 100644
>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>>>                            *   4, or 8. Supplying any other value will result in
>>>>>>                            *   an error.
>>>>>>                            */
>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>>>> +                       *   checking if `val` matches something, check if
>>>>>> +                       *   `val` *doesn't* match a particular value)
>>>>>> +                       */
>>>>>>     };
>>>>>>
>>>>>>     /**
>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>>>                  const uint64_t masked = cur_value & pmc->mask;
>>>>>>
>>>>>>                  /* if the masked value is already matching, abort */
>>>>>> -             if (masked == pmc->val)
>>>>>> +             if (!pmc->invert && masked == pmc->val)
>>>>>> +                     goto end;
>>>>>> +             /* same, but for inverse check */
>>>>>> +             if (pmc->invert && masked != pmc->val)
>>>>>>                          goto end;
>>>>>>          }
>>>>>>
>>>>>
>>>>> Hmm..., such approach looks too 'patchy'...
>>>>> Can we at least replace 'inver' with something like:
>>>>> enum rte_power_monitor_cond_op {
>>>>>            _EQ, NEQ,...
>>>>> };
>>>>> Then at least new comparions ops can be added in future.
>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>>>> Will make things really simple and generic:
>>>>> struct rte_power_monitor_cond {
>>>>>         volatile void *addr;
>>>>>         int (*cmp)(uint64_t val);
>>>>>         uint8_t size;
>>>>> };
>>>>> And then in rte_power_monitor(...):
>>>>> ....
>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>>>> if (pmc->cmp(cur_value) != 0)
>>>>>            goto end;
>>>>> ....
>>>>>
>>>>
>>>> I like the idea of a callback, but these are supposed to be
>>>> intrinsic-like functions, so putting too much into them is contrary to
>>>> their goal, and it's going to make the API hard to use in simpler cases
>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>>>> letting the RX callback do it for us). For example, event/dlb code calls
>>>> rte_power_monitor explicitly.
>>>
>>> Good point, I didn't know that.
>>> Would be interesting to see how do they use it.
>>
>> To be fair, it should be possible to rewrite their code using a
>> callback. Perhaps adding a (void *) parameter for any custom data
>> related to the callback (because C doesn't have closures...), but
>> otherwise it should be doable, so the question isn't that it's
>> impossible to rewrite event/dlb code to use callbacks, it's more of an
>> issue with complicating usage of already-not-quite-straightforward API
>> even more.
>>
>>>
>>>>
>>>> It's going to be especially "fun" to do these indirect function calls
>>>> from inside transactional region on call to multi-monitor.
>>>
>>> But the callback is not supposed to do any memory reads/writes.
>>> Just mask/compare of the provided value with some constant.
>>
>> Yeah, but with callbacks we can't really control that, can we? I mean i
>> guess a *sane* implementation wouldn't do that, but still, it's
>> theoretically possible to perform more complex checks and even touch
>> some unrelated data in the process.
> 
> Yep, PMD developer can ignore recommendations and do whatever
> he wants in the call-back. We can't control it.
> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> In principle it is the same with all other PMD dev-ops - we have to trust that they are
> doing what they have to.

I did a quick prototype for this, and i don't think it is going to work.

Callbacks with just "current value" as argument will be pretty limited 
and will only really work for cases where we know what we are expecting. 
However, for cases like event/dlb or net/mlx5, the expected value is (or 
appears to be) dependent upon some internal device data, and is not 
constant like in case of net/ixgbe for example.

This can be fixed by passing an opaque pointer, either by storing it in 
the monitor condition, or by passing it directly to rte_power_monitor at 
invocation time.

The latter doesn't work well because when we call rte_power_monitor from 
inside the rte_power library, we lack the context necessary to get said 
opaque pointer.

The former doesn't work either, because the only place where we can get 
this argument is inside get_monitor_addr, but the opaque pointer must 
persist after we exit that function in order to avoid use-after-free - 
which means that it either has to be statically allocated (which means 
it's not thread-safe for a non-trivial case), or dynamically allocated 
(which a big no-no on a hotpath).

Any other suggestions? :)

> 
>>
>>>
>>>> I'm not
>>>> opposed to having a callback here, but maybe others have more thoughts
>>>> on this?
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly
  
Ananyev, Konstantin June 23, 2021, 1:27 p.m. UTC | #7
> On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
> >
> >>>>>
> >>>>>> Previously, the semantics of power monitor were such that we were
> >>>>>> checking current value against the expected value, and if they matched,
> >>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>>>> allowed us to check for a specific value.
> >>>>>>
> >>>>>> This commit adds an option to reverse the check, so that we can have
> >>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>>>> memory. This allows us to both implement all currently implemented
> >>>>>> driver code, as well as support more use cases which don't easily map to
> >>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>>>
> >>>>>> Since the old behavior is the default, no need to adjust existing
> >>>>>> implementations.
> >>>>>>
> >>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> ---
> >>>>>>     lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>>>     lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>>>     2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> index dddca3d41c..1006c2edfc 100644
> >>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>>>                            *   4, or 8. Supplying any other value will result in
> >>>>>>                            *   an error.
> >>>>>>                            */
> >>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>>>> +                       *   checking if `val` matches something, check if
> >>>>>> +                       *   `val` *doesn't* match a particular value)
> >>>>>> +                       */
> >>>>>>     };
> >>>>>>
> >>>>>>     /**
> >>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>>>                  const uint64_t masked = cur_value & pmc->mask;
> >>>>>>
> >>>>>>                  /* if the masked value is already matching, abort */
> >>>>>> -             if (masked == pmc->val)
> >>>>>> +             if (!pmc->invert && masked == pmc->val)
> >>>>>> +                     goto end;
> >>>>>> +             /* same, but for inverse check */
> >>>>>> +             if (pmc->invert && masked != pmc->val)
> >>>>>>                          goto end;
> >>>>>>          }
> >>>>>>
> >>>>>
> >>>>> Hmm..., such approach looks too 'patchy'...
> >>>>> Can we at least replace 'inver' with something like:
> >>>>> enum rte_power_monitor_cond_op {
> >>>>>            _EQ, NEQ,...
> >>>>> };
> >>>>> Then at least new comparions ops can be added in future.
> >>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>>>> Will make things really simple and generic:
> >>>>> struct rte_power_monitor_cond {
> >>>>>         volatile void *addr;
> >>>>>         int (*cmp)(uint64_t val);
> >>>>>         uint8_t size;
> >>>>> };
> >>>>> And then in rte_power_monitor(...):
> >>>>> ....
> >>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>>>> if (pmc->cmp(cur_value) != 0)
> >>>>>            goto end;
> >>>>> ....
> >>>>>
> >>>>
> >>>> I like the idea of a callback, but these are supposed to be
> >>>> intrinsic-like functions, so putting too much into them is contrary to
> >>>> their goal, and it's going to make the API hard to use in simpler cases
> >>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >>>> letting the RX callback do it for us). For example, event/dlb code calls
> >>>> rte_power_monitor explicitly.
> >>>
> >>> Good point, I didn't know that.
> >>> Would be interesting to see how do they use it.
> >>
> >> To be fair, it should be possible to rewrite their code using a
> >> callback. Perhaps adding a (void *) parameter for any custom data
> >> related to the callback (because C doesn't have closures...), but
> >> otherwise it should be doable, so the question isn't that it's
> >> impossible to rewrite event/dlb code to use callbacks, it's more of an
> >> issue with complicating usage of already-not-quite-straightforward API
> >> even more.
> >>
> >>>
> >>>>
> >>>> It's going to be especially "fun" to do these indirect function calls
> >>>> from inside transactional region on call to multi-monitor.
> >>>
> >>> But the callback is not supposed to do any memory reads/writes.
> >>> Just mask/compare of the provided value with some constant.
> >>
> >> Yeah, but with callbacks we can't really control that, can we? I mean i
> >> guess a *sane* implementation wouldn't do that, but still, it's
> >> theoretically possible to perform more complex checks and even touch
> >> some unrelated data in the process.
> >
> > Yep, PMD developer can ignore recommendations and do whatever
> > he wants in the call-back. We can't control it.
> > If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> > In principle it is the same with all other PMD dev-ops - we have to trust that they are
> > doing what they have to.
> 
> I did a quick prototype for this, and i don't think it is going to work.
> 
> Callbacks with just "current value" as argument will be pretty limited
> and will only really work for cases where we know what we are expecting.
> However, for cases like event/dlb or net/mlx5, the expected value is (or
> appears to be) dependent upon some internal device data, and is not
> constant like in case of net/ixgbe for example.
> 
> This can be fixed by passing an opaque pointer, either by storing it in
> the monitor condition, or by passing it directly to rte_power_monitor at
> invocation time.
> 
> The latter doesn't work well because when we call rte_power_monitor from
> inside the rte_power library, we lack the context necessary to get said
> opaque pointer.
> 
> The former doesn't work either, because the only place where we can get
> this argument is inside get_monitor_addr, but the opaque pointer must
> persist after we exit that function in order to avoid use-after-free -
> which means that it either has to be statically allocated (which means
> it's not thread-safe for a non-trivial case), or dynamically allocated
> (which a big no-no on a hotpath).

If I get you right, expected_value (and probably mask) can be variable ones.
So for callback approach to work we need to pass all this as parameters
to PMD comparison callback:
int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
Correct? 

> 
> Any other suggestions? :)
> 
> >
> >>
> >>>
> >>>> I'm not
> >>>> opposed to having a callback here, but maybe others have more thoughts
> >>>> on this?
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly June 23, 2021, 2:13 p.m. UTC | #8
On 23-Jun-21 2:27 PM, Ananyev, Konstantin wrote:
> 
>> On 23-Jun-21 12:00 PM, Ananyev, Konstantin wrote:
>>>
>>>>>>>
>>>>>>>> Previously, the semantics of power monitor were such that we were
>>>>>>>> checking current value against the expected value, and if they matched,
>>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
>>>>>>>> allowed us to check for a specific value.
>>>>>>>>
>>>>>>>> This commit adds an option to reverse the check, so that we can have
>>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
>>>>>>>> memory. This allows us to both implement all currently implemented
>>>>>>>> driver code, as well as support more use cases which don't easily map to
>>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
>>>>>>>>
>>>>>>>> Since the old behavior is the default, no need to adjust existing
>>>>>>>> implementations.
>>>>>>>>
>>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>>> ---
>>>>>>>>      lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
>>>>>>>>      lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
>>>>>>>>      2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> index dddca3d41c..1006c2edfc 100644
>>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
>>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
>>>>>>>>                             *   4, or 8. Supplying any other value will result in
>>>>>>>>                             *   an error.
>>>>>>>>                             */
>>>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
>>>>>>>> +                       *   checking if `val` matches something, check if
>>>>>>>> +                       *   `val` *doesn't* match a particular value)
>>>>>>>> +                       */
>>>>>>>>      };
>>>>>>>>
>>>>>>>>      /**
>>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> index 39ea9fdecd..5d944e9aa4 100644
>>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
>>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
>>>>>>>>                   const uint64_t masked = cur_value & pmc->mask;
>>>>>>>>
>>>>>>>>                   /* if the masked value is already matching, abort */
>>>>>>>> -             if (masked == pmc->val)
>>>>>>>> +             if (!pmc->invert && masked == pmc->val)
>>>>>>>> +                     goto end;
>>>>>>>> +             /* same, but for inverse check */
>>>>>>>> +             if (pmc->invert && masked != pmc->val)
>>>>>>>>                           goto end;
>>>>>>>>           }
>>>>>>>>
>>>>>>>
>>>>>>> Hmm..., such approach looks too 'patchy'...
>>>>>>> Can we at least replace 'inver' with something like:
>>>>>>> enum rte_power_monitor_cond_op {
>>>>>>>             _EQ, NEQ,...
>>>>>>> };
>>>>>>> Then at least new comparions ops can be added in future.
>>>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
>>>>>>> Will make things really simple and generic:
>>>>>>> struct rte_power_monitor_cond {
>>>>>>>          volatile void *addr;
>>>>>>>          int (*cmp)(uint64_t val);
>>>>>>>          uint8_t size;
>>>>>>> };
>>>>>>> And then in rte_power_monitor(...):
>>>>>>> ....
>>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>>>>>> if (pmc->cmp(cur_value) != 0)
>>>>>>>             goto end;
>>>>>>> ....
>>>>>>>
>>>>>>
>>>>>> I like the idea of a callback, but these are supposed to be
>>>>>> intrinsic-like functions, so putting too much into them is contrary to
>>>>>> their goal, and it's going to make the API hard to use in simpler cases
>>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
>>>>>> letting the RX callback do it for us). For example, event/dlb code calls
>>>>>> rte_power_monitor explicitly.
>>>>>
>>>>> Good point, I didn't know that.
>>>>> Would be interesting to see how do they use it.
>>>>
>>>> To be fair, it should be possible to rewrite their code using a
>>>> callback. Perhaps adding a (void *) parameter for any custom data
>>>> related to the callback (because C doesn't have closures...), but
>>>> otherwise it should be doable, so the question isn't that it's
>>>> impossible to rewrite event/dlb code to use callbacks, it's more of an
>>>> issue with complicating usage of already-not-quite-straightforward API
>>>> even more.
>>>>
>>>>>
>>>>>>
>>>>>> It's going to be especially "fun" to do these indirect function calls
>>>>>> from inside transactional region on call to multi-monitor.
>>>>>
>>>>> But the callback is not supposed to do any memory reads/writes.
>>>>> Just mask/compare of the provided value with some constant.
>>>>
>>>> Yeah, but with callbacks we can't really control that, can we? I mean i
>>>> guess a *sane* implementation wouldn't do that, but still, it's
>>>> theoretically possible to perform more complex checks and even touch
>>>> some unrelated data in the process.
>>>
>>> Yep, PMD developer can ignore recommendations and do whatever
>>> he wants in the call-back. We can't control it.
>>> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
>>> In principle it is the same with all other PMD dev-ops - we have to trust that they are
>>> doing what they have to.
>>
>> I did a quick prototype for this, and i don't think it is going to work.
>>
>> Callbacks with just "current value" as argument will be pretty limited
>> and will only really work for cases where we know what we are expecting.
>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>> appears to be) dependent upon some internal device data, and is not
>> constant like in case of net/ixgbe for example.
>>
>> This can be fixed by passing an opaque pointer, either by storing it in
>> the monitor condition, or by passing it directly to rte_power_monitor at
>> invocation time.
>>
>> The latter doesn't work well because when we call rte_power_monitor from
>> inside the rte_power library, we lack the context necessary to get said
>> opaque pointer.
>>
>> The former doesn't work either, because the only place where we can get
>> this argument is inside get_monitor_addr, but the opaque pointer must
>> persist after we exit that function in order to avoid use-after-free -
>> which means that it either has to be statically allocated (which means
>> it's not thread-safe for a non-trivial case), or dynamically allocated
>> (which a big no-no on a hotpath).
> 
> If I get you right, expected_value (and probably mask) can be variable ones.
> So for callback approach to work we need to pass all this as parameters
> to PMD comparison callback:
> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> Correct?

If we have both expected value, mask, and current value, then what's the 
point of the callback? The point of the callback would be to pass just 
the current value, and let the callback decide what's the expected value 
and how to compare it.

So, we can either let callback handle expected values itself by having 
an opaque callback-specific argument (which means it has to persist 
between .get_monitor_addr() and rte_power_monitor() calls), or we do the 
comparisons inside rte_power_monitor(), and store the expected/mask 
values in the monitor condition, and *don't* have any callbacks at all.

Are you suggesting an alternative to the above two options?

> 
>>
>> Any other suggestions? :)
>>
>>>
>>>>
>>>>>
>>>>>> I'm not
>>>>>> opposed to having a callback here, but maybe others have more thoughts
>>>>>> on this?
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Anatoly
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly
  
Ananyev, Konstantin June 24, 2021, 9:47 a.m. UTC | #9
> >>>>>>>
> >>>>>>>> Previously, the semantics of power monitor were such that we were
> >>>>>>>> checking current value against the expected value, and if they matched,
> >>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>>>>>> allowed us to check for a specific value.
> >>>>>>>>
> >>>>>>>> This commit adds an option to reverse the check, so that we can have
> >>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>>>>>> memory. This allows us to both implement all currently implemented
> >>>>>>>> driver code, as well as support more use cases which don't easily map to
> >>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>>>>>
> >>>>>>>> Since the old behavior is the default, no need to adjust existing
> >>>>>>>> implementations.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>>>> ---
> >>>>>>>>      lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>>>>>      lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>>>>>      2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> index dddca3d41c..1006c2edfc 100644
> >>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>>>>>                             *   4, or 8. Supplying any other value will result in
> >>>>>>>>                             *   an error.
> >>>>>>>>                             */
> >>>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>>>>>> +                       *   checking if `val` matches something, check if
> >>>>>>>> +                       *   `val` *doesn't* match a particular value)
> >>>>>>>> +                       */
> >>>>>>>>      };
> >>>>>>>>
> >>>>>>>>      /**
> >>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>>>>>                   const uint64_t masked = cur_value & pmc->mask;
> >>>>>>>>
> >>>>>>>>                   /* if the masked value is already matching, abort */
> >>>>>>>> -             if (masked == pmc->val)
> >>>>>>>> +             if (!pmc->invert && masked == pmc->val)
> >>>>>>>> +                     goto end;
> >>>>>>>> +             /* same, but for inverse check */
> >>>>>>>> +             if (pmc->invert && masked != pmc->val)
> >>>>>>>>                           goto end;
> >>>>>>>>           }
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hmm..., such approach looks too 'patchy'...
> >>>>>>> Can we at least replace 'inver' with something like:
> >>>>>>> enum rte_power_monitor_cond_op {
> >>>>>>>             _EQ, NEQ,...
> >>>>>>> };
> >>>>>>> Then at least new comparions ops can be added in future.
> >>>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>>>>>> Will make things really simple and generic:
> >>>>>>> struct rte_power_monitor_cond {
> >>>>>>>          volatile void *addr;
> >>>>>>>          int (*cmp)(uint64_t val);
> >>>>>>>          uint8_t size;
> >>>>>>> };
> >>>>>>> And then in rte_power_monitor(...):
> >>>>>>> ....
> >>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>>>>>> if (pmc->cmp(cur_value) != 0)
> >>>>>>>             goto end;
> >>>>>>> ....
> >>>>>>>
> >>>>>>
> >>>>>> I like the idea of a callback, but these are supposed to be
> >>>>>> intrinsic-like functions, so putting too much into them is contrary to
> >>>>>> their goal, and it's going to make the API hard to use in simpler cases
> >>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >>>>>> letting the RX callback do it for us). For example, event/dlb code calls
> >>>>>> rte_power_monitor explicitly.
> >>>>>
> >>>>> Good point, I didn't know that.
> >>>>> Would be interesting to see how do they use it.
> >>>>
> >>>> To be fair, it should be possible to rewrite their code using a
> >>>> callback. Perhaps adding a (void *) parameter for any custom data
> >>>> related to the callback (because C doesn't have closures...), but
> >>>> otherwise it should be doable, so the question isn't that it's
> >>>> impossible to rewrite event/dlb code to use callbacks, it's more of an
> >>>> issue with complicating usage of already-not-quite-straightforward API
> >>>> even more.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> It's going to be especially "fun" to do these indirect function calls
> >>>>>> from inside transactional region on call to multi-monitor.
> >>>>>
> >>>>> But the callback is not supposed to do any memory reads/writes.
> >>>>> Just mask/compare of the provided value with some constant.
> >>>>
> >>>> Yeah, but with callbacks we can't really control that, can we? I mean i
> >>>> guess a *sane* implementation wouldn't do that, but still, it's
> >>>> theoretically possible to perform more complex checks and even touch
> >>>> some unrelated data in the process.
> >>>
> >>> Yep, PMD developer can ignore recommendations and do whatever
> >>> he wants in the call-back. We can't control it.
> >>> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> >>> In principle it is the same with all other PMD dev-ops - we have to trust that they are
> >>> doing what they have to.
> >>
> >> I did a quick prototype for this, and i don't think it is going to work.
> >>
> >> Callbacks with just "current value" as argument will be pretty limited
> >> and will only really work for cases where we know what we are expecting.
> >> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >> appears to be) dependent upon some internal device data, and is not
> >> constant like in case of net/ixgbe for example.
> >>
> >> This can be fixed by passing an opaque pointer, either by storing it in
> >> the monitor condition, or by passing it directly to rte_power_monitor at
> >> invocation time.
> >>
> >> The latter doesn't work well because when we call rte_power_monitor from
> >> inside the rte_power library, we lack the context necessary to get said
> >> opaque pointer.
> >>
> >> The former doesn't work either, because the only place where we can get
> >> this argument is inside get_monitor_addr, but the opaque pointer must
> >> persist after we exit that function in order to avoid use-after-free -
> >> which means that it either has to be statically allocated (which means
> >> it's not thread-safe for a non-trivial case), or dynamically allocated
> >> (which a big no-no on a hotpath).
> >
> > If I get you right, expected_value (and probably mask) can be variable ones.
> > So for callback approach to work we need to pass all this as parameters
> > to PMD comparison callback:
> > int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> > Correct?
> 
> If we have both expected value, mask, and current value, then what's the
> point of the callback? The point of the callback would be to pass just
> the current value, and let the callback decide what's the expected value
> and how to compare it.

For me the main point of callback is to hide PMD specific comparison semantics.
Basically they provide us with some values in struct rte_power_monitor_cond,
and then it is up to them how to interpret them in their comparison function.
All we'll do for them: will read the value at address provided. 
I understand that it looks like an overkill, as majority of these comparison functions
will be like:
int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
{
	return ((real_val & mask) == expected_val);
}  
Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.

> 
> So, we can either let callback handle expected values itself by having
> an opaque callback-specific argument (which means it has to persist
> between .get_monitor_addr() and rte_power_monitor() calls), 

But that's what we doing already - PMD fills rte_power_monitor_cond values
for us, we store them somewhere and then use them to decide should we go to sleep or not.
All callback does - moves actual values interpretation back to PMD:
Right now:
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at address provided in PMC
                interpret PMC values and newly read value and make the decision

With callback:   
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at address provided in PMC
PMD:      interpret PMC values and newly read value and make the decision

Or did you mean something different here?

>or we do the
> comparisons inside rte_power_monitor(), and store the expected/mask
> values in the monitor condition, and *don't* have any callbacks at all.
> Are you suggesting an alternative to the above two options?

As I said in my first mail - we can just replace 'inverse' with 'op'.
That at least will make this API extendable, if someone will need
something different in future.    

Another option is 

> 
> >
> >>
> >> Any other suggestions? :)
> >>
> >>>
> >>>>
> >>>>>
> >>>>>> I'm not
> >>>>>> opposed to having a callback here, but maybe others have more thoughts
> >>>>>> on this?
> >>>>>>
> >>>>>> --
> >>>>>> Thanks,
> >>>>>> Anatoly
> >>>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly June 24, 2021, 2:34 p.m. UTC | #10
On 24-Jun-21 10:47 AM, Ananyev, Konstantin wrote:
> 

<snip>

>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>
>>>> Callbacks with just "current value" as argument will be pretty limited
>>>> and will only really work for cases where we know what we are expecting.
>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>> appears to be) dependent upon some internal device data, and is not
>>>> constant like in case of net/ixgbe for example.
>>>>
>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>> invocation time.
>>>>
>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>> inside the rte_power library, we lack the context necessary to get said
>>>> opaque pointer.
>>>>
>>>> The former doesn't work either, because the only place where we can get
>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>> persist after we exit that function in order to avoid use-after-free -
>>>> which means that it either has to be statically allocated (which means
>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>> (which a big no-no on a hotpath).
>>>
>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>> So for callback approach to work we need to pass all this as parameters
>>> to PMD comparison callback:
>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>> Correct?
>>
>> If we have both expected value, mask, and current value, then what's the
>> point of the callback? The point of the callback would be to pass just
>> the current value, and let the callback decide what's the expected value
>> and how to compare it.
> 
> For me the main point of callback is to hide PMD specific comparison semantics.
> Basically they provide us with some values in struct rte_power_monitor_cond,
> and then it is up to them how to interpret them in their comparison function.
> All we'll do for them: will read the value at address provided.
> I understand that it looks like an overkill, as majority of these comparison functions
> will be like:
> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> {
>          return ((real_val & mask) == expected_val);
> }
> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> 
>>
>> So, we can either let callback handle expected values itself by having
>> an opaque callback-specific argument (which means it has to persist
>> between .get_monitor_addr() and rte_power_monitor() calls),
> 
> But that's what we doing already - PMD fills rte_power_monitor_cond values
> for us, we store them somewhere and then use them to decide should we go to sleep or not.
> All callback does - moves actual values interpretation back to PMD:
> Right now:
> PMD:      provide PMC values
> POWER: store PMC values somewhere
>                  read the value at address provided in PMC
>                  interpret PMC values and newly read value and make the decision
> 
> With callback:
> PMD:      provide PMC values
> POWER: store PMC values somewhere
>                  read the value at address provided in PMC
> PMD:      interpret PMC values and newly read value and make the decision
> 
> Or did you mean something different here?
> 
>> or we do the
>> comparisons inside rte_power_monitor(), and store the expected/mask
>> values in the monitor condition, and *don't* have any callbacks at all.
>> Are you suggesting an alternative to the above two options?
> 
> As I said in my first mail - we can just replace 'inverse' with 'op'.
> That at least will make this API extendable, if someone will need
> something different in future.
> 
> Another option is

Right, so the idea is store the PMD-specific data in the monitor 
condition, and leave it to the callback to interpret it.

The obvious question then is, how many values is enough? Two? Three? 
Four? This option doesn't really solve the basic issue, it just kicks 
the can down the road. I guess three values should be enough for 
everyone (tm) ? :D

I don't like the 'op' thing because if the goal is to be flexible, it's 
unnecessarily limiting *and* makes the API even more complex to use. I 
would rather have a number of PMD-specific values and leave it up to the 
callback to interpret them, because at least that way we're not limited 
to predefined operations on the monitor condition data.
  
Ananyev, Konstantin June 24, 2021, 2:57 p.m. UTC | #11
> >>>> I did a quick prototype for this, and i don't think it is going to work.
> >>>>
> >>>> Callbacks with just "current value" as argument will be pretty limited
> >>>> and will only really work for cases where we know what we are expecting.
> >>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >>>> appears to be) dependent upon some internal device data, and is not
> >>>> constant like in case of net/ixgbe for example.
> >>>>
> >>>> This can be fixed by passing an opaque pointer, either by storing it in
> >>>> the monitor condition, or by passing it directly to rte_power_monitor at
> >>>> invocation time.
> >>>>
> >>>> The latter doesn't work well because when we call rte_power_monitor from
> >>>> inside the rte_power library, we lack the context necessary to get said
> >>>> opaque pointer.
> >>>>
> >>>> The former doesn't work either, because the only place where we can get
> >>>> this argument is inside get_monitor_addr, but the opaque pointer must
> >>>> persist after we exit that function in order to avoid use-after-free -
> >>>> which means that it either has to be statically allocated (which means
> >>>> it's not thread-safe for a non-trivial case), or dynamically allocated
> >>>> (which a big no-no on a hotpath).
> >>>
> >>> If I get you right, expected_value (and probably mask) can be variable ones.
> >>> So for callback approach to work we need to pass all this as parameters
> >>> to PMD comparison callback:
> >>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>> Correct?
> >>
> >> If we have both expected value, mask, and current value, then what's the
> >> point of the callback? The point of the callback would be to pass just
> >> the current value, and let the callback decide what's the expected value
> >> and how to compare it.
> >
> > For me the main point of callback is to hide PMD specific comparison semantics.
> > Basically they provide us with some values in struct rte_power_monitor_cond,
> > and then it is up to them how to interpret them in their comparison function.
> > All we'll do for them: will read the value at address provided.
> > I understand that it looks like an overkill, as majority of these comparison functions
> > will be like:
> > int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> > {
> >          return ((real_val & mask) == expected_val);
> > }
> > Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> >
> >>
> >> So, we can either let callback handle expected values itself by having
> >> an opaque callback-specific argument (which means it has to persist
> >> between .get_monitor_addr() and rte_power_monitor() calls),
> >
> > But that's what we doing already - PMD fills rte_power_monitor_cond values
> > for us, we store them somewhere and then use them to decide should we go to sleep or not.
> > All callback does - moves actual values interpretation back to PMD:
> > Right now:
> > PMD:      provide PMC values
> > POWER: store PMC values somewhere
> >                  read the value at address provided in PMC
> >                  interpret PMC values and newly read value and make the decision
> >
> > With callback:
> > PMD:      provide PMC values
> > POWER: store PMC values somewhere
> >                  read the value at address provided in PMC
> > PMD:      interpret PMC values and newly read value and make the decision
> >
> > Or did you mean something different here?
> >
> >> or we do the
> >> comparisons inside rte_power_monitor(), and store the expected/mask
> >> values in the monitor condition, and *don't* have any callbacks at all.
> >> Are you suggesting an alternative to the above two options?
> >
> > As I said in my first mail - we can just replace 'inverse' with 'op'.
> > That at least will make this API extendable, if someone will need
> > something different in future.
> >
> > Another option is
> 
> Right, so the idea is store the PMD-specific data in the monitor
> condition, and leave it to the callback to interpret it.
> 
> The obvious question then is, how many values is enough? Two? Three?
> Four? This option doesn't really solve the basic issue, it just kicks
> the can down the road. I guess three values should be enough for
> everyone (tm) ? :D
> 
> I don't like the 'op' thing because if the goal is to be flexible, it's
> unnecessarily limiting *and* makes the API even more complex to use. I
> would rather have a number of PMD-specific values and leave it up to the
> callback to interpret them, because at least that way we're not limited
> to predefined operations on the monitor condition data.

Just to make sure we are talking about the same, does what you propose
looks like that:

 struct rte_power_monitor_cond {
        volatile void *addr;  /**< Address to monitor for changes */
        uint8_t size;    /**< Data size (in bytes) that will be used to compare
                          *   expected value (`val`) with data read from the
                          *   monitored memory location (`addr`). Can be 1, 2,
                          *   4, or 8. Supplying any other value will result in
                          *   an error.  
                          */
        int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
        uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
};

And then in rte_power_monitor():
...
uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
if (pmc->cmp(cur_value, pmc->opaque) != 0) {
    /* goto sleep */
}

?
  
Burakov, Anatoly June 24, 2021, 3:04 p.m. UTC | #12
On 24-Jun-21 3:57 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>>>
>>>>>> Callbacks with just "current value" as argument will be pretty limited
>>>>>> and will only really work for cases where we know what we are expecting.
>>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>>>> appears to be) dependent upon some internal device data, and is not
>>>>>> constant like in case of net/ixgbe for example.
>>>>>>
>>>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>>>> invocation time.
>>>>>>
>>>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>>>> inside the rte_power library, we lack the context necessary to get said
>>>>>> opaque pointer.
>>>>>>
>>>>>> The former doesn't work either, because the only place where we can get
>>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>>>> persist after we exit that function in order to avoid use-after-free -
>>>>>> which means that it either has to be statically allocated (which means
>>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>>>> (which a big no-no on a hotpath).
>>>>>
>>>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>>>> So for callback approach to work we need to pass all this as parameters
>>>>> to PMD comparison callback:
>>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>> Correct?
>>>>
>>>> If we have both expected value, mask, and current value, then what's the
>>>> point of the callback? The point of the callback would be to pass just
>>>> the current value, and let the callback decide what's the expected value
>>>> and how to compare it.
>>>
>>> For me the main point of callback is to hide PMD specific comparison semantics.
>>> Basically they provide us with some values in struct rte_power_monitor_cond,
>>> and then it is up to them how to interpret them in their comparison function.
>>> All we'll do for them: will read the value at address provided.
>>> I understand that it looks like an overkill, as majority of these comparison functions
>>> will be like:
>>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>> {
>>>           return ((real_val & mask) == expected_val);
>>> }
>>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
>>>
>>>>
>>>> So, we can either let callback handle expected values itself by having
>>>> an opaque callback-specific argument (which means it has to persist
>>>> between .get_monitor_addr() and rte_power_monitor() calls),
>>>
>>> But that's what we doing already - PMD fills rte_power_monitor_cond values
>>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
>>> All callback does - moves actual values interpretation back to PMD:
>>> Right now:
>>> PMD:      provide PMC values
>>> POWER: store PMC values somewhere
>>>                   read the value at address provided in PMC
>>>                   interpret PMC values and newly read value and make the decision
>>>
>>> With callback:
>>> PMD:      provide PMC values
>>> POWER: store PMC values somewhere
>>>                   read the value at address provided in PMC
>>> PMD:      interpret PMC values and newly read value and make the decision
>>>
>>> Or did you mean something different here?
>>>
>>>> or we do the
>>>> comparisons inside rte_power_monitor(), and store the expected/mask
>>>> values in the monitor condition, and *don't* have any callbacks at all.
>>>> Are you suggesting an alternative to the above two options?
>>>
>>> As I said in my first mail - we can just replace 'inverse' with 'op'.
>>> That at least will make this API extendable, if someone will need
>>> something different in future.
>>>
>>> Another option is
>>
>> Right, so the idea is store the PMD-specific data in the monitor
>> condition, and leave it to the callback to interpret it.
>>
>> The obvious question then is, how many values is enough? Two? Three?
>> Four? This option doesn't really solve the basic issue, it just kicks
>> the can down the road. I guess three values should be enough for
>> everyone (tm) ? :D
>>
>> I don't like the 'op' thing because if the goal is to be flexible, it's
>> unnecessarily limiting *and* makes the API even more complex to use. I
>> would rather have a number of PMD-specific values and leave it up to the
>> callback to interpret them, because at least that way we're not limited
>> to predefined operations on the monitor condition data.
> 
> Just to make sure we are talking about the same, does what you propose
> looks like that:
> 
>   struct rte_power_monitor_cond {
>          volatile void *addr;  /**< Address to monitor for changes */
>          uint8_t size;    /**< Data size (in bytes) that will be used to compare
>                            *   expected value (`val`) with data read from the
>                            *   monitored memory location (`addr`). Can be 1, 2,
>                            *   4, or 8. Supplying any other value will result in
>                            *   an error.
>                            */
>          int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
>          uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
> };
> 
> And then in rte_power_monitor():
> ...
> uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> if (pmc->cmp(cur_value, pmc->opaque) != 0) {
>      /* goto sleep */
> }
> 
> ?
> 

Something like that, yes.
  
Ananyev, Konstantin June 24, 2021, 3:25 p.m. UTC | #13
> >>>>>> I did a quick prototype for this, and i don't think it is going to work.
> >>>>>>
> >>>>>> Callbacks with just "current value" as argument will be pretty limited
> >>>>>> and will only really work for cases where we know what we are expecting.
> >>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >>>>>> appears to be) dependent upon some internal device data, and is not
> >>>>>> constant like in case of net/ixgbe for example.
> >>>>>>
> >>>>>> This can be fixed by passing an opaque pointer, either by storing it in
> >>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
> >>>>>> invocation time.
> >>>>>>
> >>>>>> The latter doesn't work well because when we call rte_power_monitor from
> >>>>>> inside the rte_power library, we lack the context necessary to get said
> >>>>>> opaque pointer.
> >>>>>>
> >>>>>> The former doesn't work either, because the only place where we can get
> >>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
> >>>>>> persist after we exit that function in order to avoid use-after-free -
> >>>>>> which means that it either has to be statically allocated (which means
> >>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
> >>>>>> (which a big no-no on a hotpath).
> >>>>>
> >>>>> If I get you right, expected_value (and probably mask) can be variable ones.
> >>>>> So for callback approach to work we need to pass all this as parameters
> >>>>> to PMD comparison callback:
> >>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>>>> Correct?
> >>>>
> >>>> If we have both expected value, mask, and current value, then what's the
> >>>> point of the callback? The point of the callback would be to pass just
> >>>> the current value, and let the callback decide what's the expected value
> >>>> and how to compare it.
> >>>
> >>> For me the main point of callback is to hide PMD specific comparison semantics.
> >>> Basically they provide us with some values in struct rte_power_monitor_cond,
> >>> and then it is up to them how to interpret them in their comparison function.
> >>> All we'll do for them: will read the value at address provided.
> >>> I understand that it looks like an overkill, as majority of these comparison functions
> >>> will be like:
> >>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> >>> {
> >>>           return ((real_val & mask) == expected_val);
> >>> }
> >>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
> >>>
> >>>>
> >>>> So, we can either let callback handle expected values itself by having
> >>>> an opaque callback-specific argument (which means it has to persist
> >>>> between .get_monitor_addr() and rte_power_monitor() calls),
> >>>
> >>> But that's what we doing already - PMD fills rte_power_monitor_cond values
> >>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
> >>> All callback does - moves actual values interpretation back to PMD:
> >>> Right now:
> >>> PMD:      provide PMC values
> >>> POWER: store PMC values somewhere
> >>>                   read the value at address provided in PMC
> >>>                   interpret PMC values and newly read value and make the decision
> >>>
> >>> With callback:
> >>> PMD:      provide PMC values
> >>> POWER: store PMC values somewhere
> >>>                   read the value at address provided in PMC
> >>> PMD:      interpret PMC values and newly read value and make the decision
> >>>
> >>> Or did you mean something different here?
> >>>
> >>>> or we do the
> >>>> comparisons inside rte_power_monitor(), and store the expected/mask
> >>>> values in the monitor condition, and *don't* have any callbacks at all.
> >>>> Are you suggesting an alternative to the above two options?
> >>>
> >>> As I said in my first mail - we can just replace 'inverse' with 'op'.
> >>> That at least will make this API extendable, if someone will need
> >>> something different in future.
> >>>
> >>> Another option is
> >>
> >> Right, so the idea is store the PMD-specific data in the monitor
> >> condition, and leave it to the callback to interpret it.
> >>
> >> The obvious question then is, how many values is enough? Two? Three?
> >> Four? This option doesn't really solve the basic issue, it just kicks
> >> the can down the road. I guess three values should be enough for
> >> everyone (tm) ? :D
> >>
> >> I don't like the 'op' thing because if the goal is to be flexible, it's
> >> unnecessarily limiting *and* makes the API even more complex to use. I
> >> would rather have a number of PMD-specific values and leave it up to the
> >> callback to interpret them, because at least that way we're not limited
> >> to predefined operations on the monitor condition data.
> >
> > Just to make sure we are talking about the same, does what you propose
> > looks like that:
> >
> >   struct rte_power_monitor_cond {
> >          volatile void *addr;  /**< Address to monitor for changes */
> >          uint8_t size;    /**< Data size (in bytes) that will be used to compare
> >                            *   expected value (`val`) with data read from the
> >                            *   monitored memory location (`addr`). Can be 1, 2,
> >                            *   4, or 8. Supplying any other value will result in
> >                            *   an error.
> >                            */
> >          int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
> >          uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
> > };
> >
> > And then in rte_power_monitor():
> > ...
> > uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> > if (pmc->cmp(cur_value, pmc->opaque) != 0) {
> >      /* goto sleep */
> > }
> >
> > ?
> >
> 
> Something like that, yes.
> 

Seems reasonable to me.
Thanks
Konstantin
  
Burakov, Anatoly June 24, 2021, 3:54 p.m. UTC | #14
On 24-Jun-21 4:25 PM, Ananyev, Konstantin wrote:
> 
>>>>>>>> I did a quick prototype for this, and i don't think it is going to work.
>>>>>>>>
>>>>>>>> Callbacks with just "current value" as argument will be pretty limited
>>>>>>>> and will only really work for cases where we know what we are expecting.
>>>>>>>> However, for cases like event/dlb or net/mlx5, the expected value is (or
>>>>>>>> appears to be) dependent upon some internal device data, and is not
>>>>>>>> constant like in case of net/ixgbe for example.
>>>>>>>>
>>>>>>>> This can be fixed by passing an opaque pointer, either by storing it in
>>>>>>>> the monitor condition, or by passing it directly to rte_power_monitor at
>>>>>>>> invocation time.
>>>>>>>>
>>>>>>>> The latter doesn't work well because when we call rte_power_monitor from
>>>>>>>> inside the rte_power library, we lack the context necessary to get said
>>>>>>>> opaque pointer.
>>>>>>>>
>>>>>>>> The former doesn't work either, because the only place where we can get
>>>>>>>> this argument is inside get_monitor_addr, but the opaque pointer must
>>>>>>>> persist after we exit that function in order to avoid use-after-free -
>>>>>>>> which means that it either has to be statically allocated (which means
>>>>>>>> it's not thread-safe for a non-trivial case), or dynamically allocated
>>>>>>>> (which a big no-no on a hotpath).
>>>>>>>
>>>>>>> If I get you right, expected_value (and probably mask) can be variable ones.
>>>>>>> So for callback approach to work we need to pass all this as parameters
>>>>>>> to PMD comparison callback:
>>>>>>> int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>>>> Correct?
>>>>>>
>>>>>> If we have both expected value, mask, and current value, then what's the
>>>>>> point of the callback? The point of the callback would be to pass just
>>>>>> the current value, and let the callback decide what's the expected value
>>>>>> and how to compare it.
>>>>>
>>>>> For me the main point of callback is to hide PMD specific comparison semantics.
>>>>> Basically they provide us with some values in struct rte_power_monitor_cond,
>>>>> and then it is up to them how to interpret them in their comparison function.
>>>>> All we'll do for them: will read the value at address provided.
>>>>> I understand that it looks like an overkill, as majority of these comparison functions
>>>>> will be like:
>>>>> int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
>>>>> {
>>>>>            return ((real_val & mask) == expected_val);
>>>>> }
>>>>> Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.
>>>>>
>>>>>>
>>>>>> So, we can either let callback handle expected values itself by having
>>>>>> an opaque callback-specific argument (which means it has to persist
>>>>>> between .get_monitor_addr() and rte_power_monitor() calls),
>>>>>
>>>>> But that's what we doing already - PMD fills rte_power_monitor_cond values
>>>>> for us, we store them somewhere and then use them to decide should we go to sleep or not.
>>>>> All callback does - moves actual values interpretation back to PMD:
>>>>> Right now:
>>>>> PMD:      provide PMC values
>>>>> POWER: store PMC values somewhere
>>>>>                    read the value at address provided in PMC
>>>>>                    interpret PMC values and newly read value and make the decision
>>>>>
>>>>> With callback:
>>>>> PMD:      provide PMC values
>>>>> POWER: store PMC values somewhere
>>>>>                    read the value at address provided in PMC
>>>>> PMD:      interpret PMC values and newly read value and make the decision
>>>>>
>>>>> Or did you mean something different here?
>>>>>
>>>>>> or we do the
>>>>>> comparisons inside rte_power_monitor(), and store the expected/mask
>>>>>> values in the monitor condition, and *don't* have any callbacks at all.
>>>>>> Are you suggesting an alternative to the above two options?
>>>>>
>>>>> As I said in my first mail - we can just replace 'inverse' with 'op'.
>>>>> That at least will make this API extendable, if someone will need
>>>>> something different in future.
>>>>>
>>>>> Another option is
>>>>
>>>> Right, so the idea is store the PMD-specific data in the monitor
>>>> condition, and leave it to the callback to interpret it.
>>>>
>>>> The obvious question then is, how many values is enough? Two? Three?
>>>> Four? This option doesn't really solve the basic issue, it just kicks
>>>> the can down the road. I guess three values should be enough for
>>>> everyone (tm) ? :D
>>>>
>>>> I don't like the 'op' thing because if the goal is to be flexible, it's
>>>> unnecessarily limiting *and* makes the API even more complex to use. I
>>>> would rather have a number of PMD-specific values and leave it up to the
>>>> callback to interpret them, because at least that way we're not limited
>>>> to predefined operations on the monitor condition data.
>>>
>>> Just to make sure we are talking about the same, does what you propose
>>> looks like that:
>>>
>>>    struct rte_power_monitor_cond {
>>>           volatile void *addr;  /**< Address to monitor for changes */
>>>           uint8_t size;    /**< Data size (in bytes) that will be used to compare
>>>                             *   expected value (`val`) with data read from the
>>>                             *   monitored memory location (`addr`). Can be 1, 2,
>>>                             *   4, or 8. Supplying any other value will result in
>>>                             *   an error.
>>>                             */
>>>           int (*cmp)(uint64_t real_value, const uint64_t opaque[4]);
>>>           uint64_t opaque[4];  /*PMD specific data, used by comparison call-back below */
>>> };
>>>
>>> And then in rte_power_monitor():
>>> ...
>>> uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
>>> if (pmc->cmp(cur_value, pmc->opaque) != 0) {
>>>       /* goto sleep */
>>> }
>>>
>>> ?
>>>
>>
>> Something like that, yes.
>>
> 
> Seems reasonable to me.
> Thanks
> Konstantin
> 

OK, i'll implement this in v2. Thanks for your input!
  
David Marchand July 9, 2021, 3:03 p.m. UTC | #15
On Thu, Jun 24, 2021 at 4:35 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
> Right, so the idea is store the PMD-specific data in the monitor
> condition, and leave it to the callback to interpret it.
>
> The obvious question then is, how many values is enough? Two? Three?
> Four? This option doesn't really solve the basic issue, it just kicks
> the can down the road. I guess three values should be enough for
> everyone (tm) ? :D

Can we have multiple callbacks executed for a given rxq?

Since this is on the rx path, I would say no, but I might be missing
some consideration for future evols of this API.
In this case, a private field in rxq seems a good place for temporary
storage, and then we simply pass a void * opaque pointer.
  

Patch

diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
index dddca3d41c..1006c2edfc 100644
--- a/lib/eal/include/generic/rte_power_intrinsics.h
+++ b/lib/eal/include/generic/rte_power_intrinsics.h
@@ -31,6 +31,10 @@  struct rte_power_monitor_cond {
 	                  *   4, or 8. Supplying any other value will result in
 	                  *   an error.
 	                  */
+	uint8_t invert;  /**< Invert check for expected value (e.g. instead of
+	                  *   checking if `val` matches something, check if
+	                  *   `val` *doesn't* match a particular value)
+	                  */
 };
 
 /**
diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 39ea9fdecd..5d944e9aa4 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -117,7 +117,10 @@  rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t masked = cur_value & pmc->mask;
 
 		/* if the masked value is already matching, abort */
-		if (masked == pmc->val)
+		if (!pmc->invert && masked == pmc->val)
+			goto end;
+		/* same, but for inverse check */
+		if (pmc->invert && masked != pmc->val)
 			goto end;
 	}