mbox series

[v16,00/11] Add PMD power management

Message ID cover.1610473000.git.anatoly.burakov@intel.com (mailing list archive)
Headers
Series Add PMD power management |

Message

Burakov, Anatoly Jan. 12, 2021, 5:37 p.m. UTC
  This patchset proposes a simple API for Ethernet drivers to cause the  
CPU to enter a power-optimized state while waiting for packets to  
arrive. There are multiple proposed mechanisms to achieve said power
savings: simple frequency scaling, idle loop, and monitoring the Rx
queue for incoming packages. The latter is achieved through cooperation
with the NIC driver that will allow us to know address of wake up event,
and wait for writes on that address.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we employ
one of the suggested power management schemes automatically, from within
a Rx callback inside the PMD. Once there's traffic again, the empty poll
counter is reset.

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without application being responsible for it, as well 
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway).

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it with
minimal changes to the application code. The current approach allows to 
just flip a switch and automatically have power savings.

Things of note:

- Only 1:1 core to queue mapping is supported, meaning that each lcore 
  must at most handle RX on a single queue
- Support 3 type policies. Monitor/Pause/Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

v16:
- Implemented Konstantin's suggestions and comments
- Added return values to the API

v15:
- Fixed incorrect check in UMWAIT callback
- Fixed accidental whitespace changes

v14:
- Fixed ARM/PPC builds
- Addressed various review comments

v13:
- Reworked the librte_power code to require less locking and handle invalid
  parameters better
- Fix numerous rebase errors present in v12

v12:
- Rebase on top of 21.02
- Rework of power intrinsics code

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  15 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/arm/meson.build                |   1 +
 lib/librte_eal/arm/rte_power_intrinsics.c     |  38 ++
 .../include/generic/rte_power_intrinsics.h    |  88 ++---
 .../ppc/include/rte_power_intrinsics.h        |  40 --
 lib/librte_eal/ppc/meson.build                |   1 +
 lib/librte_eal/ppc/rte_power_intrinsics.c     |  38 ++
 lib/librte_eal/version.map                    |   3 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 220 +++++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 359 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 34 files changed, 1148 insertions(+), 259 deletions(-)
 create mode 100644 lib/librte_eal/arm/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/ppc/rte_power_intrinsics.c
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
  

Comments

David Marchand Jan. 14, 2021, 9:36 a.m. UTC | #1
On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers to cause the
> CPU to enter a power-optimized state while waiting for packets to
> arrive. There are multiple proposed mechanisms to achieve said power
> savings: simple frequency scaling, idle loop, and monitoring the Rx
> queue for incoming packages. The latter is achieved through cooperation
> with the NIC driver that will allow us to know address of wake up event,
> and wait for writes on that address.
>
> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> are used in their raw opcode form because there is no widespread
> compiler support for them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen to implement
> similar instructions.
>
> To achieve power savings, there is a very simple mechanism used: we're
> counting empty polls, and if a certain threshold is reached, we employ
> one of the suggested power management schemes automatically, from within
> a Rx callback inside the PMD. Once there's traffic again, the empty poll
> counter is reset.
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it with
> minimal changes to the application code. The current approach allows to
> just flip a switch and automatically have power savings.
>
> Things of note:
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>   must at most handle RX on a single queue

If we want to save power, it is likely we would poll more rxqs on a thread.


> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types
>
> v16:
> - Implemented Konstantin's suggestions and comments
> - Added return values to the API

- This revision breaks SPDK build (reported by UNH):
http://mails.dpdk.org/archives/test-report/2021-January/174069.html


- Build is broken for ARM and PPC at patch:
86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago)
<Anatoly Burakov>

Only pasting the ARM failure:
ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang'
[1/297] Compiling C object
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'.
FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o
aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib
-I../../dpdk/lib -I. -I../../dpdk/ -Iconfig -I../../dpdk/config
-Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include
-Ilib/librte_eal/linux/include
-I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include
-I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common
-I../../dpdk/lib/librte_eal/common -Ilib/librte_eal
-I../../dpdk/lib/librte_eal -Ilib/librte_kvargs
-I../../dpdk/lib/librte_kvargs
-Ilib/librte_telemetry/../librte_metrics
-I../../dpdk/lib/librte_telemetry/../librte_metrics
-Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry
-fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall
-Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual
-Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security
-Wmissing-declarations -Wmissing-prototypes -Wnested-externs
-Wold-style-definition -Wpointer-arith -Wsign-compare
-Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned
-Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc
-DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation
'-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF
'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d'
-o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'
-c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c
../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error:
conflicting types for ‘rte_power_monitor_wakeup’
 rte_power_monitor_wakeup(const unsigned int lcore_id)
 ^~~~~~~~~~~~~~~~~~~~~~~~
In file included from
../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14,
                 from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5:
../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5:
note: previous declaration of ‘rte_power_monitor_wakeup’ was here
 int rte_power_monitor_wakeup(const unsigned int lcore_id);
     ^~~~~~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.



- The ABI check is still not happy as I reported earlier.
Reproduced on v16 (GHA had a hiccup on this revision, but previous
ones had the failure too):

1 Changed variable:

  [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1:
    type of variable changed:
      array element type 'struct rte_eth_dev' changed:
        type size hasn't changed
        1 data member change:
          type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed:
            in pointed to type 'const eth_dev_ops':
              in unqualified underlying type 'struct eth_dev_ops' at
rte_ethdev_driver.h:789:1:
                type size changed from 6208 to 6272 (in bits)
                1 data member insertion:
                  'eth_get_monitor_addr_t
eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at
rte_ethdev_driver.h:940:1
                no data member changes (94 filtered);
      type size hasn't changed

Error: ABI issue reported for 'abidiff --suppr
/home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore
--no-added-syms --headers-dir1
/home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include
--headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include
/home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump
/home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump'

ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged
this as a potential issue).

One solution is to add an exception on the eth_dev_ops structure.

--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -7,3 +7,7 @@
         symbol_version = INTERNAL
 [suppress_variable]
         symbol_version = INTERNAL
+
+; Explicit ignore for driver-only ABI
+[suppress_type]
+        name = eth_dev_ops
  
Burakov, Anatoly Jan. 14, 2021, 10:25 a.m. UTC | #2
On 14-Jan-21 9:36 AM, David Marchand wrote:
> On Tue, Jan 12, 2021 at 6:37 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> This patchset proposes a simple API for Ethernet drivers to cause the
>> CPU to enter a power-optimized state while waiting for packets to
>> arrive. There are multiple proposed mechanisms to achieve said power
>> savings: simple frequency scaling, idle loop, and monitoring the Rx
>> queue for incoming packages. The latter is achieved through cooperation
>> with the NIC driver that will allow us to know address of wake up event,
>> and wait for writes on that address.
>>
>> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
>> are used in their raw opcode form because there is no widespread
>> compiler support for them yet. Still, the API is made generic enough to
>> hopefully support other architectures, if they happen to implement
>> similar instructions.
>>
>> To achieve power savings, there is a very simple mechanism used: we're
>> counting empty polls, and if a certain threshold is reached, we employ
>> one of the suggested power management schemes automatically, from within
>> a Rx callback inside the PMD. Once there's traffic again, the empty poll
>> counter is reset.
>>
>> This patchset also introduces a few changes into existing power
>> management-related intrinsics, namely to provide a native way of waking
>> up a sleeping core without application being responsible for it, as well
>> as general robustness improvements. There's quite a bit of locking going
>> on, but these locks are per-thread and very little (if any) contention
>> is expected, so the performance impact shouldn't be that bad (and in any
>> case the locking happens when we're about to sleep anyway).
>>
>> Why are we putting it into ethdev as opposed to leaving this up to the
>> application? Our customers specifically requested a way to do it with
>> minimal changes to the application code. The current approach allows to
>> just flip a switch and automatically have power savings.
>>
>> Things of note:
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
> 
> If we want to save power, it is likely we would poll more rxqs on a thread.

We are investigating possibilities to make that happen, but for this 
patchset, this is the limitation.

> 
> 
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
>>
>> v16:
>> - Implemented Konstantin's suggestions and comments
>> - Added return values to the API
> 
> - This revision breaks SPDK build (reported by UNH):
> http://mails.dpdk.org/archives/test-report/2021-January/174069.html
> 
> 
> - Build is broken for ARM and PPC at patch:
> 86491d5bd4 - (HEAD) eal: add monitor wakeup function (25 minutes ago)
> <Anatoly Burakov>
> 
> Only pasting the ARM failure:
> ninja: Entering directory `/home/dmarchan/builds/build-arm64-host-clang'
> [1/297] Compiling C object
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'.
> FAILED: lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o
> aarch64-linux-gnu-gcc -Ilib/76b5a35@@rte_eal@sta -Ilib
> -I../../dpdk/lib -I. -I../../dpdk/ -Iconfig -I../../dpdk/config
> -Ilib/librte_eal/include -I../../dpdk/lib/librte_eal/include
> -Ilib/librte_eal/linux/include
> -I../../dpdk/lib/librte_eal/linux/include -Ilib/librte_eal/arm/include
> -I../../dpdk/lib/librte_eal/arm/include -Ilib/librte_eal/common
> -I../../dpdk/lib/librte_eal/common -Ilib/librte_eal
> -I../../dpdk/lib/librte_eal -Ilib/librte_kvargs
> -I../../dpdk/lib/librte_kvargs
> -Ilib/librte_telemetry/../librte_metrics
> -I../../dpdk/lib/librte_telemetry/../librte_metrics
> -Ilib/librte_telemetry -I../../dpdk/lib/librte_telemetry
> -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall
> -Winvalid-pch -Werror -O2 -g -include rte_config.h -Wextra -Wcast-qual
> -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security
> -Wmissing-declarations -Wmissing-prototypes -Wnested-externs
> -Wold-style-definition -Wpointer-arith -Wsign-compare
> -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned
> -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8-a+crc
> -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -Wno-format-truncation
> '-DABI_VERSION="21.1"' -DRTE_LIBEAL_USE_GETENTROPY -MD -MQ
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o' -MF
> 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o.d'
> -o 'lib/76b5a35@@rte_eal@sta/librte_eal_arm_rte_power_intrinsics.c.o'
> -c ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c
> ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:35:1: error:
> conflicting types for ‘rte_power_monitor_wakeup’
>   rte_power_monitor_wakeup(const unsigned int lcore_id)
>   ^~~~~~~~~~~~~~~~~~~~~~~~
> In file included from
> ../../dpdk/lib/librte_eal/arm/include/rte_power_intrinsics.h:14,
>                   from ../../dpdk/lib/librte_eal/arm/rte_power_intrinsics.c:5:
> ../../dpdk/lib/librte_eal/include/generic/rte_power_intrinsics.h:79:5:
> note: previous declaration of ‘rte_power_monitor_wakeup’ was here
>   int rte_power_monitor_wakeup(const unsigned int lcore_id);
>       ^~~~~~~~~~~~~~~~~~~~~~~~
> ninja: build stopped: subcommand failed.

Woops, wrong return value in the .c files. Will fix!

> 
> 
> 
> - The ABI check is still not happy as I reported earlier.
> Reproduced on v16 (GHA had a hiccup on this revision, but previous
> ones had the failure too):
> 
> 1 Changed variable:
> 
>    [C] 'rte_eth_dev rte_eth_devices[]' was changed at rte_ethdev_core.h:196:1:
>      type of variable changed:
>        array element type 'struct rte_eth_dev' changed:
>          type size hasn't changed
>          1 data member change:
>            type of 'const eth_dev_ops* rte_eth_dev::dev_ops' changed:
>              in pointed to type 'const eth_dev_ops':
>                in unqualified underlying type 'struct eth_dev_ops' at
> rte_ethdev_driver.h:789:1:
>                  type size changed from 6208 to 6272 (in bits)
>                  1 data member insertion:
>                    'eth_get_monitor_addr_t
> eth_dev_ops::get_monitor_addr', at offset 6208 (in bits) at
> rte_ethdev_driver.h:940:1
>                  no data member changes (94 filtered);
>        type size hasn't changed
> 
> Error: ABI issue reported for 'abidiff --suppr
> /home/dmarchan/dpdk/devtools/../devtools/libabigail.abignore
> --no-added-syms --headers-dir1
> /home/dmarchan/abi/v20.11/build-gcc-static/usr/local/include
> --headers-dir2 /home/dmarchan/builds/build-gcc-static/install/usr/local/include
> /home/dmarchan/abi/v20.11/build-gcc-static/dump/librte_ethdev.dump
> /home/dmarchan/builds/build-gcc-static/install/dump/librte_ethdev.dump'
> 
> ABIDIFF_ABI_CHANGE, this change requires a review (abidiff flagged
> this as a potential issue).
> 
> One solution is to add an exception on the eth_dev_ops structure.
> 
> --- a/devtools/libabigail.abignore
> +++ b/devtools/libabigail.abignore
> @@ -7,3 +7,7 @@
>           symbol_version = INTERNAL
>   [suppress_variable]
>           symbol_version = INTERNAL
> +
> +; Explicit ignore for driver-only ABI
> +[suppress_type]
> +        name = eth_dev_ops
> 
> 

Right, OK. I didn't realize an "exception" is something you actually do 
in code, not an ad-hoc community process type thing :) I'll add this in 
the next revision.