[dpdk-dev,RFC] ethdev: abstraction layer for QoS hierarchical scheduler

Message ID 1480529810-95280-1-git-send-email-cristian.dumitrescu@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Checks

Context: checkpatch/checkpatch
Check: warning
Description: coding style issues

Commit Message

Cristian Dumitrescu Nov. 30, 2016, 6:16 p.m. UTC
  This RFC proposes an ethdev-based abstraction layer for a Quality of Service
(QoS) hierarchical scheduler. The goal of the abstraction layer is to provide a
simple generic API that is agnostic of the underlying implementation, whether
HW-based, SW-based or a complex mix of the two.

Q1: What is the benefit of having an abstraction layer for a QoS hierarchical
scheduler?
A1: There is growing interest in the industry in handling various HW-based,
SW-based or mixed hierarchical scheduler implementations through a unified DPDK
API.

Q2: Which devices are targeted by this abstraction layer?
A2: All current and future devices that expose a hierarchical scheduler feature
under DPDK, including NICs, FPGAs, ASICs, SoCs and SW libraries.

Q3: Which scheduler hierarchies are supported by the API?
A3: Hopefully any scheduler hierarchy can be described and covered by the
current API. Of course, functional correctness, accuracy and performance levels
depend on the specific implementations of this API.

Q4: Why have this abstraction layer into ethdev as opposed to a new type of
device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
A4: Packets are sent to the Ethernet device using the ethdev API
rte_eth_tx_burst() function, with the hierarchical scheduling taking place
automatically (i.e. no SW intervention) in HW implementations. Basically,
hierarchical scheduling is performed as part of the packet TX operation.
The hierarchical scheduler is typically the last stage before packet TX and it
is tightly integrated with the TX stage. The hierarchical scheduler is just
another offload feature of the Ethernet device, which needs to be accommodated
by the ethdev API similar to any other offload feature (such as RSS, DCB,
flow director, etc).
Once the decision to schedule a specific packet has been taken, this packet
cannot be dropped and it has to be sent over the wire as is, otherwise what
takes place on the wire is not what was planned at scheduling time, so the
scheduling is not accurate (Note: there are some devices which allow prepending
headers to the packet after the scheduling stage at the expense of sending
correction requests back to the scheduler, but this only strengthens the bond
between scheduling and TX).

Q5: Given that the packet scheduling takes place automatically for pure HW
implementations, how does packet scheduling take place for poll-mode SW
implementations?
A5: The API-provided function rte_eth_sched_run() is designed to take care of
this. For HW implementations, this function typically does nothing. For SW
implementations, this function is typically expected to dequeue packets from the
hierarchical scheduler and write them to the Ethernet device TX queue, to
periodically flush any enqueue-side buffers into the hierarchical scheduler for
burst-oriented implementations, etc.
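
For illustration, a minimal TX lcore loop for a poll-mode SW implementation
might look as follows. This is just a sketch against the proposed API;
force_quit, pkts[] and N_PKTS are application-side assumptions, and the
retry/free handling for unsent packets is omitted:

    while (!force_quit) {
            /* Enqueue side: packets enter the scheduler via regular TX. */
            uint16_t n_sent = rte_eth_tx_burst(port_id, 0, pkts, N_PKTS);

            /* Dequeue side: drive the scheduler so that scheduled packets
             * reach the wire; expected to be a no-op on pure HW ports. */
            rte_eth_sched_run(port_id);
            (void)n_sent;
    }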

Q6: Which are the scheduling algorithms supported?
A6: The fundamental scheduling algorithms that are supported are Strict Priority
(SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
the level of each node of the scheduling hierarchy, regardless of the node
level/position in the tree. The SP algorithm is used to schedule between sibling
nodes with different priority, while WFQ is used to schedule between groups of
siblings that have the same priority.
Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR
(DWRR), etc are considered approximations of the ideal WFQ and are therefore
assimilated to WFQ, although an associated implementation-dependent accuracy,
performance and resource usage trade-off might exist.
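
As a concrete illustration of the SP/WFQ split (a sketch, not part of the API):
among siblings of the same priority, WFQ gives sibling i a long-run share of
weight[i] divided by the sum of all sibling weights, while SP always serves any
sibling group of numerically lower (i.e. higher) priority first:

    /* Expected long-run bandwidth fraction of sibling i under ideal WFQ.
     * E.g. weights {1, 2, 5} yield shares of 12.5%, 25% and 62.5%. */
    static double
    wfq_share(const uint32_t *weights, uint32_t n_siblings, uint32_t i)
    {
            uint64_t sum = 0;
            uint32_t j;

            for (j = 0; j < n_siblings; j++)
                    sum += weights[j];
            return (double)weights[i] / (double)sum;
    }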

Q7: Which are the supported congestion management algorithms?
A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
available for every leaf node in the hierarchy, subject to the specific
implementation supporting them.

Q8: Is traffic shaping supported?
A8: Yes, there are a number of shapers (rate limiters) that can be supported for
each node in the hierarchy (built-in limit is currently set to 4 per node). Each
shaper can be private to a node (used only by that node) or shared between
multiple nodes.

Q9: What is the purpose of having shaper profiles and WRED profiles?
A9: In most implementations, many shapers share the same configuration
parameters, so defining shaper profiles simplifies the configuration task. The
same considerations apply to WRED contexts and profiles.
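
As a sketch of the profile/instance split using the functions proposed here
(the IDs are arbitrary application-chosen values and the 10 Mbps figures are
examples only):

    /* Define one shaper profile, then share it across many shapers. */
    struct rte_eth_sched_shaper_params sp = {
            .rate = 10000000 / 8, /* 10 Mbps, in bytes per second */
            .size = 16 * 1024,    /* token bucket size, in bytes */
    };
    uint32_t shaper_id;

    rte_eth_sched_shaper_profile_add(port_id, 0, &sp);
    for (shaper_id = 0; shaper_id < 64; shaper_id++)
            rte_eth_sched_shaper_add(port_id, shaper_id, 0);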

Q10: How is the scheduling hierarchy defined and created?
A10: The scheduler hierarchy tree is set up by creating new nodes and
connecting them to other existing nodes, which thus become parent nodes. The
unique ID that is assigned to each node when the node is created is further
used to update the node configuration or to connect child nodes to it. The leaf
nodes of the scheduler hierarchy are each attached to one of the Ethernet
device TX queues.
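
For example, a minimal one-root, four-leaf hierarchy could be built as below.
This is a sketch only: node IDs are application-chosen, error checking is
omitted, and since the RFC does not spell out how the root's missing parent is
expressed, RTE_ETH_SCHED_NODE_NULL is assumed for it:

    struct rte_eth_sched_node_params root = { .is_leaf = 0 };
    uint32_t q;

    /* Root node: parent ID assumed to be RTE_ETH_SCHED_NODE_NULL. */
    rte_eth_sched_node_add(port_id, 0, RTE_ETH_SCHED_NODE_NULL, &root);

    for (q = 0; q < 4; q++) {
            struct rte_eth_sched_node_params leaf = {
                    .priority = 0, /* all leaves in one WFQ group */
                    .weight = 1,
                    .shaper_id = {
                            RTE_ETH_SCHED_SHAPER_ID_NONE,
                            RTE_ETH_SCHED_SHAPER_ID_NONE,
                            RTE_ETH_SCHED_SHAPER_ID_NONE,
                            RTE_ETH_SCHED_SHAPER_ID_NONE,
                    },
                    .is_leaf = 1,
                    .leaf = {
                            .queue_id = q, /* attach leaf to TX queue q */
                            .cman = RTE_ETH_SCHED_CMAN_TAIL_DROP,
                    },
            };

            /* Leaf IDs 100..103, all children of the root (node 0). */
            rte_eth_sched_node_add(port_id, 100 + q, 0, &leaf);
    }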

Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
A11: Yes. The actual changes take place subject to the specific implementation
supporting them; otherwise an error code is returned.

Q12: What is the typical function call sequence to set up and run the Ethernet
device scheduler?
A12: The typical simplified function call sequence is listed below:
i) Configure the Ethernet device and its TX queues: rte_eth_dev_configure(),
rte_eth_tx_queue_setup()
ii) Create WRED profiles and WRED contexts, shaper profiles and shapers:
rte_eth_sched_wred_profile_add(), rte_eth_sched_wred_context_add(),
rte_eth_sched_shaper_profile_add(), rte_eth_sched_shaper_add()
iii) Create the scheduler hierarchy nodes and tree: rte_eth_sched_node_add()
iv) Freeze the start-up hierarchy and ask the device whether it supports it:
rte_eth_sched_hierarchy_set()
v) Start the Ethernet port: rte_eth_dev_start()
vi) Run-time scheduler hierarchy updates: rte_eth_sched_node_add(),
rte_eth_sched_node_<attribute>_set()
vii) Run-time packet enqueue into the hierarchical scheduler: rte_eth_tx_burst()
viii) Run-time support for SW poll-mode implementations (see previous answer):
rte_eth_sched_run()
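
A condensed sketch of steps i) to v) above (port_conf and the hierarchy-building
helper are application-side assumptions; see the earlier answers for the profile
and node calls):

    rte_eth_dev_configure(port_id, 1, 1, &port_conf);
    rte_eth_tx_queue_setup(port_id, 0, 512,
            rte_eth_dev_socket_id(port_id), NULL);

    app_build_hierarchy(port_id); /* hypothetical helper: steps ii) + iii) */

    /* iv) Freeze the start-up hierarchy; clear it if not supported. */
    if (rte_eth_sched_hierarchy_set(port_id, 1) != 0)
            rte_exit(EXIT_FAILURE, "Hierarchy not supported by port\n");

    rte_eth_dev_start(port_id); /* v) */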

Q13: Which are the possible options for the user when the Ethernet port does not
support the scheduling hierarchy required by the user?
A13: The following options are available to the user:
i) abort
ii) try out a new hierarchy (e.g. with fewer leaf nodes), if acceptable
iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
front-end implementing the hierarchical scheduler (e.g. existing DPDK library
librte_sched); instantiate the new device type on-the-fly and check if the
hierarchy requirements can be met by the new device.
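
Option ii) can be sketched as a retry around the freeze call; the
hierarchy-building helpers here are hypothetical, and clear_on_fail = 1 is used
so that each rejected attempt is wiped before the next one:

    app_build_full_hierarchy(port_id);
    if (rte_eth_sched_hierarchy_set(port_id, 1) != 0) {
            /* Fall back to a smaller hierarchy, e.g. fewer leaf nodes. */
            app_build_reduced_hierarchy(port_id);
            if (rte_eth_sched_hierarchy_set(port_id, 1) != 0)
                    rte_exit(EXIT_FAILURE, "No supported hierarchy found\n");
    }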


Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
---
 lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 794 insertions(+)
 mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h
  

Comments

Stephen Hemminger Dec. 6, 2016, 7:51 p.m. UTC | #1
On Wed, 30 Nov 2016 18:16:50 +0000
Cristian Dumitrescu <cristian.dumitrescu@intel.com> wrote:

> [full RFC cover letter quoted verbatim, snipped]

This seems to be more of an abstraction of existing QoS.
Why not something like Linux Qdisc or FreeBSD DummyNet/PF/ALTQ where the QoS
components are stackable objects? And why not make it the same as existing
OS abstractions? Rather than reinventing wheel which seems to be DPDK Standard
Procedure, could an existing abstraction be used?
  
Thomas Monjalon Dec. 6, 2016, 10:14 p.m. UTC | #2
2016-12-06 11:51, Stephen Hemminger:
> Rather than reinventing wheel which seems to be DPDK Standard
> Procedure, could an existing abstraction be used?

Stephen, you know that the DPDK standard procedure is to consider
reviews and good comments ;)
  
Alan Robertson Dec. 7, 2016, 10:58 a.m. UTC | #3
Hi Cristian,

Looking at points 10 and 11 it's good to hear nodes can be dynamically added.

We've been trying to decide the best way to do this for support of qos on tunnels for
some time now and the existing implementation doesn't allow this so effectively ruled
out hierarchical queueing for tunnel targets on the output interface.

Having said that, has thought been given to separating the queueing from being
so closely tied to the Ethernet transmit process? When queueing on a tunnel,
for example, we may be working with encryption. When running with an anti-replay
window it is really much better to do the QoS (packet reordering) before the
encryption. To support this, would it be possible to have a separate scheduler
structure which can be passed into the scheduling API? This means the calling
code can hang the structure off whatever entity it wishes to perform QoS on,
and we get dynamic target support (sessions/tunnels etc).

Regarding the structure allocation, would it be possible to make the number of
queues associated with a TC a compile-time option which the scheduler would
accommodate? We frequently only use one queue per TC, which means 75% of the
space allocated at the queueing layer for that TC is never used. This may be
specific to our implementation, but if other implementations do the same, it
would help if folks could say so, as we would get a better idea of whether this
is a common case.

Whilst touching on the scheduler, the token replenishment works using a
division and multiplication, obviously to cater for the fact that it may be run
after several TC windows have passed. The most commonly used industrial
scheduler simply checks the elapsed time on the TC and then adds the bc. This
relies on the scheduler being called within the TC window though. It would be
nice to have this as a configurable option since it's much more efficient,
assuming the infra code from which it's called can guarantee the calling
frequency.
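
For illustration (the names below are hypothetical, loosely modelled on
librte_sched internals, not taken from the patch), the two replenishment styles
contrasted above look roughly like this:

    struct tc_state {
            uint64_t tc_time;    /* start of the current TC window (cycles) */
            uint64_t tc_period;  /* TC window length (cycles) */
            uint64_t tc_credits; /* available credits (bytes) */
            uint64_t tc_bc;      /* credits added per window ("bc") */
    };

    /* Division + multiplication: catches up over any number of elapsed
     * windows, so it is safe to call infrequently. */
    static void
    tc_refill_divmul(struct tc_state *tc, uint64_t now)
    {
            uint64_t n = (now - tc->tc_time) / tc->tc_period;

            tc->tc_credits += n * tc->tc_bc;
            tc->tc_time += n * tc->tc_period;
    }

    /* Elapsed check + add one bc: cheaper, but only correct when called
     * at least once per TC window. */
    static void
    tc_refill_lapsed(struct tc_state *tc, uint64_t now)
    {
            if (now - tc->tc_time >= tc->tc_period) {
                    tc->tc_credits += tc->tc_bc;
                    tc->tc_time = now;
            }
    }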

I hope you'll consider these points for inclusion into a future road map.  Hopefully in the
future my employer will increase the priority of some of the tasks and a PR may appear
on the mailing list.

Thanks,
Alan.

Subject: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler
Date: Wed, 30 Nov 2016 18:16:50 +0000
From: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
To: dev@dpdk.org
CC: cristian.dumitrescu@intel.com



[RFC cover letter repeated verbatim, snipped]

Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
---
 lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 794 insertions(+)
 mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
old mode 100644
new mode 100755
index 9678179..d4d8604
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -182,6 +182,8 @@ extern "C" {
 #include <rte_pci.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
+#include <rte_meter.h>
+#include <rte_red.h>
 #include "rte_ether.h"
 #include "rte_eth_ctrl.h"
 #include "rte_dev_info.h"

@@ -1038,6 +1040,152 @@ TAILQ_HEAD(rte_eth_dev_cb_list, rte_eth_dev_callback);
 /**< l2 tunnel forwarding mask */
 #define ETH_L2_TUNNEL_FORWARDING_MASK   0x00000008
 
+/**
+ * Scheduler configuration
+ */
+
+/**< Max number of shapers per node */
+#define RTE_ETH_SCHED_SHAPERS_PER_NODE                     4
+/**< Invalid shaper ID */
+#define RTE_ETH_SCHED_SHAPER_ID_NONE                       UINT32_MAX
+/**< Max number of WRED contexts per node */
+#define RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE               4
+/**< Invalid WRED context ID */
+#define RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE                 UINT32_MAX
+/**< Invalid node ID */
+#define RTE_ETH_SCHED_NODE_NULL                            UINT32_MAX
+
+/**
+  * Congestion management (CMAN) mode
+  *
+  * This is used for controlling the admission of packets into a packet queue or
+  * group of packet queues on congestion. On request of writing a new packet
+  * into the current queue while the queue is full, the *tail drop* algorithm
+  * drops the new packet while leaving the queue unmodified, as opposed to the
+  * *head drop* algorithm, which drops the packet at the head of the queue (the
+  * oldest packet waiting in the queue) and admits the new packet at the tail of
+  * the queue.
+  *
+  * The *Random Early Detection (RED)* algorithm works by proactively dropping
+  * more and more input packets as the queue occupancy builds up. When the queue
+  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
+  * RED* algorithm uses a separate set of RED thresholds per packet color.
+  */
+enum rte_eth_sched_cman_mode {
+	RTE_ETH_SCHED_CMAN_TAIL_DROP = 0, /**< Tail drop */
+	RTE_ETH_SCHED_CMAN_HEAD_DROP, /**< Head drop */
+	RTE_ETH_SCHED_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
+};
+
+/**
+  * WRED profile
+  */
+struct rte_eth_sched_wred_params {
+	/**< One set of RED parameters per packet color */
+	struct rte_red_params red_params[e_RTE_METER_COLORS];
+};
+
+/**
+  * Shaper (rate limiter) profile
+  *
+  * Multiple shaper instances can share the same shaper profile. Each node can
+  * have multiple shapers enabled (up to RTE_ETH_SCHED_SHAPERS_PER_NODE). Each
+  * shaper can be private to a node (only one node using it) or shared (multiple
+  * nodes use the same shaper instance).
+  */
+struct rte_eth_sched_shaper_params {
+	uint64_t rate; /**< Token bucket rate (bytes per second) */
+	uint64_t size; /**< Token bucket size (bytes) */
+};
+

+/**
+  * Node parameters
+  *
+  * Each scheduler hierarchy node has multiple inputs (children nodes of the
+  * current parent node) and a single output (which is input to its parent
+  * node). The current node arbitrates its inputs using Strict Priority (SP)
+  * and Weighted Fair Queuing (WFQ) algorithms to schedule input packets on its
+  * output while observing its shaping/rate limiting constraints. Algorithms
+  * such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR (DWRR), etc
+  * are considered approximations of the ideal WFQ and are assimilated to WFQ,
+  * although an associated implementation-dependent trade-off on accuracy,
+  * performance and resource usage might exist.
+  *
+  * Children nodes with different priorities are scheduled using the SP
+  * algorithm, based on their priority, with zero (0) as the highest priority.
+  * Children with same priority are scheduled using the WFQ algorithm, based on
+  * their weight, which is relative to the sum of the weights of all siblings
+  * with same priority, with one (1) as the lowest weight.
+  */
+struct rte_eth_sched_node_params {
+	/**< Child node priority (used by SP). The highest priority is zero. */
+	uint32_t priority;
+	/**< Child node weight (used by WFQ), relative to the sum of the weights
+	     of all siblings with same priority. The lowest weight is one. */
+	uint32_t weight;
+	/**< Set of shaper instances enabled for current node. Each node shaper
+	     can be disabled by setting it to RTE_ETH_SCHED_SHAPER_ID_NONE. */
+	uint32_t shaper_id[RTE_ETH_SCHED_SHAPERS_PER_NODE];
+	/**< Set to zero if current node is not a hierarchy leaf node, set to a
+	     non-zero value otherwise. A leaf node is a hierarchy node that does
+	     not have any children. A leaf node has to be connected to a valid
+	     packet queue. */
+	int is_leaf;
+	/**< Parameters valid for leaf nodes only */
+	struct {
+		/**< Packet queue ID */
+		uint64_t queue_id;
+		/**< Congestion management mode */
+		enum rte_eth_sched_cman_mode cman;
+		/**< Set of WRED contexts enabled for current leaf node. Each
+		     leaf node WRED context can be disabled by setting it to
+		     RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE. Only valid when
+		     congestion management for current leaf node is set to WRED. */
+		uint32_t wred_context_id[RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE];
+	} leaf;
+};
+

+/**
+  * Node statistics counter type
+  */
+enum rte_eth_sched_stats_counter {
+	/**< Number of packets scheduled from current node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS = 1 << 0,
+	/**< Number of bytes scheduled from current node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES = 1 << 1,
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
+	/**< Number of packets currently waiting in the packet queue of current
+	     leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
+	/**< Number of bytes currently waiting in the packet queue of current
+	     leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,
+};
+
+/**
+  * Node statistics counters
+  */
+struct rte_eth_sched_node_stats {
+	/**< Number of packets scheduled from current node. */
+	uint64_t n_pkts;
+	/**< Number of bytes scheduled from current node. */
+	uint64_t n_bytes;
+	/**< Statistics counters for leaf nodes only */
+	struct {
+		/**< Number of packets dropped by current leaf node. */
+		uint64_t n_pkts_dropped;
+		/**< Number of bytes dropped by current leaf node. */
+		uint64_t n_bytes_dropped;
+		/**< Number of packets currently waiting in the packet queue of
+		     current leaf node. */
+		uint64_t n_pkts_queued;
+		/**< Number of bytes currently waiting in the packet queue of
+		     current leaf node. */
+		uint64_t n_bytes_queued;
+	} leaf;
+};
+

 /*
  * Definitions of all functions exported by an Ethernet driver through the
  * the generic structure of type *eth_dev_ops* supplied in the *rte_eth_dev*
@@ -1421,6 +1569,120 @@ typedef int (*eth_get_dcb_info)(struct rte_eth_dev *dev,
 				struct rte_eth_dcb_info *dcb_info);
 /**< @internal Get dcb information on an Ethernet device */
 
+typedef int (*eth_sched_wred_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id,
+	struct rte_eth_sched_wred_params *profile);
+/**< @internal Scheduler WRED profile add */
+
+typedef int (*eth_sched_wred_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id);
+/**< @internal Scheduler WRED profile delete */
+
+typedef int (*eth_sched_wred_context_add_t)(struct rte_eth_dev *dev,
+	uint32_t wred_context_id,
+	uint32_t wred_profile_id);
+/**< @internal Scheduler WRED context add */
+
+typedef int (*eth_sched_wred_context_delete_t)(struct rte_eth_dev *dev,
+	uint32_t wred_context_id);
+/**< @internal Scheduler WRED context delete */
+
+typedef int (*eth_sched_shaper_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id,
+	struct rte_eth_sched_shaper_params *profile);
+/**< @internal Scheduler shaper profile add */
+
+typedef int (*eth_sched_shaper_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id);
+/**< @internal Scheduler shaper profile delete */
+
+typedef int (*eth_sched_shaper_add_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_id,
+	uint32_t shaper_profile_id);
+/**< @internal Scheduler shaper instance add */
+
+typedef int (*eth_sched_shaper_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_id);
+/**< @internal Scheduler shaper instance delete */
+
+typedef int (*eth_sched_node_add_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	struct rte_eth_sched_node_params *params);
+/**< @internal Scheduler node add */
+
+typedef int (*eth_sched_node_delete_t)(struct rte_eth_dev *dev,
+	uint32_t node_id);
+/**< @internal Scheduler node delete */
+
+typedef int (*eth_sched_hierarchy_set_t)(struct rte_eth_dev *dev,
+	int clear_on_fail);
+/**< @internal Scheduler hierarchy set */
+
+typedef int (*eth_sched_node_priority_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t priority);
+/**< @internal Scheduler node priority set */
+
+typedef int (*eth_sched_node_weight_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t weight);
+/**< @internal Scheduler node weight set */
+
+typedef int (*eth_sched_node_shaper_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t shaper_pos,
+	uint32_t shaper_id);
+/**< @internal Scheduler node shaper set */
+
+typedef int (*eth_sched_node_queue_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t queue_id);
+/**< @internal Scheduler node queue set */
+
+typedef int (*eth_sched_node_cman_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	enum rte_eth_sched_cman_mode cman);
+/**< @internal Scheduler node congestion management mode set */
+
+typedef int (*eth_sched_node_wred_context_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t wred_context_pos,
+	uint32_t wred_context_id);
+/**< @internal Scheduler node WRED context set */
+
+typedef int (*eth_sched_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask);
+/**< @internal Scheduler get set of stats counters enabled for all nodes */
+
+typedef int (*eth_sched_stats_enable_t)(struct rte_eth_dev *dev,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask);
+/**< @internal Scheduler enable selected stats counters for all nodes */
+
+typedef int (*eth_sched_node_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask);
+/**< @internal Scheduler get set of stats counters enabled for specific node */
+
+typedef int (*eth_sched_node_stats_enable_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask);
+/**< @internal Scheduler enable selected stats counters for specific node */
+
+typedef int (*eth_sched_node_stats_read_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_eth_sched_node_stats *stats,
+	int clear);
+/**< @internal Scheduler read stats counters for specific node */
+
+typedef int (*eth_sched_run_t)(struct rte_eth_dev *dev);
+/**< @internal Scheduler run */
+

 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -1547,6 +1809,53 @@ struct eth_dev_ops {
 	eth_l2_tunnel_eth_type_conf_t l2_tunnel_eth_type_conf;
 	/** Enable/disable l2 tunnel offload functions */
 	eth_l2_tunnel_offload_set_t l2_tunnel_offload_set;
+
+	/** Scheduler WRED profile add */
+	eth_sched_wred_profile_add_t sched_wred_profile_add;
+	/** Scheduler WRED profile delete */
+	eth_sched_wred_profile_delete_t sched_wred_profile_delete;
+	/** Scheduler WRED context add */
+	eth_sched_wred_context_add_t sched_wred_context_add;
+	/** Scheduler WRED context delete */
+	eth_sched_wred_context_delete_t sched_wred_context_delete;
+	/** Scheduler shaper profile add */
+	eth_sched_shaper_profile_add_t sched_shaper_profile_add;
+	/** Scheduler shaper profile delete */
+	eth_sched_shaper_profile_delete_t sched_shaper_profile_delete;
+	/** Scheduler shaper instance add */
+	eth_sched_shaper_add_t sched_shaper_add;
+	/** Scheduler shaper instance delete */
+	eth_sched_shaper_delete_t sched_shaper_delete;
+	/** Scheduler node add */
+	eth_sched_node_add_t sched_node_add;
+	/** Scheduler node delete */
+	eth_sched_node_delete_t sched_node_delete;
+	/** Scheduler hierarchy set */
+	eth_sched_hierarchy_set_t sched_hierarchy_set;
+	/** Scheduler node priority set */
+	eth_sched_node_priority_set_t sched_node_priority_set;
+	/** Scheduler node weight set */
+	eth_sched_node_weight_set_t sched_node_weight_set;
+	/** Scheduler node shaper set */
+	eth_sched_node_shaper_set_t sched_node_shaper_set;
+	/** Scheduler node queue set */
+	eth_sched_node_queue_set_t sched_node_queue_set;
+	/** Scheduler node congestion management mode set */
+	eth_sched_node_cman_set_t sched_node_cman_set;
+	/** Scheduler node WRED context set */
+	eth_sched_node_wred_context_set_t sched_node_wred_context_set;
+	/** Scheduler get statistics counter type enabled for all nodes */
+	eth_sched_stats_get_enabled_t sched_stats_get_enabled;
+	/** Scheduler enable selected statistics counters for all nodes */
+	eth_sched_stats_enable_t sched_stats_enable;
+	/** Scheduler get statistics counter type enabled for current node */
+	eth_sched_node_stats_get_enabled_t sched_node_stats_get_enabled;
+	/** Scheduler enable selected statistics counters for current node */
+	eth_sched_node_stats_enable_t sched_node_stats_enable;
+	/** Scheduler read statistics counters for current node */
+	eth_sched_node_stats_read_t sched_node_stats_read;
+	/** Scheduler run */
+	eth_sched_run_t sched_run;
 };
 
 /**
@@ -4336,6 +4645,491 @@ rte_eth_dev_l2_tunnel_offload_set(uint8_t port_id,
 				 uint8_t en);
 
 /**

+ * Scheduler WRED profile add
+ *
+ * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
+ * is used to create one or several WRED contexts.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   WRED profile parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_profile_add(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_eth_sched_wred_params *profile);
+
+/**
+ * Scheduler WRED profile delete
+ *
+ * Delete an existing WRED profile. This operation fails when there is currently
+ * at least one user (i.e. WRED context) of this WRED profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_profile_delete(uint8_t port_id,
+	uint32_t wred_profile_id);
+

+/**
+ * Scheduler WRED context add or update
+ *
+ * When *wred_context_id* is invalid, a new WRED context with this ID is created
+ * by using the WRED profile identified by *wred_profile_id*.
+ *
+ * When *wred_context_id* is valid, this WRED context is no longer using the
+ * profile previously assigned to it and is updated to use the profile
+ * identified by *wred_profile_id*.
+ *
+ * A valid WRED context is assigned to one or several scheduler hierarchy leaf
+ * nodes configured to use WRED as the congestion management mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_context_id
+ *   WRED context ID
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_context_add(uint8_t port_id,
+	uint32_t wred_context_id,
+	uint32_t wred_profile_id);
+
+/**
+ * Scheduler WRED context delete
+ *
+ * Delete an existing WRED context. This operation fails when there is currently
+ * at least one user (i.e. scheduler hierarchy leaf node) of this WRED context.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_context_id
+ *   WRED context ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_context_delete(uint8_t port_id,
+	uint32_t wred_context_id);
+

+/**
+ * Scheduler shaper profile add
+ *
+ * Create a new shaper profile with ID set to *shaper_profile_id*. The new
+ * shaper profile is used to create one or several shapers.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   Shaper profile parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_profile_add(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_eth_sched_shaper_params *profile);
+
+/**
+ * Scheduler shaper profile delete
+ *
+ * Delete an existing shaper profile. This operation fails when there is
+ * currently at least one user (i.e. shaper) of this shaper profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_profile_delete(uint8_t port_id,
+	uint32_t shaper_profile_id);
+

+/**
+ * Scheduler shaper add or update
+ *
+ * When *shaper_id* is not a valid shaper ID, a new shaper with this ID is
+ * created using the shaper profile identified by *shaper_profile_id*.
+ *
+ * When *shaper_id* is a valid shaper ID, this shaper is no longer using the
+ * shaper profile previously assigned to it and is updated to use the shaper
+ * profile identified by *shaper_profile_id*.
+ *
+ * A valid shaper is assigned to one or several scheduler hierarchy nodes.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_id
+ *   Shaper ID
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_add(uint8_t port_id,
+	uint32_t shaper_id,
+	uint32_t shaper_profile_id);
+
+/**
+ * Scheduler shaper delete
+ *
+ * Delete an existing shaper. This operation fails when there is currently at
+ * least one user (i.e. scheduler hierarchy node) of this shaper.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_id
+ *   Shaper ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_delete(uint8_t port_id,
+	uint32_t shaper_id);
+

+/**
+ * Scheduler node add or remap
+ *
+ * When *node_id* is not a valid node ID, a new node with this ID is created and
+ * connected as child to the existing node identified by *parent_node_id*.
+ *
+ * When *node_id* is a valid node ID, this node is disconnected from its current
+ * parent and connected as child to another existing node identified by
+ * *parent_node_id*.
+ *
+ * This function can be called during port initialization phase (before the
+ * Ethernet port is started) for building the scheduler start-up hierarchy.
+ * Subject to the specific Ethernet port supporting on-the-fly scheduler
+ * hierarchy updates, this function can also be called during run-time (after
+ * the Ethernet port is started).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID
+ * @param parent_node_id
+ *   Parent node ID. Needs to be valid.
+ * @param params
+ *   Node parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_add(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	struct rte_eth_sched_node_params *params);
+
+/**
+ * Scheduler node delete
+ *
+ * Delete an existing node. This operation fails when this node currently has at
+ * least one user (i.e. child node).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_delete(uint8_t port_id,
+	uint32_t node_id);
+

+/**
+ * Scheduler hierarchy set
+ *
+ * This function is called during the port initialization phase (before the
+ * Ethernet port is started) to freeze the scheduler start-up hierarchy.
+ *
+ * This function fails when the currently configured scheduler hierarchy is not
+ * supported by the Ethernet port, in which case the user can abort or try out
+ * another hierarchy configuration (e.g. a hierarchy with fewer leaf nodes),
+ * which can be built from scratch (when *clear_on_fail* is enabled) or by
+ * modifying the existing hierarchy configuration (when *clear_on_fail* is
+ * disabled).
+ *
+ * Note that, even when the configured scheduler hierarchy is supported (so this
+ * function is successful), the Ethernet port start might still fail due to e.g.
+ * not enough memory being available in the system, etc.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param clear_on_fail
+ *   On function call failure, hierarchy is cleared when this parameter is
+ *   non-zero and preserved when this parameter is equal to zero.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_hierarchy_set(uint8_t port_id,
+	int clear_on_fail);
+

+/**
+ * Scheduler node priority set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param priority
+ *   Node priority. The highest node priority is zero. Used by the SP algorithm
+ *   running on the parent of the current node for scheduling this child node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_priority_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t priority);
+
+/**
+ * Scheduler node weight set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param weight
+ *   Node weight. The node weight is relative to the weight sum of all siblings
+ *   that have the same priority. The lowest weight is one. Used by the WFQ
+ *   algorithm running on the parent of the current node for scheduling this
+ *   child node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_weight_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t weight);
+

+/**
+ * Scheduler node shaper set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param shaper_pos
+ *   Position in the shaper array of the current node
+ *   (0 .. RTE_ETH_SCHED_SHAPERS_PER_NODE-1).
+ * @param shaper_id
+ *   Shaper ID. Needs to be either valid shaper ID or set to
+ *   RTE_ETH_SCHED_SHAPER_ID_NONE in order to invalidate the shaper on position
+ *   *shaper_pos* within the current node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_shaper_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shaper_pos,
+	uint32_t shaper_id);
+
+/**
+ * Scheduler node queue set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param queue_id
+ *   Queue ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_queue_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t queue_id);
+
+/**
+ * Scheduler node congestion management mode set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID.
+ * @param cman
+ *   Congestion management mode.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_cman_set(uint8_t port_id,
+	uint32_t node_id,
+	enum rte_eth_sched_cman_mode cman);
+
+/**
+ * Scheduler node WRED context set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID that has WRED selected as the
+ *   congestion management mode.
+ * @param wred_context_pos
+ *   Position in the WRED context array of the current leaf node
+ *   (0 .. RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE-1)
+ * @param wred_context_id
+ *   WRED context ID. Needs to be either valid WRED context ID or set to
+ *   RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE in order to invalidate the WRED context
+ *   on position *wred_context_pos* within the current leaf node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_wred_context_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t wred_context_pos,
+	uint32_t wred_context_id);
+

+/**
+ * Scheduler get statistics counter types enabled for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all non-leaf nodes. Needs
+ *   to be pre-allocated.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled per node for each non-leaf node.
+ *   This is a subset of *nonleaf_node_capability_stats_mask*. Needs to be
+ *   pre-allocated.
+ * @param leaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all leaf nodes. Needs to
+ *   be pre-allocated.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled for each leaf node. This is
+ *   a subset of *leaf_node_capability_stats_mask*. Needs to be pre-allocated.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_stats_get_enabled(uint8_t port_id,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask);
+
+/**
+ * Scheduler enable selected statistics counters for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each non-leaf node.
+ *   This needs to be a subset of the statistics counter types available per
+ *   node for all non-leaf nodes. Any statistics counter type not included in
+ *   this set is to be disabled for all non-leaf nodes.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each leaf node. This
+ *   needs to be a subset of the statistics counter types available per node for
+ *   all leaf nodes. Any statistics counter type not included in this set is to
+ *   be disabled for all leaf nodes.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_stats_enable(uint8_t port_id,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask);
+
+/**
+ * Scheduler get statistics counter types enabled for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param capability_stats_mask
+ *   Statistics counter types available for the current node. Needs to be
+ *   pre-allocated.
+ * @param enabled_stats_mask
+ *   Statistics counter types currently enabled for the current node. This is
+ *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_get_enabled(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask);
+
+/**
+ * Scheduler enable selected statistics counters for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param enabled_stats_mask
+ *   Statistics counter types to be enabled for the current node. This needs to
+ *   be a subset of the statistics counter types available for the current node.
+ *   Any statistics counter type not included in this set is to be disabled for
+ *   the current node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_enable(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask);
+

+/**
+ * Scheduler node statistics counters read
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param stats
+ *   When non-NULL, it contains the current value for the statistics counters
+ *   enabled for the current node.
+ * @param clear
+ *   When this parameter has a non-zero value, the statistics counters are
+ *   cleared (i.e. set to zero) immediately after they have been read, otherwise
+ *   the statistics counters are left untouched.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_read(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_eth_sched_node_stats *stats,
+	int clear);
+

+/**
+ * Scheduler run
+ *
+ * The packet enqueue side of the scheduler hierarchy is typically done through
+ * the Ethernet device TX function. For HW implementations, the packet dequeue
+ * side is typically done by the Ethernet device without any SW intervention,
+ * therefore this function should not do anything.
+ *
+ * However, for poll-mode SW or mixed HW-SW implementations, SW intervention
+ * is likely to be required for running the packet dequeue side of the scheduler
+ * hierarchy. Another potential task performed by this function is the periodic
+ * flush of any packet enqueue-side buffers used by burst-mode implementations.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+static inline int
+rte_eth_sched_run(uint8_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+#endif
+
+	dev = &rte_eth_devices[port_id];
+
+	return (dev->dev_ops->sched_run) ? dev->dev_ops->sched_run(dev) : 0;
+}
+
+/**
  * Get the port id from PCI address or device name
  * Ex: 0000:2:00.0 or vdev name net_pcap0
  *
--
2.5.0
  
Cristian Dumitrescu Dec. 7, 2016, 7:03 p.m. UTC | #4
Hi Steve,

Thanks for your comments!


> This seems to be more of an abstraction of existing QoS.

Really? Why exactly do you say this, any particular examples?

I think the current proposal provides an abstraction for far more features than librte_sched provides. The goal for this API is to be able to describe virtually any hierarchy that could be implemented in HW and/or SW, not just what is currently provided by librte_sched.

If your statement is true, then I failed in my mission, and hopefully I didn't :)


> Why not something like Linux Qdisc or FreeBSD DummyNet/PF/ALTQ where the Qos components are stackable objects? 

After designing Packet Framework, I don't think anybody could suspect me of not being a fan of stackable objects ;). Not sure why you say this either, as basically the current proposal builds the hierarchy out of interconnected nodes sitting on top of shapers and WRED contexts. To me, this is a decent stack?

I don't think this proposal is that far away from Linux qdisc: qdisc classes are nodes, shapers are present, WRED contexts as well. Any particular qdisc feature you see missing?

Of course, Linux qdisc also includes classification, policing, marking, etc which are outside of the hierarchical scheduling that is targeted by this proposal. But this is an interesting thought: designing a qdisc-like layer within DPDK that binds together classification, policing, filters, scheduling.


> And why not make it the same as existing OS abstractions?

Do you mean using the Linux qdisc API and implementation as is? Of course, this is GPL licensed code and we cannot do this in DPDK.

Do you mean having a Linux qdisc-like API? I largely agree with this, and I think the current proposal is very much in line with it; if you think otherwise, again, specific examples of what's missing would help a lot. I can also take a look at DummyNet to make sure there is nothing left behind.


> Rather than reinventing wheel which seems to be DPDK Standard Procedure, could an existing abstraction be used?

I thought we were just trying to create a car instead of a faster horse :)


Regards,
Cristian
  
Cristian Dumitrescu Dec. 7, 2016, 7:52 p.m. UTC | #5
Hi Alan,

Thanks for your comments!


> Hi Cristian,

> Looking at points 10 and 11 it's good to hear nodes can be dynamically added.

Yes, many implementations allow on-the-fly remapping of a node from one parent to another,
or simply adding more nodes post-initialization, so it is natural for the API to provide this.
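
E.g., with made-up IDs, remapping an existing node to a different parent is just another
call to the same function:

	/* Node 57 already exists; reconnect it as a child of node 12
	 * (illustrative IDs only), keeping its parameters in *params*. */
	int status = rte_eth_sched_node_add(port_id, 57, 12, &params);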


> We've been trying to decide the best way to do this for support of qos on tunnels for
> some time now and the existing implementation doesn't allow this so effectively ruled
> out hierarchical queueing for tunnel targets on the output interface.

> Having said that, has thought been given to separating the queueing from being so closely
> tied to the Ethernet transmit process ?   When queueing on a tunnel for example we may
> be working with encryption.   When running with an anti-replay window it is really much
> better to do the QOS (packet reordering) before the encryption.  To support this would
> it be possible to have a separate scheduler structure which can be passed into the
> scheduling API ?  This means the calling code can hang the structure of whatever entity
> it wishes to perform qos on, and we get dynamic target support (sessions/tunnels etc).

Yes, this is one point where we need to look for a better solution. The current proposal attaches
the hierarchical scheduler function to an ethdev, so scheduling traffic for tunnels that have a
pre-defined bandwidth is not supported nicely. This question was also raised in VPP, but
there tunnels are supported as a type of output interface, so attaching scheduling to an
output interface also covers the tunnels case.

It looks to me like nice tunnel abstractions are a gap in DPDK as well. Any thoughts about
how tunnels should be supported in DPDK? What do other people think about this?


> Regarding the structure allocation, would it be possible to make the number of queues
> associated with a TC a compile time option which the scheduler would accommodate ?
> We frequently only use one queue per tc which means 75% of the space allocated at
> the queueing layer for that tc is never used.  This may be specific to our implementation
> but if other implementations do the same if folks could say we may get a better idea
> if this is a common case.

> Whilst touching on the scheduler, the token replenishment works using a division and
> multiplication obviously to cater for the fact that it may be run after several tc windows
> have passed.  The most commonly used industrial scheduler simply does a lapsed on the tc
> and then adds the bc.   This relies on the scheduler being called within the tc window
> though.  It would be nice to have this as a configurable option since it's much more efficient
> assuming the infra code from which it's called can guarantee the calling frequency.

This is probably feedback for librte_sched as opposed to the current API proposal, as the
latter is intended to be generic/implementation-agnostic and therefore its scope far
exceeds the existing set of librte_sched features.

Btw, we do plan on using librte_sched as the default fall-back when the HW
ethdev is not scheduler-enabled, as well as the implementation of choice for a lot of
use-cases where it fits really well, so we do have to continue to evolve and improve
librte_sched feature-wise and performance-wise.


> I hope you'll consider these points for inclusion into a future road map.  Hopefully in the
> future my employer will increase the priority of some of the tasks and a PR may appear
> on the mailing list.

> Thanks,
> Alan.
  
Cristian Dumitrescu Dec. 7, 2016, 8:13 p.m. UTC | #6
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Tuesday, December 6, 2016 10:14 PM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: dev@dpdk.org; Dumitrescu, Cristian <cristian.dumitrescu@intel.com>
> Subject: Re: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical
> scheduler
> 
> 2016-12-06 11:51, Stephen Hemminger:
> > Rather than reinventing wheel which seems to be DPDK Standard
> > Procedure, could an existing abstraction be used?
> 
> Stephen, you know that the DPDK standard procedure is to consider
> reviews and good comments ;)

Thomas, sorry for not CC-ing you (as the librte_ether maintainer) when
the initial RFC was sent out; comments from you would be welcomed and
much appreciated as well!

Also, the same kind reminder goes to other people in the community as well
(CC-ing just a few). Some of you have expressed interest in the past in
building an abstraction layer for QoS hierarchical scheduler, so here is the
RFC for it which we can evolve together. Let's make sure we build a robust
layer that fits well for all devices and vendors that are part of the DPDK
community!
  
Bruce Richardson Dec. 8, 2016, 10:14 a.m. UTC | #7
On Wed, Dec 07, 2016 at 10:58:49AM +0000, Alan Robertson wrote:
> Hi Cristian,
> 
> Looking at points 10 and 11 it's good to hear nodes can be dynamically added.
> 
> We've been trying to decide the best way to do this for support of qos on tunnels for
> some time now and the existing implementation doesn't allow this so effectively ruled
> out hierarchical queueing for tunnel targets on the output interface.
> 
> Having said that, has thought been given to separating the queueing from being so closely
> tied to the Ethernet transmit process ?   When queueing on a tunnel for example we may
> be working with encryption.   When running with an anti-replay window it is really much
> better to do the QOS (packet reordering) before the encryption.  To support this would
> it be possible to have a separate scheduler structure which can be passed into the
> scheduling API ?  This means the calling code can hang the structure of whatever entity
> it wishes to perform qos on, and we get dynamic target support (sessions/tunnels etc).
>
Hi,

just to note that not all ethdevs need to be actual NICs (physical or
virtual). It was also for situations like this that the ring PMD was
created. For the QoS scheduler, the common "output port" type chosen was
the ethdev, to avoid having to support multiple underlying types. To use
a ring instead as the output port, just create a ring and then call
"rte_eth_from_ring" to get an ethdev port wrapper around the ring, and
which you can then use for just about any API that wants an ethdev.
[Note: the rte_eth_from_ring API is in the ring driver itself, so you do
need to link against that driver directly if using shared libs] 
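
For example, a minimal (untested) sketch of that pattern; the ring name and size here
are just placeholders:

	#include <rte_lcore.h>
	#include <rte_ring.h>
	#include <rte_eth_ring.h>

	/* Create a ring and wrap it into an ethdev port; error handling
	 * omitted for brevity. */
	struct rte_ring *r = rte_ring_create("sched_ring", 1024,
			rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
	int port_id = rte_eth_from_ring(r);

	/* port_id can now be passed to any ethdev-based API, including the
	 * proposed rte_eth_sched_* calls; packets sent to it with
	 * rte_eth_tx_burst() end up in the ring for further processing. */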

Regards,
/Bruce
  
Alan Robertson Dec. 8, 2016, 3:41 p.m. UTC | #8
Hi Cristian,

The way QoS works just now should be feasible for dynamic targets.   That is, functions similar
to rte_sched_port_enqueue() and rte_sched_port_dequeue() would be called.  The first to
enqueue the mbufs onto the queues, the second to dequeue.  The QoS structures and scheduler
don't need to be as functionally rich though.  I would have thought a simple pipe with child
nodes should suffice for most.  That would allow each tunnel/session to be shaped and the
queueing and drop logic inherited from what is there just now.

Thanks,
Alan.

  
Cristian Dumitrescu Dec. 8, 2016, 5:18 p.m. UTC | #9
> Hi Cristian,
> 
> The way QoS works just now should be feasible for dynamic targets.   That is,
> functions similar
> to rte_sched_port_enqueue() and rte_sched_port_dequeue() would be
> called.  The first to
> enqueue the mbufs onto the queues the second to dequeue.  The qos
> structures and scheduler
> don't need to be as functionally rich though.  I would have thought a simple
> pipe with child
> nodes should suffice for most.  That would allow each tunnel/session to be
> shaped and the
> queueing and drop logic inherited from what is there just now.
> 
> Thanks,
> Alan.

Hi Alan,

So just to make sure I get this right: you suggest that tunnels/sessions could simply be mapped as one of the layers under the port hierarchy?

Thanks,
Cristian
  
Alan Robertson Dec. 9, 2016, 9:28 a.m. UTC | #10
Hi Cristian,

No, it'll be done as a completely separate scheduling mechanism.  We'd allocate a much smaller
footprint equivalent to a pipe, TCs and queues.   This structure would be completely independent.
It would be up to the calling code to allocate, track and free it so it could be associated with any
target.  The equivalent of the enqueue and dequeue functions would be called wherever it
was required in the data path.  So if we look at an encrypted tunnel:

IP forward -> qos enq/qos deq -> encrypt -> port forward (possibly QoS again at the port)

So each structure would work independently with the assumption that it's called frequently
enough to keep the state machine ticking over.  Pretty much as we do for a PMD scheduler.

Note that if we run the features in the above order, encrypted frames aren't dropped by the
QoS enqueue.  Since encryption is probably the most expensive processing done on a packet, it
should give a big performance gain.
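
In other words, something like this (purely hypothetical names, just to illustrate the
shape of what I'm asking for):

	/* Per-tunnel scheduler object owned by the calling code
	 * (hypothetical API, not part of the RFC). */
	struct qos_sched *s = tunnel->sched;

	qos_sched_enqueue(s, pkts, n_pkts);       /* queue/shape before crypto */
	n = qos_sched_dequeue(s, out, BURST_MAX); /* reordering happens here  */
	encrypt_burst(out, n);                    /* anti-replay stays valid  */
	rte_eth_tx_burst(port_id, 0, out, n);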

Thanks,
Alan.

  
Jerin Jacob Jan. 11, 2017, 1:56 p.m. UTC | #11
On Wed, Nov 30, 2016 at 06:16:50PM +0000, Cristian Dumitrescu wrote:
> This RFC proposes an ethdev-based abstraction layer for Quality of Service (QoS)
> hierarchical scheduler. The goal of the abstraction layer is to provide a simple
> generic API that is agnostic of the underlying HW, SW or mixed HW-SW complex
> implementation.

Thanks Cristian for bringing up this RFC.
This will help in integrating NPUs' QoS hierarchical schedulers into DPDK.

Overall the RFC looks very good as a generic traffic manager. However, as an
NPU HW vendor, we feel we need to expose some of the HW constraints and
HW-specific features in a generic way in this specification, so that it can be
used effectively with HW-based implementations.

I will try to describe the HW constraints and HW features associated with the HW-based
hierarchical scheduler found in Cavium SoCs inline. IMO, if other HW vendors
share their constraints on "hardware-based hierarchical schedulers",
then we could have a realistic HW/SW abstraction for the hierarchical scheduler.

> 
> Q1: What is the benefit for having an abstraction layer for QoS hierarchical
> layer?
> A1: There is growing interest in the industry for handling various HW-based,
> SW-based or mixed hierarchical scheduler implementations using a unified DPDK
> API.

Yes.

> Q4: Why have this abstraction layer into ethdev as opposed to a new type of
> device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
> A4: Packets are sent to the Ethernet device using the ethdev API
> rte_eth_tx_burst() function, with the hierarchical scheduling taking place
> automatically (i.e. no SW intervention) in HW implementations. Basically, the
> hierarchical scheduler is done as part of packet TX operation.
> The hierarchical scheduler is typically the last stage before packet TX and it
> is tightly integrated with the TX stage. The hierarchical scheduler is just
> another offload feature of the Ethernet device, which needs to be accommodated
> by the ethdev API similar to any other offload feature (such as RSS, DCB,
> flow director, etc).
> Once the decision to schedule a specific packet has been taken, this packet
> cannot be dropped and it has to be sent over the wire as is, otherwise what
> takes place on the wire is not what was planned at scheduling time, so the
> scheduling is not accurate (Note: there are some devices which allow prepending
> headers to the packet after the scheduling stage at the expense of sending
> correction requests back to the scheduler, but this only strengthens the bond
> between scheduling and TX).

Makes sense.

> 
> Q5: Given that the packet scheduling takes place automatically for pure HW
> implementations, how does packet scheduling take place for poll-mode SW
> implementations?
> A5: The API provided function rte_sched_run() is designed to take care of this.
> For HW implementations, this function typically does nothing. For SW
> implementations, this function is typically expected to perform dequeue of
> packets from the hierarchical scheduler and their write to Ethernet device TX
> queue, periodic flush of any buffers on enqueue-side into the hierarchical
> scheduler for burst-oriented implementations, etc.
>

Yes. In addition to that, if rte_sched_run() does nothing (for HW implementations)
then the _application_ should not call it at all. I think we need to introduce
a "service core" concept in DPDK to make this fully transparent from an application
perspective.

> Q6: Which are the scheduling algorithms supported?
> A6: The fundamental scheduling algorithms that are supported are Strict Priority
> (SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
> the level of each node of the scheduling hierarchy, regardless of the node
> level/position in the tree. The SP algorithm is used to schedule between sibling
> nodes with different priority, while WFQ is used to schedule between groups of
> siblings that have the same priority.
> Algorithms such as Weighed Round Robin (WRR), byte-level WRR, Deficit WRR
> (DWRR), etc are considered approximations of the ideal WFQ and are therefore
> assimilated to WFQ, although an associated implementation-dependent accuracy,
> performance and resource usage trade-off might exist.

Makes sense.

> 
> Q7: Which are the supported congestion management algorithms?
> A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
> available for every leaf node in the hierarchy, subject to the specific
> implementation supporting them.

We don't support tail drop, head drop or WRED per leaf node in the hierarchy;
instead, in some sense, it is integrated into the HW mempool block at the ingress.
So maybe we can have some sort of capability or info API that gives the
application the big picture of the scheduler capabilities, instead of the
application trying individual resource APIs from the spec.

We do have support for querying the available free entries in a leaf queue to
figure out the load. But it may not be worth starting a service core (rte_sched_run())
to implement the spec, due to the multicore communication overhead.
Instead, using the HW base support (a means to get the
depth of a leaf queue), the application/library can do congestion management.

Thoughts?

Does any other HW vendor support egress congestion management in HW?

> 
> Q8: Is traffic shaping supported?
> A8: Yes, there are a number of shapers (rate limiters) that can be supported for
> each node in the hierarchy (built-in limit is currently set to 4 per node). Each
> shaper can be private to a node (used only by that node) or shared between
> multiple nodes.

Makes sense. We have dual-rate shapers (very similar to RFC 2697 and RFC 2698)
at all the nodes (obviously, only a single rate at the last node, the one closest
to the physical port). Just to understand: when we say 4 shapers per node, is that
four different rate limiters per node? Is there any RFC for a four-rate limiter,
like there is for single-rate (RFC 2697) and dual-rate (RFC 2698)?

> 
> Q9: What is the purpose of having shaper profiles and WRED profiles?
> A9: In most implementations, many shapers typically share the same configuration
> parameters, so defining shaper profiles simplifies the configuration task. Same
> considerations apply to WRED contexts and profiles.

Makes sense.

> Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
> A11: Yes. The actual changes take place subject to the specific implementation
> supporting them, otherwise error code is returned.

On-the-fly scheduling hierarchy changes are tricky in HW implementations and come with
a lot of constraints. Returning an error code is fine, but we need to define what
it takes to reconfigure the hierarchy if on-the-fly reconfiguring is not supported.

The high-level constraints for reconfiguring the hierarchy in our HW are:
1) Stop adding additional packets to the leaf nodes
2) Wait for the packets to drain out from the nodes.

Point (2) is internal to the implementation, so we can manage it.
For point (1), I guess the application may need to know the constraint.

> 
> Q13: Which are the possible options for the user when the Ethernet port does not
> support the scheduling hierarchy required by the user?
> A13: The following options are available to the user:
> i) abort
> ii) try out a new hierarchy (e.g. with less leaf nodes), if acceptable

As mentioned earlier, an additional API to get the capabilities would help here.
Some of the capabilities we believe would be useful for applications:

1) maximum number of levels,
2) maximum nodes per level,
3) whether congestion management is supported,
4) maximum priority per node.

At the very least this would be useful for writing the example application.
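
Something along these lines (hypothetical, names invented for illustration):

	struct rte_eth_sched_capabilities {
		uint32_t n_levels_max;          /**< Max number of hierarchy levels */
		uint32_t n_nodes_per_level_max; /**< Max nodes per level */
		uint32_t n_priorities_max;      /**< Max priority per node */
		int cman_supported;             /**< Congestion management available? */
	};

	int rte_eth_sched_capabilities_get(uint8_t port_id,
		struct rte_eth_sched_capabilities *capabilities);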


> iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
> front-end implementing the hierarchical scheduler (e.g. existing DPDK library
> librte_sched); instantiate the new device type on-the-fly and check if the
> hierarchy requirements can be met by the new device.

Do we want to wrap it in a new Ethernet device or let the application use the software
library directly? If it is the former, then:

Are we planning for a generic SW-based driver for this, so that NICs that don't have
HW support can just reuse the SW driver, instead of duplicating the code in
all the PMD drivers?

> 
> 
> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
> ---
>  lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 794 insertions(+)
>  mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h
> 
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> old mode 100644
> new mode 100755
> index 9678179..d4d8604
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -182,6 +182,8 @@ extern "C" {
>  #include <rte_pci.h>
>  #include <rte_dev.h>
>  #include <rte_devargs.h>
> +#include <rte_meter.h>
> +#include <rte_red.h>

[snip]

> +
> +enum rte_eth_sched_stats_counter {
> +	/**< Number of packets scheduled from current node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS = 1<< 0,
> +	/**< Number of bytes scheduled from current node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES = 1 << 1,
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
> +	/**< Number of packets currently waiting in the packet queue of current
> +	     leaf node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
> +	/**< Number of bytes currently waiting in the packet queue of current
> +	     leaf node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,

Some of the other counters seen in HW implementations, from the shaper (rate limiter), are:

RED_PACKETS,
RED_BYTES,
YELLOW_PACKETS,
YELLOW_BYTES,
GREEN_PACKETS,
GREEN_BYTES


> +};
> +
> +/**
> +  * Node statistics counters
> +  */
> +struct rte_eth_sched_node_stats {
> +	/**< Number of packets scheduled from current node. */
> +	uint64_t n_pkts;
> +	/**< Number of bytes scheduled from current node. */
> +	uint64_t n_bytes;
> +	/**< Statistics counters for leaf nodes only */

We don't have support for the stats on all nodes. Since you have
rte_eth_sched_node_stats_get_enabled(), we are good.

> +	struct {
> +		/**< Number of packets dropped by current leaf node. */
> +		uint64_t n_pkts_dropped;
> +		/**< Number of bytes dropped by current leaf node. */
> +		uint64_t n_bytes_dropped;
> +		/**< Number of packets currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_pkts_queued;
> +		/**< Number of bytes currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_bytes_queued;
> +	} leaf;

Leaf stats look good to us.

> +};
> +
>  /**
> + * Scheduler WRED profile add
> + *
> + * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
> + * is used to create one or several WRED contexts.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param wred_profile_id
> + *   WRED profile ID for the new profile. Needs to be unused.
> + * @param profile
> + *   WRED profile parameters. Needs to be pre-allocated and valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_eth_sched_wred_profile_add(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_eth_sched_wred_params *profile);

How about returning wred_profile_id from the driver? Looks like that is the easier
way to manage from the driver perspective (the driver can pass the same handle for similar
profiles and use an opaque number for embedding some other information),
and it is kind of the norm.
i.e.
int rte_eth_sched_wred_profile_add(uint8_t port_id,
		struct rte_eth_sched_wred_params *profile);


> +/**
> + * Scheduler node add or remap
> + *
> + * When *node_id* is not a valid node ID, a new node with this ID is created and
> + * connected as child to the existing node identified by *parent_node_id*.
> + *
> + * When *node_id* is a valid node ID, this node is disconnected from its current
> + * parent and connected as child to another existing node identified by
> + * *parent_node_id *.
> + *
> + * This function can be called during port initialization phase (before the
> + * Ethernet port is started) for building the scheduler start-up hierarchy.
> + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> + * hierarchy updates, this function can also be called during run-time (after
> + * the Ethernet port is started).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID
> + * @param parent_node_id
> + *   Parent node ID. Needs to be the valid.
> + * @param params
> + *   Node parameters. Needs to be pre-allocated and valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.

IMO, we need an explicit error number to differentiate the configuration
error caused by the Ethernet port having already been started.
And on receiving such an error code, we need to define the procedure
to reconfigure the topology.

The recent rte_flow spec has own error codes to get more visibility on the failure,
so that application can choose better attributes for configuring.
For example, some of those limitations in our HW are:

1) priorities are from 0 to 9 (error type: PRIORITY_NOT_SUPPORTED)
2) DWRR is applicable only for one set of priorities per child-to-parent
connection. Example, valid case: 0-1-1-1-2-3. Invalid case: 0-1-1-1-3-2-(2).
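
Something like the following could capture such constraints (hypothetical,
mirroring the rte_flow error style):

	enum rte_eth_sched_error_type {
		RTE_ETH_SCHED_ERROR_NONE = 0,
		RTE_ETH_SCHED_ERROR_PORT_STARTED,
		RTE_ETH_SCHED_ERROR_PRIORITY_NOT_SUPPORTED,
		RTE_ETH_SCHED_ERROR_WEIGHT_NOT_SUPPORTED,
	};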


> + */
> +int rte_eth_sched_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	struct rte_eth_sched_node_params *params);
> +
> +/**
> + * 
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param queue_id
> + *   Queue ID. Needs to be valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_eth_sched_node_queue_set(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t queue_id);
> +

In HW-based implementations, leaf node id == tx_queue_id, as hierarchical
scheduling is tightly coupled with the tx_queues (i.e. the leaf nodes). Do we need
such a translation, like specifying "queue_id" in struct rte_eth_sched_node_params,
given that tx_queues are expressed as 0..n? How about making the leaf node id the
same? There is no such translation in HW, so it may be difficult to implement.
Do we really need this translation?

Other points:

HW can't understand any SW marking schemes applied at the ingress classification
level. For us, at leaf node level all packets are in color-unaware mode with the
input color set to green (aka color-blind mode). On the subsequent levels, HW
adds color metadata to the packet based on the shapers.

With the above scheme, we have a few features where we need to figure out how to
abstract them in a generic way, based on the SW implementation or other HW
vendors' constraints.

1) If the last-level color meta is YELLOW, HW can mark (write) 3 bits in the packet.
This is useful for sharing the color info across two different systems (like
updating the IP diffserv bits).

2) The need for an additional shaping param called _adjust_.
Typically the conditioning and scheduling algorithm is measured in bytes of
IP packets per second. We have a _signed_ adjust (-255 to 255) field
(other HW implementations appear to have it as well) to express packet length
with reference to the L2 length: a positive value to include the L1 header
(typically 20B of Ethernet preamble and Inter-Frame Gap),
and a negative value to remove the L2 + VLAN header and take only the IP length, etc.
/Jerin
Cavium
  
Hemant Agrawal Jan. 13, 2017, 10:36 a.m. UTC | #12
On 11/30/2016 11:46 PM, Cristian Dumitrescu wrote:
> This RFC proposes an ethdev-based abstraction layer for Quality of Service (QoS)
> hierarchical scheduler. The goal of the abstraction layer is to provide a simple
> generic API that is agnostic of the underlying HW, SW or mixed HW-SW complex
> implementation.
>
> Q1: What is the benefit for having an abstraction layer for QoS hierarchical
> layer?
> A1: There is growing interest in the industry for handling various HW-based,
> SW-based or mixed hierarchical scheduler implementations using a unified DPDK
> API.
>
> Q2: Which devices are targeted by this abstraction layer?
> A2: All current and future devices that expose a hierarchical scheduler feature
> under DPDK, including NICs, FPGAs, ASICs, SOCs, SW libraries.
>
> Q3: Which scheduler hierarchies are supported by the API?
> A3: Hopefully any scheduler hierarchy can be described and covered by the
> current API. Of course, functional correctness, accuracy and performance levels
> depend on the specific implementations of this API.
>
> Q4: Why have this abstraction layer into ethdev as opposed to a new type of
> device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
> A4: Packets are sent to the Ethernet device using the ethdev API
> rte_eth_tx_burst() function, with the hierarchical scheduling taking place
> automatically (i.e. no SW intervention) in HW implementations. Basically, the
> hierarchical scheduler is done as part of packet TX operation.
> The hierarchical scheduler is typically the last stage before packet TX and it
> is tightly integrated with the TX stage. The hierarchical scheduler is just
> another offload feature of the Ethernet device, which needs to be accommodated
> by the ethdev API similar to any other offload feature (such as RSS, DCB,
> flow director, etc).
> Once the decision to schedule a specific packet has been taken, this packet
> cannot be dropped and it has to be sent over the wire as is, otherwise what
> takes place on the wire is not what was planned at scheduling time, so the
> scheduling is not accurate (Note: there are some devices which allow prepending
> headers to the packet after the scheduling stage at the expense of sending
> correction requests back to the scheduler, but this only strengthens the bond
> between scheduling and TX).
>
Egress QoS can be applied to a physical or a logical network device.
At present, network devices are presented as ethdev in DPDK. Even a
logical device can be presented by creating a new ethdev. So it
seems to be a good idea to associate it with ethdev.


> Q5: Given that the packet scheduling takes place automatically for pure HW
> implementations, how does packet scheduling take place for poll-mode SW
> implementations?
> A5: The API provided function rte_sched_run() is designed to take care of this.
> For HW implementations, this function typically does nothing. For SW
> implementations, this function is typically expected to perform dequeue of
> packets from the hierarchical scheduler and their write to Ethernet device TX
> queue, periodic flush of any buffers on enqueue-side into the hierarchical
> scheduler for burst-oriented implementations, etc.
>

I think this is *rte_eth_sched_run* in your APIs.

It will be a no-op for HW; how do you envision its usage in typical
software? E.g. in the l3fwd application:
  - calling it every time you do rte_eth_tx_burst() - there may be a locking
concern here;
  - creating a per-port thread that keeps calling rte_eth_sched_run() (see the
sketch after this list);
  - calling it in one of the existing polling threads for a port.
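
A rough sketch of the per-port thread option (assuming the final API keeps
rte_eth_sched_run() as a per-port call):

	/* Dedicated lcore loop driving the SW scheduler for one port. */
	static volatile int run = 1;

	static int
	sched_loop(void *arg)
	{
		uint8_t port_id = *(uint8_t *)arg;

		while (run)
			rte_eth_sched_run(port_id);
		return 0;
	}

	/* launched with rte_eal_remote_launch(sched_loop, &port_id, lcore_id); */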


> Q6: Which are the scheduling algorithms supported?
> A6: The fundamental scheduling algorithms that are supported are Strict Priority
> (SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
> the level of each node of the scheduling hierarchy, regardless of the node
> level/position in the tree. The SP algorithm is used to schedule between sibling
> nodes with different priority, while WFQ is used to schedule between groups of
> siblings that have the same priority.
> Algorithms such as Weighed Round Robin (WRR), byte-level WRR, Deficit WRR
> (DWRR), etc are considered approximations of the ideal WFQ and are therefore
> assimilated to WFQ, although an associated implementation-dependent accuracy,
> performance and resource usage trade-off might exist.
>
> Q7: Which are the supported congestion management algorithms?
> A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
> available for every leaf node in the hierarchy, subject to the specific
> implementation supporting them.
>
We may need to introduce some kind of capability APIs, e.g. NXP HW does not
support head drop.

> Q8: Is traffic shaping supported?
> A8: Yes, there are a number of shapers (rate limiters) that can be supported for
> each node in the hierarchy (built-in limit is currently set to 4 per node). Each
> shaper can be private to a node (used only by that node) or shared between
> multiple nodes.
>

What do you mean by supporting 4 shapers per node? If you need more
shapers, then create new hierarchy nodes.
Also, similarly, if a shaper is to be shared between two nodes, then shouldn't it
be in the parent node?
Why would you want to create a shaper hierarchy within a node of the hierarchical
QoS?


> Q9: What is the purpose of having shaper profiles and WRED profiles?
> A9: In most implementations, many shapers typically share the same configuration
> parameters, so defining shaper profiles simplifies the configuration task. Same
> considerations apply to WRED contexts and profiles.
>

Agree

> Q10: How is the scheduling hierarchy defined and created?
> A10: Scheduler hierarchy tree is set up by creating new nodes and connecting
> them to other existing nodes, which thus become parent nodes. The unique ID that
> is assigned to each node when the node is created is further used to update the
> node configuration or to connect children nodes to it. The leaf nodes of the
> scheduler hierarchy are each attached to one of the Ethernet device TX queues.

It may be cleaner to differentiate between a leaf (i.e. a qos_queue) and a
scheduling node.

> Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
> A11: Yes. The actual changes take place subject to the specific implementation
> supporting them, otherwise error code is returned.

What kind of changes are you seeing here? Creating new nodes/levels?
Reconnecting a node from one parent node to another?

This is more like an implementation capability.


> Q12: What is the typical function call sequence to set up and run the Ethernet
> device scheduler?
> A12: The typical simplified function call sequence is listed below:
> i) Configure the Ethernet device and its TX queues: rte_eth_dev_configure(),
> rte_eth_tx_queue_setup()
> ii) Create WRED profiles and WRED contexts, shaper profiles and shapers:
> rte_eth_sched_wred_profile_add(), rte_eth_sched_wred_context_add(),
> rte_eth_sched_shaper_profile_add(), rte_eth_sched_shaper_add()
> iii) Create the scheduler hierarchy nodes and tree: rte_eth_sched_node_add()
> iv) Freeze the start-up hierarchy and ask the device whether it supports it:
> rte_eth_sched_node_add()
> v) Start the Ethernet port: rte_eth_dev_start()
> vi) Run-time scheduler hierarchy updates: rte_eth_sched_node_add(),
> rte_eth_sched_node_<attribute>_set()
> vii) Run-time packet enqueue into the hierarchical scheduler: rte_eth_tx_burst()
> viii) Run-time support for SW poll-mode implementations (see previous answer):
> rte_sched_run()
>
> Q13: Which are the possible options for the user when the Ethernet port does not
> support the scheduling hierarchy required by the user?
> A13: The following options are available to the user:
> i) abort
> ii) try out a new hierarchy (e.g. with less leaf nodes), if acceptable
> iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
> front-end implementing the hierarchical scheduler (e.g. existing DPDK library
> librte_sched); instantiate the new device type on-the-fly and check if the
> hierarchy requirements can be met by the new device.
>
>

I would like to see some kind of capability API upfront:

1. Number of levels supported
2. Per-level capability (the capability of each level may be different)
3. Number of nodes supported at a given level
4. Max number of input nodes supported
5. Type of scheduling algo supported (SP, WFQ etc)
6. Shaper support - dual rate
7. Congestion control
8. Max priorities


> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
> ---
>  lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 794 insertions(+)
>  mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h
>
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> old mode 100644
> new mode 100755
> index 9678179..d4d8604
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -182,6 +182,8 @@ extern "C" {
>  #include <rte_pci.h>
>  #include <rte_dev.h>
>  #include <rte_devargs.h>
> +#include <rte_meter.h>
> +#include <rte_red.h>
>  #include "rte_ether.h"
>  #include "rte_eth_ctrl.h"
>  #include "rte_dev_info.h"
> @@ -1038,6 +1040,152 @@ TAILQ_HEAD(rte_eth_dev_cb_list, rte_eth_dev_callback);
>  /**< l2 tunnel forwarding mask */
>  #define ETH_L2_TUNNEL_FORWARDING_MASK   0x00000008
>
> +/**
> + * Scheduler configuration
> + */
> +
> +/**< Max number of shapers per node */
> +#define RTE_ETH_SCHED_SHAPERS_PER_NODE                     4
> +/**< Invalid shaper ID */
> +#define RTE_ETH_SCHED_SHAPER_ID_NONE                       UINT32_MAX
> +/**< Max number of WRED contexts per node */
> +#define RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE               4
> +/**< Invalid WRED context ID */
> +#define RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE                 UINT32_MAX
> +/**< Invalid node ID */
> +#define RTE_ETH_SCHED_NODE_NULL                            UINT32_MAX
> +
> +/**
> +  * Congestion management (CMAN) mode
> +  *
> +  * This is used for controlling the admission of packets into a packet queue or
> +  * group of packet queues on congestion. On request of writing a new packet
> +  * into the current queue while the queue is full, the *tail drop* algorithm
> +  * drops the new packet while leaving the queue unmodified, as opposed to *head
> +  * drop* algorithm, which drops the packet at the head of the queue (the oldest
> +  * packet waiting in the queue) and admits the new packet at the tail of the
> +  * queue.
> +  *
> +  * The *Random Early Detection (RED)* algorithm works by proactively dropping
> +  * more and more input packets as the queue occupancy builds up. When the queue
> +  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
> +  * RED* algorithm uses a separate set of RED thresholds per packet color.
> +  */
> +enum rte_eth_sched_cman_mode {
> +	RTE_ETH_SCHED_CMAN_TAIL_DROP = 0, /**< Tail drop */
> +	RTE_ETH_SCHED_CMAN_HEAD_DROP, /**< Head drop */
> +	RTE_ETH_SCHED_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
> +};
> +

you may also need a parameter indicating whether the CMAN is byte-based or frame-based.

> +/**
> +  * WRED profile
> +  */
> +struct rte_eth_sched_wred_params {
> +	/**< One set of RED parameters per packet color */
> +	struct rte_red_params red_params[e_RTE_METER_COLORS];
> +};
> +
> +/**
> +  * Shaper (rate limiter) profile
> +  *
> +  * Multiple shaper instances can share the same shaper profile. Each node can
> +  * have multiple shapers enabled (up to RTE_ETH_SCHED_SHAPERS_PER_NODE). Each
> +  * shaper can be private to a node (only one node using it) or shared (multiple
> +  * nodes use the same shaper instance).
> +  */
> +struct rte_eth_sched_shaper_params {
> +	uint64_t rate; /**< Token bucket rate (bytes per second) */
> +	uint64_t size; /**< Token bucket size (bytes) */
> +};
> +

A dual-rate shaper could be supported here; a possible shape for the params is
sketched after this comment.

I guess by size you mean the max burst size?
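
E.g. (a hypothetical extension of the proposed struct, RFC 2698 style):

	struct rte_eth_sched_shaper_params {
		uint64_t committed_rate; /**< CIR (bytes per second) */
		uint64_t committed_size; /**< CBS (bytes) */
		uint64_t peak_rate;      /**< PIR (bytes per second) */
		uint64_t peak_size;      /**< PBS (bytes) */
	};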


> +/**
> +  * Node parameters
> +  *
> +  * Each scheduler hierarchy node has multiple inputs (children nodes of the
> +  * current parent node) and a single output (which is input to its parent
> +  * node). The current node arbitrates its inputs using Strict Priority (SP)
> +  * and Weighted Fair Queuing (WFQ) algorithms to schedule input packets on its
> +  * output while observing its shaping/rate limiting constraints.  Algorithms
> +  * such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR (DWRR), etc
> +  * are considered approximations of the ideal WFQ and are assimilated to WFQ,
> +  * although an associated implementation-dependent trade-off on accuracy,
> +  * performance and resource usage might exist.
> +  *
> +  * Children nodes with different priorities are scheduled using the SP
> +  * algorithm, based on their priority, with zero (0) as the highest priority.
> +  * Children with same priority are scheduled using the WFQ algorithm, based on
> +  * their weight, which is relative to the sum of the weights of all siblings
> +  * with same priority, with one (1) as the lowest weight.
> +  */
> +struct rte_eth_sched_node_params {
> +	/**< Child node priority (used by SP). The highest priority is zero. */
> +	uint32_t priority;
> +	/**< Child node weight (used by WFQ), relative to some of weights of all
> +	     siblings with same priority). The lowest weight is one. */
> +	uint32_t weight;
> +	/**< Set of shaper instances enabled for current node. Each node shaper
> +	     can be disabled by setting it to RTE_ETH_SCHED_SHAPER_ID_NONE. */
> +	uint32_t shaper_id[RTE_ETH_SCHED_SHAPERS_PER_NODE];
> +	/**< Set to zero if current node is not a hierarchy leaf node, set to a
> +	     non-zero value otherwise. A leaf node is a hierarchy node that does
> +	     not have any children. A leaf node has to be connected to a valid
> +	     packet queue. */
> +	int is_leaf;
> +	/**< Parameters valid for leaf nodes only */
> +	struct {
> +		/**< Packet queue ID */
> +		uint64_t queue_id;
> +		/**< Congestion management mode */
> +		enum rte_eth_sched_cman_mode cman;
> +		/**< Set of WRED contexts enabled for current leaf node. Each
> +		     leaf node WRED context can be disabled by setting it to
> +		     RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE. Only valid when
> +		     congestion management for current leaf node is set to WRED. */
> +		uint32_t wred_context_id[RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE];
> +	} leaf;
> +};
> +

It will be better to separate the leaf (i.e. a qos_queue) from a sched
node; it will simplify things.

e.g.

struct rte_eth_sched_qos_queue {
	/**< Queue priority (used by SP). The highest priority is zero. */
	uint32_t priority;
	/**< Queue weight (used by WFQ), relative to the sum of weights of
	     all siblings with same priority. The lowest weight is one. */
	uint32_t weight;

	/**< Packet queue ID */
	uint64_t queue_id;

	/**< Congestion management params */
	enum rte_eth_sched_cman_mode cman;
};

struct rte_eth_sched_node_params {
	/**< Child node priority (used by SP). The highest priority is zero. */
	uint32_t priority;
	/**< Child node weight (used by WFQ), relative to the sum of weights of
	     all siblings with same priority. The lowest weight is one. */
	uint32_t weight;
	/**< Shaper instance enabled for the current node. The node shaper
	     can be disabled by setting it to RTE_ETH_SCHED_SHAPER_ID_NONE. */
	uint32_t shaper_id;

	/**< WRED context enabled for the current node. The WRED context can
	     be disabled by setting it to RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE.
	     Only valid when congestion management for the current node is set
	     to WRED. */
	uint32_t wred_context_id;
};

qos_queue(s) will be connected to a scheduler node.


> +/**
> +  * Node statistics counter type
> +  */
> +enum rte_eth_sched_stats_counter {
> +	/**< Number of packets scheduled from current node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS = 1<< 0,
> +	/**< Number of bytes scheduled from current node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES = 1 << 1,
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
> +	/**< Number of packets currently waiting in the packet queue of current
> +	     leaf node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
> +	/**< Number of bytes currently waiting in the packet queue of current
> +	     leaf node. */
> +	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,
> +};
> +
> +/**
> +  * Node statistics counters
> +  */
> +struct rte_eth_sched_node_stats {
> +	/**< Number of packets scheduled from current node. */
> +	uint64_t n_pkts;
> +	/**< Number of bytes scheduled from current node. */
> +	uint64_t n_bytes;
> +	/**< Statistics counters for leaf nodes only */
> +	struct {
> +		/**< Number of packets dropped by current leaf node. */
> +		uint64_t n_pkts_dropped;
> +		/**< Number of bytes dropped by current leaf node. */
> +		uint64_t n_bytes_dropped;
> +		/**< Number of packets currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_pkts_queued;
> +		/**< Number of bytes currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_bytes_queued;
> +	} leaf;
> +};
> +
>  /*
>   * Definitions of all functions exported by an Ethernet driver through the
>   * the generic structure of type *eth_dev_ops* supplied in the *rte_eth_dev*
> @@ -1421,6 +1569,120 @@ typedef int (*eth_get_dcb_info)(struct rte_eth_dev *dev,
>  				 struct rte_eth_dcb_info *dcb_info);
>  /**< @internal Get dcb information on an Ethernet device */
>
> +typedef int (*eth_sched_wred_profile_add_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_profile_id,
> +	struct rte_eth_sched_wred_params *profile);
> +/**< @internal Scheduler WRED profile add */
> +
> +typedef int (*eth_sched_wred_profile_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_profile_id);
> +/**< @internal Scheduler WRED profile delete */
> +
> +typedef int (*eth_sched_wred_context_add_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_context_id,
> +	uint32_t wred_profile_id);
> +/**< @internal Scheduler WRED context add */
> +
> +typedef int (*eth_sched_wred_context_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_context_id);
> +/**< @internal Scheduler WRED context delete */
> +
> +typedef int (*eth_sched_shaper_profile_add_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_profile_id,
> +	struct rte_eth_sched_shaper_params *profile);
> +/**< @internal Scheduler shaper profile add */
> +
> +typedef int (*eth_sched_shaper_profile_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_profile_id);
> +/**< @internal Scheduler shaper profile delete */
> +
> +typedef int (*eth_sched_shaper_add_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_id,
> +	uint32_t shaper_profile_id);
> +/**< @internal Scheduler shaper instance add */
> +
> +typedef int (*eth_sched_shaper_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_id);
> +/**< @internal Scheduler shaper instance delete */
> +
> +typedef int (*eth_sched_node_add_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	struct rte_eth_sched_node_params *params);
> +/**< @internal Scheduler node add */
> +
> +typedef int (*eth_sched_node_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id);
> +/**< @internal Scheduler node delete */
> +
> +typedef int (*eth_sched_hierarchy_set_t)(struct rte_eth_dev *dev,
> +	int clear_on_fail);
> +/**< @internal Scheduler hierarchy set */
> +
> +typedef int (*eth_sched_node_priority_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t priority);
> +/**< @internal Scheduler node priority set */
> +
> +typedef int (*eth_sched_node_weight_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t weight);
> +/**< @internal Scheduler node weight set */
> +
> +typedef int (*eth_sched_node_shaper_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t shaper_pos,
> +	uint32_t shaper_id);
> +/**< @internal Scheduler node shaper set */
> +
> +typedef int (*eth_sched_node_queue_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t queue_id);
> +/**< @internal Scheduler node queue set */
> +
> +typedef int (*eth_sched_node_cman_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	enum rte_eth_sched_cman_mode cman);
> +/**< @internal Scheduler node congestion management mode set */
> +
> +typedef int (*eth_sched_node_wred_context_set_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t wred_context_pos,
> +	uint32_t wred_context_id);
> +/**< @internal Scheduler node WRED context set */
> +
> +typedef int (*eth_sched_stats_get_enabled_t)(struct rte_eth_dev *dev,
> +	uint64_t *nonleaf_node_capability_stats_mask,
> +	uint64_t *nonleaf_node_enabled_stats_mask,
> +	uint64_t *leaf_node_capability_stats_mask,
> +	uint64_t *leaf_node_enabled_stats_mask);
> +/**< @internal Scheduler get set of stats counters enabled for all nodes */
> +
> +typedef int (*eth_sched_stats_enable_t)(struct rte_eth_dev *dev,
> +	uint64_t nonleaf_node_enabled_stats_mask,
> +	uint64_t leaf_node_enabled_stats_mask);
> +/**< @internal Scheduler enable selected stats counters for all nodes */
> +
> +typedef int (*eth_sched_node_stats_get_enabled_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
Patch

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
old mode 100644
new mode 100755
index 9678179..d4d8604
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -182,6 +182,8 @@  extern "C" {
 #include <rte_pci.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
+#include <rte_meter.h>
+#include <rte_red.h>
 #include "rte_ether.h"
 #include "rte_eth_ctrl.h"
 #include "rte_dev_info.h"
@@ -1038,6 +1040,152 @@  TAILQ_HEAD(rte_eth_dev_cb_list, rte_eth_dev_callback);
 /**< l2 tunnel forwarding mask */
 #define ETH_L2_TUNNEL_FORWARDING_MASK   0x00000008
 
+/**
+ * Scheduler configuration
+ */
+
+/** Max number of shapers per node */
+#define RTE_ETH_SCHED_SHAPERS_PER_NODE                     4
+/** Invalid shaper ID */
+#define RTE_ETH_SCHED_SHAPER_ID_NONE                       UINT32_MAX
+/** Max number of WRED contexts per node */
+#define RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE               4
+/** Invalid WRED context ID */
+#define RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE                 UINT32_MAX
+/** Invalid node ID */
+#define RTE_ETH_SCHED_NODE_NULL                            UINT32_MAX
+
+/**
+  * Congestion management (CMAN) mode
+  *
+  * This is used for controlling the admission of packets into a packet queue or
+  * group of packet queues on congestion. When a new packet is written to a
+  * queue that is already full, the *tail drop* algorithm drops the new packet
+  * and leaves the queue unmodified, whereas the *head drop* algorithm drops
+  * the packet at the head of the queue (the oldest packet waiting in the
+  * queue) and admits the new packet at the tail of the queue.
+  *
+  * The *Random Early Detection (RED)* algorithm works by proactively dropping
+  * more and more input packets as the queue occupancy builds up. When the queue
+  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
+  * RED* algorithm uses a separate set of RED thresholds per packet color.
+  */
+enum rte_eth_sched_cman_mode {
+	RTE_ETH_SCHED_CMAN_TAIL_DROP = 0, /**< Tail drop */
+	RTE_ETH_SCHED_CMAN_HEAD_DROP, /**< Head drop */
+	RTE_ETH_SCHED_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
+};
+
+/**
+  * WRED profile
+  */
+struct rte_eth_sched_wred_params {
+	/** One set of RED parameters per packet color */
+	struct rte_red_params red_params[e_RTE_METER_COLORS];
+};
+
+/**
+  * Shaper (rate limiter) profile
+  *
+  * Multiple shaper instances can share the same shaper profile. Each node can
+  * have multiple shapers enabled (up to RTE_ETH_SCHED_SHAPERS_PER_NODE). Each
+  * shaper can be private to a node (only one node using it) or shared (multiple
+  * nodes use the same shaper instance).
+  */
+struct rte_eth_sched_shaper_params {
+	uint64_t rate; /**< Token bucket rate (bytes per second) */
+	uint64_t size; /**< Token bucket size (bytes) */
+};
+
+/**
+  * Node parameters
+  *
+  * Each scheduler hierarchy node has multiple inputs (its child nodes) and a
+  * single output (which is an input to its parent node). The current node
+  * arbitrates its inputs using Strict Priority (SP)
+  * and Weighted Fair Queuing (WFQ) algorithms to schedule input packets on its
+  * output while observing its shaping/rate limiting constraints.  Algorithms
+  * such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR (DWRR), etc
+  * are considered approximations of the ideal WFQ and are assimilated to WFQ,
+  * although an associated implementation-dependent trade-off on accuracy,
+  * performance and resource usage might exist.
+  *
+  * Child nodes with different priorities are scheduled using the SP
+  * algorithm, based on their priority, with zero (0) as the highest priority.
+  * Child nodes with the same priority are scheduled using the WFQ algorithm,
+  * based on their weight, which is relative to the sum of the weights of all
+  * siblings with the same priority, with one (1) as the lowest weight.
+  */
+struct rte_eth_sched_node_params {
+	/** Child node priority (used by SP). The highest priority is zero. */
+	uint32_t priority;
+	/** Child node weight (used by WFQ), relative to the sum of the weights
+	    of all siblings with the same priority. The lowest weight is one. */
+	uint32_t weight;
+	/** Set of shaper instances enabled for current node. Each node shaper
+	    can be disabled by setting it to RTE_ETH_SCHED_SHAPER_ID_NONE. */
+	uint32_t shaper_id[RTE_ETH_SCHED_SHAPERS_PER_NODE];
+	/** Set to zero if current node is not a hierarchy leaf node, set to a
+	    non-zero value otherwise. A leaf node is a hierarchy node that does
+	    not have any children. A leaf node has to be connected to a valid
+	    packet queue. */
+	int is_leaf;
+	/** Parameters valid for leaf nodes only */
+	struct {
+		/** Packet queue ID */
+		uint32_t queue_id;
+		/** Congestion management mode */
+		enum rte_eth_sched_cman_mode cman;
+		/** Set of WRED contexts enabled for current leaf node. Each
+		    leaf node WRED context can be disabled by setting it to
+		    RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE. Only valid when
+		    congestion management for current leaf node is set to WRED. */
+		uint32_t wred_context_id[RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE];
+	} leaf;
+};
+
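
For illustration only (not part of the patch), a minimal sketch of filling in
the node parameters for a leaf node that feeds TX queue 0, with tail drop and
all node shapers disabled; every value below is a placeholder:

	struct rte_eth_sched_node_params leaf_params = {
		.priority = 0,	/* highest SP priority */
		.weight = 1,	/* lowest WFQ weight */
		.shaper_id = {
			RTE_ETH_SCHED_SHAPER_ID_NONE,
			RTE_ETH_SCHED_SHAPER_ID_NONE,
			RTE_ETH_SCHED_SHAPER_ID_NONE,
			RTE_ETH_SCHED_SHAPER_ID_NONE,
		},
		.is_leaf = 1,
		.leaf = {
			.queue_id = 0,	/* TX queue fed by this leaf node */
			.cman = RTE_ETH_SCHED_CMAN_TAIL_DROP,
		},
	};
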
+/**
+  * Node statistics counter type
+  */
+enum rte_eth_sched_stats_counter {
+	/** Number of packets scheduled from current node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS = 1 << 0,
+	/** Number of bytes scheduled from current node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES = 1 << 1,
+	/** Number of packets dropped by current leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
+	/** Number of bytes dropped by current leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
+	/** Number of packets currently waiting in the packet queue of current
+	    leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
+	/** Number of bytes currently waiting in the packet queue of current
+	    leaf node. */
+	RTE_ETH_SCHED_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,
+};
+
+/**
+  * Node statistics counters
+  */
+struct rte_eth_sched_node_stats {
+	/** Number of packets scheduled from current node. */
+	uint64_t n_pkts;
+	/** Number of bytes scheduled from current node. */
+	uint64_t n_bytes;
+	/** Statistics counters for leaf nodes only */
+	struct {
+		/** Number of packets dropped by current leaf node. */
+		uint64_t n_pkts_dropped;
+		/** Number of bytes dropped by current leaf node. */
+		uint64_t n_bytes_dropped;
+		/** Number of packets currently waiting in the packet queue of
+		    current leaf node. */
+		uint64_t n_pkts_queued;
+		/** Number of bytes currently waiting in the packet queue of
+		    current leaf node. */
+		uint64_t n_bytes_queued;
+	} leaf;
+};
+
 /*
  * Definitions of all functions exported by an Ethernet driver through the
  * the generic structure of type *eth_dev_ops* supplied in the *rte_eth_dev*
@@ -1421,6 +1569,120 @@  typedef int (*eth_get_dcb_info)(struct rte_eth_dev *dev,
 				 struct rte_eth_dcb_info *dcb_info);
 /**< @internal Get dcb information on an Ethernet device */
 
+typedef int (*eth_sched_wred_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id,
+	struct rte_eth_sched_wred_params *profile);
+/**< @internal Scheduler WRED profile add */
+
+typedef int (*eth_sched_wred_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id);
+/**< @internal Scheduler WRED profile delete */
+
+typedef int (*eth_sched_wred_context_add_t)(struct rte_eth_dev *dev,
+	uint32_t wred_context_id,
+	uint32_t wred_profile_id);
+/**< @internal Scheduler WRED context add */
+
+typedef int (*eth_sched_wred_context_delete_t)(struct rte_eth_dev *dev,
+	uint32_t wred_context_id);
+/**< @internal Scheduler WRED context delete */
+
+typedef int (*eth_sched_shaper_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id,
+	struct rte_eth_sched_shaper_params *profile);
+/**< @internal Scheduler shaper profile add */
+
+typedef int (*eth_sched_shaper_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id);
+/**< @internal Scheduler shaper profile delete */
+
+typedef int (*eth_sched_shaper_add_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_id,
+	uint32_t shaper_profile_id);
+/**< @internal Scheduler shaper instance add */
+
+typedef int (*eth_sched_shaper_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_id);
+/**< @internal Scheduler shaper instance delete */
+
+typedef int (*eth_sched_node_add_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	struct rte_eth_sched_node_params *params);
+/**< @internal Scheduler node add */
+
+typedef int (*eth_sched_node_delete_t)(struct rte_eth_dev *dev,
+	uint32_t node_id);
+/**< @internal Scheduler node delete */
+
+typedef int (*eth_sched_hierarchy_set_t)(struct rte_eth_dev *dev,
+	int clear_on_fail);
+/**< @internal Scheduler hierarchy set */
+
+typedef int (*eth_sched_node_priority_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t priority);
+/**< @internal Scheduler node priority set */
+
+typedef int (*eth_sched_node_weight_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t weight);
+/**< @internal Scheduler node weight set */
+
+typedef int (*eth_sched_node_shaper_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t shaper_pos,
+	uint32_t shaper_id);
+/**< @internal Scheduler node shaper set */
+
+typedef int (*eth_sched_node_queue_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t queue_id);
+/**< @internal Scheduler node queue set */
+
+typedef int (*eth_sched_node_cman_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	enum rte_eth_sched_cman_mode cman);
+/**< @internal Scheduler node congestion management mode set */
+
+typedef int (*eth_sched_node_wred_context_set_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t wred_context_pos,
+	uint32_t wred_context_id);
+/**< @internal Scheduler node WRED context set */
+
+typedef int (*eth_sched_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask);
+/**< @internal Scheduler get set of stats counters enabled for all nodes */
+
+typedef int (*eth_sched_stats_enable_t)(struct rte_eth_dev *dev,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask);
+/**< @internal Scheduler enable selected stats counters for all nodes */
+
+typedef int (*eth_sched_node_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask);
+/**< @internal Scheduler get set of stats counters enabled for specific node */
+
+typedef int (*eth_sched_node_stats_enable_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask);
+/**< @internal Scheduler enable selected stats counters for specific node */
+
+typedef int (*eth_sched_node_stats_read_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_eth_sched_node_stats *stats,
+	int clear);
+/**< @internal Scheduler read stats counters for specific node */
+
+typedef int (*eth_sched_run_t)(struct rte_eth_dev *dev);
+/**< @internal Scheduler run */
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -1547,6 +1809,53 @@  struct eth_dev_ops {
 	eth_l2_tunnel_eth_type_conf_t l2_tunnel_eth_type_conf;
 	/** Enable/disable l2 tunnel offload functions */
 	eth_l2_tunnel_offload_set_t l2_tunnel_offload_set;
+
+	/** Scheduler WRED profile add */
+	eth_sched_wred_profile_add_t sched_wred_profile_add;
+	/** Scheduler WRED profile delete */
+	eth_sched_wred_profile_delete_t sched_wred_profile_delete;
+	/** Scheduler WRED context add */
+	eth_sched_wred_context_add_t sched_wred_context_add;
+	/** Scheduler WRED context delete */
+	eth_sched_wred_context_delete_t sched_wred_context_delete;
+	/** Scheduler shaper profile add */
+	eth_sched_shaper_profile_add_t sched_shaper_profile_add;
+	/** Scheduler shaper profile delete */
+	eth_sched_shaper_profile_delete_t sched_shaper_profile_delete;
+	/** Scheduler shaper instance add */
+	eth_sched_shaper_add_t sched_shaper_add;
+	/** Scheduler shaper instance delete */
+	eth_sched_shaper_delete_t sched_shaper_delete;
+	/** Scheduler node add */
+	eth_sched_node_add_t sched_node_add;
+	/** Scheduler node delete */
+	eth_sched_node_delete_t sched_node_delete;
+	/** Scheduler hierarchy set */
+	eth_sched_hierarchy_set_t sched_hierarchy_set;
+	/** Scheduler node priority set */
+	eth_sched_node_priority_set_t sched_node_priority_set;
+	/** Scheduler node weight set */
+	eth_sched_node_weight_set_t sched_node_weight_set;
+	/** Scheduler node shaper set */
+	eth_sched_node_shaper_set_t sched_node_shaper_set;
+	/** Scheduler node queue set */
+	eth_sched_node_queue_set_t sched_node_queue_set;
+	/** Scheduler node congestion management mode set */
+	eth_sched_node_cman_set_t sched_node_cman_set;
+	/** Scheduler node WRED context set */
+	eth_sched_node_wred_context_set_t sched_node_wred_context_set;
+	/** Scheduler get statistics counter type enabled for all nodes */
+	eth_sched_stats_get_enabled_t sched_stats_get_enabled;
+	/** Scheduler enable selected statistics counters for all nodes */
+	eth_sched_stats_enable_t sched_stats_enable;
+	/** Scheduler get statistics counter type enabled for current node */
+	eth_sched_node_stats_get_enabled_t sched_node_stats_get_enabled;
+	/** Scheduler enable selected statistics counters for current node */
+	eth_sched_node_stats_enable_t sched_node_stats_enable;
+	/** Scheduler read statistics counters for current node */
+	eth_sched_node_stats_read_t sched_node_stats_read;
+	/** Scheduler run */
+	eth_sched_run_t sched_run;
 };
 
 /**
@@ -4336,6 +4645,491 @@  rte_eth_dev_l2_tunnel_offload_set(uint8_t port_id,
 				  uint8_t en);
 
 /**
+ * Scheduler WRED profile add
+ *
+ * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
+ * is used to create one or several WRED contexts.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   WRED profile parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_profile_add(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_eth_sched_wred_params *profile);
+
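
For illustration only (not part of the patch), a sketch of creating WRED
profile 0 with one RED curve per packet color; the thresholds are placeholders
and *port_id* stands for an already configured Ethernet port:

	struct rte_eth_sched_wred_params wp = {
		.red_params = {
			/* per color: min_th, max_th, maxp_inv, wq_log2 */
			[e_RTE_METER_GREEN]  = { .min_th = 48, .max_th = 64,
						 .maxp_inv = 10, .wq_log2 = 9 },
			[e_RTE_METER_YELLOW] = { .min_th = 40, .max_th = 64,
						 .maxp_inv = 10, .wq_log2 = 9 },
			[e_RTE_METER_RED]    = { .min_th = 32, .max_th = 64,
						 .maxp_inv = 10, .wq_log2 = 9 },
		},
	};
	int ret = rte_eth_sched_wred_profile_add(port_id, 0, &wp);
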
+/**
+ * Scheduler WRED profile delete
+ *
+ * Delete an existing WRED profile. This operation fails when there is currently
+ * at least one user (i.e. WRED context) of this WRED profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_profile_delete(uint8_t port_id,
+	uint32_t wred_profile_id);
+
+/**
+ * Scheduler WRED context add or update
+ *
+ * When *wred_context_id* is not a valid WRED context ID, a new WRED context
+ * with this ID is created using the WRED profile identified by
+ * *wred_profile_id*.
+ *
+ * When *wred_context_id* is a valid WRED context ID, this WRED context is
+ * detached from the profile previously assigned to it and is updated to use
+ * the profile identified by *wred_profile_id*.
+ *
+ * A valid WRED context is assigned to one or several scheduler hierarchy leaf
+ * nodes configured to use WRED as the congestion management mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_context_id
+ *   WRED context ID
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_context_add(uint8_t port_id,
+	uint32_t wred_context_id,
+	uint32_t wred_profile_id);
+
+/**
+ * Scheduler WRED context delete
+ *
+ * Delete an existing WRED context. This operation fails when there is currently
+ * at least one user (i.e. scheduler hierarchy leaf node) of this WRED context.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_context_id
+ *   WRED context ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_wred_context_delete(uint8_t port_id,
+	uint32_t wred_context_id);
+
+/**
+ * Scheduler shaper profile add
+ *
+ * Create a new shaper profile with ID set to *shaper_profile_id*. The new
+ * shaper profile is used to create one or several shapers.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   Shaper profile parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_profile_add(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_eth_sched_shaper_params *profile);
+
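
For illustration only (not part of the patch), a sketch of adding a 100 Mbps
shaper profile and instantiating one shaper from it; the IDs are placeholders:

	struct rte_eth_sched_shaper_params sp = {
		.rate = 100000000 / 8,	/* 100 Mbps, in bytes per second */
		.size = 4096,		/* token bucket depth, in bytes */
	};
	int ret = rte_eth_sched_shaper_profile_add(port_id, 0, &sp);
	if (ret == 0)
		ret = rte_eth_sched_shaper_add(port_id, 0, 0);
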
+/**
+ * Scheduler shaper profile delete
+ *
+ * Delete an existing shaper profile. This operation fails when there is
+ * currently at least one user (i.e. shaper) of this shaper profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_profile_delete(uint8_t port_id,
+	uint32_t shaper_profile_id);
+
+/**
+ * Scheduler shaper add or update
+ *
+ * When *shaper_id* is not a valid shaper ID, a new shaper with this ID is
+ * created using the shaper profile identified by *shaper_profile_id*.
+ *
+ * When *shaper_id* is a valid shaper ID, this shaper is no longer using the
+ * shaper profile previously assigned to it and is updated to use the shaper
+ * profile identified by *shaper_profile_id*.
+ *
+ * A valid shaper is assigned to one or several scheduler hierarchy nodes.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_id
+ *   Shaper ID
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_add(uint8_t port_id,
+	uint32_t shaper_id,
+	uint32_t shaper_profile_id);
+
+/**
+ * Scheduler shaper delete
+ *
+ * Delete an existing shaper. This operation fails when there is currently at
+ * least one user (i.e. scheduler hierarchy node) of this shaper.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_id
+ *   Shaper ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_shaper_delete(uint8_t port_id,
+	uint32_t shaper_id);
+
+/**
+ * Scheduler node add or remap
+ *
+ * When *node_id* is not a valid node ID, a new node with this ID is created and
+ * connected as child to the existing node identified by *parent_node_id*.
+ *
+ * When *node_id* is a valid node ID, this node is disconnected from its current
+ * parent and connected as child to another existing node identified by
+ * *parent_node_id*.
+ *
+ * This function can be called during port initialization phase (before the
+ * Ethernet port is started) for building the scheduler start-up hierarchy.
+ * Subject to the specific Ethernet port supporting on-the-fly scheduler
+ * hierarchy updates, this function can also be called during run-time (after
+ * the Ethernet port is started).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID
+ * @param parent_node_id
+ *   Parent node ID. Needs to be valid.
+ * @param params
+ *   Node parameters. Needs to be pre-allocated and valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_add(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	struct rte_eth_sched_node_params *params);
+
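
For illustration only (not part of the patch), a sketch of building a
two-level hierarchy: a root node plus one leaf child, reusing the *leaf_params*
sketch shown earlier. The RFC does not state which parent ID designates the
root node, so passing RTE_ETH_SCHED_NODE_NULL for it is an assumption here:

	struct rte_eth_sched_node_params root_params = {
		.weight = 1,
		.shaper_id = { RTE_ETH_SCHED_SHAPER_ID_NONE,
			       RTE_ETH_SCHED_SHAPER_ID_NONE,
			       RTE_ETH_SCHED_SHAPER_ID_NONE,
			       RTE_ETH_SCHED_SHAPER_ID_NONE },
		.is_leaf = 0,
	};
	/* Root node (ID 0); the parent ID for the root is assumed here */
	int ret = rte_eth_sched_node_add(port_id, 0, RTE_ETH_SCHED_NODE_NULL,
		&root_params);
	/* Leaf node (ID 1) connected under the root */
	if (ret == 0)
		ret = rte_eth_sched_node_add(port_id, 1, 0, &leaf_params);
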
+/**
+ * Scheduler node delete
+ *
+ * Delete an existing node. This operation fails when this node currently has at
+ * least one user (i.e. child node).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_delete(uint8_t port_id,
+	uint32_t node_id);
+
+/**
+ * Scheduler hierarchy set
+ *
+ * This function is called during the port initialization phase (before the
+ * Ethernet port is started) to freeze the scheduler start-up hierarchy.
+ *
+ * This function fails when the currently configured scheduler hierarchy is not
+ * supported by the Ethernet port, in which case the user can abort or try out
+ * another hierarchy configuration (e.g. a hierarchy with fewer leaf nodes),
+ * which can be built from scratch (when *clear_on_fail* is enabled) or by
+ * modifying the existing hierarchy configuration (when *clear_on_fail* is
+ * disabled).
+ *
+ * Note that, even when the configured scheduler hierarchy is supported (so this
+ * function is successful), the Ethernet port start might still fail for other
+ * reasons, e.g. not enough memory being available in the system.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param clear_on_fail
+ *   On function call failure, the hierarchy is cleared when this parameter is
+ *   non-zero and preserved when this parameter is equal to zero.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_hierarchy_set(uint8_t port_id,
+	int clear_on_fail);
+
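
For illustration only (not part of the patch), a sketch of the fallback flow
described above; build_smaller_hierarchy() is a hypothetical application
helper, not part of this API:

	/* Try to freeze the full hierarchy; with clear_on_fail non-zero, a
	 * rejected hierarchy is wiped, so rebuild a smaller one and retry. */
	int ret = rte_eth_sched_hierarchy_set(port_id, 1);
	if (ret != 0) {
		build_smaller_hierarchy(port_id);	/* hypothetical helper */
		ret = rte_eth_sched_hierarchy_set(port_id, 1);
	}
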
+/**
+ * Scheduler node priority set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param priority
+ *   Node priority. The highest node priority is zero. Used by the SP algorithm
+ *   running on the parent of the current node for scheduling this child node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_priority_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t priority);
+
+/**
+ * Scheduler node weight set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param weight
+ *   Node weight. The node weight is relative to the weight sum of all siblings
+ *   that have the same priority. The lowest weight is one. Used by the WFQ
+ *   algorithm running on the parent of the current node for scheduling this
+ *   child node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_weight_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t weight);
+
+/**
+ * Scheduler node shaper set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param shaper_pos
+ *   Position in the shaper array of the current node
+ *   (0 .. RTE_ETH_SCHED_SHAPERS_PER_NODE-1).
+ * @param shaper_id
+ *   Shaper ID. Needs to be either valid shaper ID or set to
+ *   RTE_ETH_SCHED_SHAPER_ID_NONE in order to invalidate the shaper on position
+ *   *shaper_pos* within the current node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_shaper_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shaper_pos,
+	uint32_t shaper_id);
+
+/**
+ * Scheduler node queue set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param queue_id
+ *   Queue ID. Needs to be valid.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_queue_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t queue_id);
+
+/**
+ * Scheduler node congestion management mode set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID.
+ * @param cman
+ *   Congestion management mode.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_cman_set(uint8_t port_id,
+	uint32_t node_id,
+	enum rte_eth_sched_cman_mode cman);
+
+/**
+ * Scheduler node WRED context set
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID that has WRED selected as the
+ *   congestion management mode.
+ * @param wred_context_pos
+ *   Position in the WRED context array of the current leaf node
+ *   (0 .. RTE_ETH_SCHED_WRED_CONTEXTS_PER_NODE-1)
+ * @param wred_context_id
+ *   WRED context ID. Needs to be either valid WRED context ID or set to
+ *   RTE_ETH_SCHED_WRED_CONTEXT_ID_NONE in order to invalidate the WRED context
+ *   on position *wred_context_pos* within the current leaf node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_wred_context_set(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t wred_context_pos,
+	uint32_t wred_context_id);
+
+/**
+ * Scheduler get statistics counter types enabled for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all non-leaf nodes. Needs
+ *   to be pre-allocated.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled per node for each non-leaf node.
+ *   This is a subset of *nonleaf_node_capability_stats_mask*. Needs to be
+ *   pre-allocated.
+ * @param leaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all leaf nodes. Needs to
+ *   be pre-allocated.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled for each leaf node. This is
+ *   a subset of *leaf_node_capability_stats_mask*. Needs to be pre-allocated.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_stats_get_enabled(uint8_t port_id,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask);
+
+/**
+ * Scheduler enable selected statistics counters for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each non-leaf node.
+ *   This needs to be a subset of the statistics counter types available per
+ *   node for all non-leaf nodes. Any statistics counter type not included in
+ *   this set is to be disabled for all non-leaf nodes.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each leaf node. This
+ *   needs to be a subset of the statistics counter types available per node for
+ *   all leaf nodes. Any statistics counter type not included in this set is to
+ *   be disabled for all leaf nodes.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_stats_enable(uint8_t port_id,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask);
+
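
For illustration only (not part of the patch), a sketch of composing the two
masks; it assumes the device actually advertises these counter types via
rte_eth_sched_stats_get_enabled():

	/* Packets/bytes on every node; drops and occupancy on leaves only */
	uint64_t nonleaf_mask = RTE_ETH_SCHED_STATS_COUNTER_N_PKTS |
		RTE_ETH_SCHED_STATS_COUNTER_N_BYTES;
	uint64_t leaf_mask = nonleaf_mask |
		RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_DROPPED |
		RTE_ETH_SCHED_STATS_COUNTER_N_PKTS_QUEUED;
	int ret = rte_eth_sched_stats_enable(port_id, nonleaf_mask, leaf_mask);
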
+/**
+ * Scheduler get statistics counter types enabled for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param capability_stats_mask
+ *   Statistics counter types available for the current node. Needs to be
+ *   pre-allocated.
+ * @param enabled_stats_mask
+ *   Statistics counter types currently enabled for the current node. This is
+ *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_get_enabled(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask);
+
+/**
+ * Scheduler enable selected statistics counters for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param enabled_stats_mask
+ *   Statistics counter types to be enabled for the current node. This needs to
+ *   be a subset of the statistics counter types available for the current node.
+ *   Any statistics counter type not included in this set is to be disabled for
+ *   the current node.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_enable(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask);
+
+/**
+ * Scheduler node statistics counters read
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param stats
+ *   When non-NULL, it contains the current value for the statistics counters
+ *   enabled for the current node.
+ * @param clear
+ *   When this parameter has a non-zero value, the statistics counters are
+ *   cleared (i.e. set to zero) immediately after they have been read, otherwise
+ *   the statistics counters are left untouched.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_eth_sched_node_stats_read(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_eth_sched_node_stats *stats,
+	int clear);
+
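
For illustration only (not part of the patch), a sketch of a poll-and-clear
readout for a leaf node; *leaf_node_id* is assumed to identify a valid leaf
node, and printf()/PRIu64 require <stdio.h> and <inttypes.h>:

	struct rte_eth_sched_node_stats stats;

	/* clear = 1: counters restart from zero after this read */
	if (rte_eth_sched_node_stats_read(port_id, leaf_node_id, &stats, 1) == 0)
		printf("pkts=%" PRIu64 " drops=%" PRIu64 " queued=%" PRIu64 "\n",
		       stats.n_pkts, stats.leaf.n_pkts_dropped,
		       stats.leaf.n_pkts_queued);
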
+/**
+ * Scheduler run
+ *
+ * The packet enqueue side of the scheduler hierarchy is typically done through
+ * the Ethernet device TX function. For HW implementations, the packet dequeue
+ * side is typically done by the Ethernet device without any SW intervention,
+ * therefore this function should do nothing.
+ *
+ * However, for poll-mode SW or mixed HW-SW implementations, the SW intervention
+ * is likely to be required for running the packet dequeue side of the scheduler
+ * hierarchy. Another potential task performed by this function is the
+ * periodic flush of any packet enqueue-side buffers used by burst-mode
+ * implementations.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+static inline int
+rte_eth_sched_run(uint8_t port_id)
+{
+	struct rte_eth_dev *dev;
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+#endif
+
+	dev = &rte_eth_devices[port_id];
+
+	return (dev->dev_ops->sched_run) ? dev->dev_ops->sched_run(dev) : 0;
+}
+
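
For illustration only (not part of the patch), a sketch of a TX service loop
for a poll-mode SW scheduler implementation; *pkts*, *nb_pkts* and *queue_id*
stand for application-provided transmit state:

	/* Enqueue side: packets enter the hierarchy through regular TX */
	uint16_t n_sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
	/* ... handle the (nb_pkts - n_sent) packets not accepted ... */

	/* Dequeue side: a no-op on pure HW schedulers, but required here to
	 * move packets from the SW hierarchy to the Ethernet device */
	rte_eth_sched_run(port_id);
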
+/**
 * Get the port id from PCI address or device name
 * Ex: 0000:2:00.0 or vdev name net_pcap0
 *