[dpdk-dev,2/2] ethdev: add hierarchical scheduler API

Message ID 1486735550-149878-3-git-send-email-cristian.dumitrescu@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon

Checks

Context                Check    Description
ci/checkpatch          warning  coding style issues
ci/Intel compilation   fail     Compilation issues

Commit Message

Cristian Dumitrescu Feb. 10, 2017, 2:05 p.m. UTC
This patch introduces the generic ethdev API for the hierarchical scheduler
capability.

Main features:
- Exposed as ethdev plugin capability (similar to rte_flow approach)
- Capability query API per port and per hierarchy node
- Scheduling algorithms: strict priority (SP), Weighted Fair Queuing (WFQ),
  Weighted Round Robin (WRR)
- Traffic shaping: single/dual rate, private (per node) and shared (by multiple
  nodes) shapers
- Congestion management for hierarchy leaf nodes: algorithms of tail drop,
  head drop, WRED; private (per node) and shared (by multiple nodes) WRED
  contexts
- Packet marking: IEEE 802.1q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for
  TCP and SCTP), IETF RFC 2597 (IPv4 / IPv6 DSCP)

Changes since RFC [1]:
- Implemented as ethdev plugin (similar to rte_flow) as opposed to more
  monolithic additions to ethdev itself
- Implemented feedback from Jerin [2] and Hemant [3]. Implemented all the
  suggested items with only one exception (see the long list below;
  hopefully nothing was forgotten).
    - The item not done (hopefully for a good reason): driver-generated object
      IDs. IMO the choice to have application-generated object IDs adds marginal
      complexity to the driver (search ID function required), but it provides
      huge simplification for the application. The app does not need to worry
      about building & managing a tree-like structure for storing
      driver-generated object IDs; the app can use its own convention for node
      IDs depending on the specific hierarchy that it needs. Trivial example:
      identify all level-2 nodes with IDs like 100, 200, 300, …, the level-3
      nodes based on their level-2 parents: 110, 120, 130, 140, …, 210, 220,
      230, 240, …, 310, 320, 330, …, and the level-4 nodes based on their
      level-3 parents (111, 112, 113, 114, …, 121, 122, 123, 124, …).
      Moreover, see the change log for the other related simplification that
      was implemented: leaf nodes now have predefined IDs that are the same as
      their Ethernet TX queue IDs (therefore no translation is required for
      leaf nodes).
- Capability API. Done per port and per node as well.
- Dual rate shapers
- Added configuration of private shaper (per node) directly from the shaper
  profile as part of node API (no shaper ID needed for private shapers), while
  the shared shapers are configured outside of the node API using shaper profile
  and communicated to the node using shared shaper ID. So there is no
  configuration overhead for shared shapers if the app does not use any of them.
- Leaf nodes now have predefined IDs that are the same as their Ethernet TX
  queue IDs (therefore no translation is required for leaf nodes). This is
  also used to differentiate between a leaf node and a non-leaf node.
- Domain-specific errors to give a precise indication of the error cause (same
  as done by rte_flow)
- Packet marking API
- Optional packet length adjustment for shapers: positive (e.g. for adding
  the 20-byte Ethernet framing overhead) or negative (e.g. for rate limiting
  based on IP packet bytes)
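A minimal sketch, not part of the patch, of the decimal node ID convention described above. The helpers hyp_node_id() and hyp_parent_id() are hypothetical; each index is assumed to be in 1..9 (0 meaning "level not used"), and the parent of a level-2 node is assumed to be the root, whose ID in the patch is RTE_SCHEDDEV_ROOT_NODE_ID (UINT32_MAX).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build an application-chosen node ID with the
 * decimal convention from the commit message, e.g.
 *   hyp_node_id(1, 0, 0) == 100   (level-2 node)
 *   hyp_node_id(1, 2, 0) == 120   (level-3 child of node 100)
 *   hyp_node_id(1, 2, 3) == 123   (level-4 child of node 120)
 * Each index is limited to 1..9, i.e. at most 9 children per node. */
static uint32_t hyp_node_id(uint32_t l2, uint32_t l3, uint32_t l4)
{
	assert(l2 >= 1 && l2 <= 9 && l3 <= 9 && l4 <= 9);
	assert(l3 != 0 || l4 == 0); /* a level-4 index needs a level-3 parent */
	return l2 * 100 + l3 * 10 + l4;
}

/* The parent ID is obtained by clearing the lowest non-zero digit. */
static uint32_t hyp_parent_id(uint32_t node_id)
{
	if (node_id % 10 != 0)
		return node_id - node_id % 10;   /* 123 -> 120 */
	if (node_id % 100 != 0)
		return node_id - node_id % 100;  /* 120 -> 100 */
	return UINT32_MAX; /* level-2 node: parent is the root (assumption) */
}
```

Because these IDs start at 100, they do not collide with the predefined leaf IDs 0 .. N-1 as long as the port has fewer than 100 TX queues.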

Next steps:
- SW fallback based on librte_sched library (to be later introduced by
  standalone patch set)

[1] RFC: http://dpdk.org/ml/archives/dev/2016-November/050956.html
[2] Jerin’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054484.html
[3] Hemant’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054866.html

Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
---
 MAINTAINERS                            |    4 +
 lib/librte_ether/Makefile              |    5 +-
 lib/librte_ether/rte_ether_version.map |   30 +
 lib/librte_ether/rte_scheddev.c        |  790 ++++++++++++++++++++
 lib/librte_ether/rte_scheddev.h        | 1273 ++++++++++++++++++++++++++++++++
 lib/librte_ether/rte_scheddev_driver.h |  374 ++++++++++
 6 files changed, 2475 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_ether/rte_scheddev.c
 create mode 100644 lib/librte_ether/rte_scheddev.h
 create mode 100644 lib/librte_ether/rte_scheddev_driver.h
  

Comments

Hemant Agrawal Feb. 21, 2017, 10:35 a.m. UTC | #1
On 2/10/2017 7:35 PM, Cristian Dumitrescu wrote:

...<snip>

> +
> +#ifndef __INCLUDE_RTE_SCHEDDEV_H__
> +#define __INCLUDE_RTE_SCHEDDEV_H__
> +
> +/**
> + * @file
> + * RTE Generic Hierarchical Scheduler API
> + *
> + * This interface provides the ability to configure the hierarchical scheduler
> + * feature in a generic way.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_red.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/** Ethernet framing overhead
> +  *
> +  * Overhead fields per Ethernet frame:
> +  * 1. Preamble:                                            7 bytes;
> +  * 2. Start of Frame Delimiter (SFD):                      1 byte;
> +  * 3. Inter-Frame Gap (IFG):                              12 bytes.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD                  20
> +
> +/**
> +  * Ethernet framing overhead plus Frame Check Sequence (FCS). Useful when FCS
> +  * is generated and added at the end of the Ethernet frame on TX side without
> +  * any SW intervention.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS              24
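A worked example of what these constants mean for a shaper (an illustration, not part of the patch). When the NIC appends the FCS, a minimum-size 64-byte Ethernet frame is 60 bytes as seen by software and occupies 60 + 24 = 84 bytes on the wire; at 10 Gbit/s (1.25e9 bytes per second, the unit the patch uses for shaper rates) that caps the rate at about 14.88 million packets per second. The HYP_* macros are local stand-ins for the patch macros, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the values of the patch macros. */
#define HYP_ETH_FRAMING_OVERHEAD      20  /* preamble 7 + SFD 1 + IFG 12 */
#define HYP_ETH_FRAMING_OVERHEAD_FCS  24  /* the above plus the 4-byte FCS */

/* Length charged to the shaper for one packet: the length seen by software
 * plus the signed adjustment (the role of pkt_length_adjust in the patch). */
static uint64_t hyp_shaped_len(uint32_t pkt_len, int32_t adjust)
{
	int64_t len = (int64_t)pkt_len + adjust;

	assert(len > 0);
	return (uint64_t)len;
}

/* Highest packet rate that a shaper of rate_bytes_per_sec admits for a
 * stream of equal-size packets. */
static uint64_t hyp_max_pps(uint64_t rate_bytes_per_sec, uint32_t pkt_len,
	int32_t adjust)
{
	return rate_bytes_per_sec / hyp_shaped_len(pkt_len, adjust);
}
```

A negative adjustment such as -14 (the untagged Ethernet header) makes a 1514-byte frame count as 1500 bytes, i.e. the shaper then limits on IP bytes, which is the second use case listed in the commit message.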
> +
> +/**< Invalid WRED profile ID */
> +#define RTE_SCHEDDEV_WRED_PROFILE_ID_NONE                  UINT32_MAX
> +
> +/**< Invalid shaper profile ID */
> +#define RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE                UINT32_MAX
> +
> +/**< Scheduler hierarchy root node ID */
> +#define RTE_SCHEDDEV_ROOT_NODE_ID                          UINT32_MAX
> +
> +
> +/**
> +  * Scheduler node capabilities
> +  */
> +struct rte_scheddev_node_capabilities {
> +	/**< Private shaper support. */
> +	int shaper_private_supported;
> +
> +	/**< Dual rate shaping support for private shaper. Valid only when
> +	 * private shaper is supported.
> +	 */
> +	int shaper_private_dual_rate_supported;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_max;
> +
> +	/**< Maximum number of supported shared shapers. The value of zero
> +	 * indicates that shared shapers are not supported.
> +	 */
> +	uint32_t shaper_shared_n_max;
> +
> +	/**< Items valid only for non-leaf nodes. */
> +	struct {
> +		/**< Maximum number of children nodes. */
> +		uint32_t n_children_max;
> +
> +		/**< Lowest priority supported. The value of 1 indicates that
> +		 * only priority 0 is supported, which essentially means that
> +		 * Strict Priority (SP) algorithm is not supported.
> +		 */
> +		uint32_t sp_priority_min;
> +
This can simply be sp_priority_level, with 0 indicating no support;
1 indicates priorities '0' and '1', or 7 indicates '0' to '7', i.e. a total
of 8 priorities.

> +		/**< Maximum number of sibling nodes that can have the same
> +		 * priority at any given time. When equal to *n_children_max*,
> +		 * it indicates that WFQ/WRR algorithms are not supported.
> +		 */
> +		uint32_t sp_n_children_max;
Not clear to me.
OK, more than one child can have the same priority, and then you apply
WRR/WFQ among them.

However, there can be different sets, e.g. prios '0' and '1' have only one
child each, while prio '2' has 6 children, among which you apply WRR/WFQ.

> +
> +		/**< WFQ algorithm support. */
> +		int scheduling_wfq_supported;
> +
> +		/**< WRR algorithm support. */
> +		int scheduling_wrr_supported;
> +
> +		/**< Maximum WFQ/WRR weight. */
> +		uint32_t scheduling_wfq_wrr_weight_max;
> +	} nonleaf;
> +
> +	/**< Items valid only for leaf nodes. */
> +	struct {
> +		/**< Head drop algorithm support. */
> +		int cman_head_drop_supported;
> +
> +		/**< Private WRED context support. */
> +		int cman_wred_context_private_supported;
> +

The context part is not clear to me.

> +		/**< Maximum number of shared WRED contexts supported. The value
> +		 * of zero indicates that shared WRED contexts are not
> +		 * supported.
> +		 */
> +		uint32_t cman_wred_context_shared_n_max;
> +	} leaf;

Non-leaf nodes may have different capabilities.

Your leaf node is like a QoS queue; are you supporting a shaper on leaf
nodes as well?

I would still prefer that you separate the QoS queue from a standard sched
node: the capabilities are different, and it will be cleaner at the cost
of more structures and APIs.

> +};
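As an illustration of how an application might consume these capability fields (not part of the patch), the sketch below checks whether a requested private shaper fits a node. struct hyp_node_caps copies four fields of rte_scheddev_node_capabilities; the function is hypothetical, and treating a zero committed rate as single rate follows the shaper profile description later in the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of rte_scheddev_node_capabilities. */
struct hyp_node_caps {
	int shaper_private_supported;
	int shaper_private_dual_rate_supported;
	uint64_t shaper_private_rate_min;  /* bytes per second */
	uint64_t shaper_private_rate_max;  /* bytes per second */
};

/* Return 1 if a private shaper with these committed/peak rates can be set
 * up on a node with capabilities *cap, 0 otherwise. committed == 0 selects
 * single rate shaping (committed bucket disabled). */
static int hyp_private_shaper_ok(const struct hyp_node_caps *cap,
	uint64_t committed, uint64_t peak)
{
	assert(cap != NULL);
	if (!cap->shaper_private_supported)
		return 0;
	if (peak < cap->shaper_private_rate_min ||
	    peak > cap->shaper_private_rate_max)
		return 0;
	if (committed == 0)
		return 1;  /* single rate: only the peak bucket is used */
	if (!cap->shaper_private_dual_rate_supported)
		return 0;
	/* Dual rate: committed rate within range and not above peak rate. */
	return committed >= cap->shaper_private_rate_min && committed <= peak;
}
```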
> +
> +/**
> +  * Scheduler capabilities
> +  */
> +struct rte_scheddev_capabilities {
> +	/**< Maximum number of nodes. */
> +	uint32_t n_nodes_max;
> +
> +	/**< Maximum number of levels (i.e. number of nodes connecting the root
> +	 * node with any leaf node, including the root and the leaf).
> +	 */
> +	uint32_t n_levels_max;
> +
> +	/**< Maximum number of shapers, either private or shared. In case the
> +	 * implementation does not share any resource between private and
> +	 * shared shapers, it is typically equal to the sum of
> +	 * *shaper_private_n_max* and *shaper_shared_n_max*.
> +	 */
> +	uint32_t shaper_n_max;
> +
> +	/**< Maximum number of private shapers. Indicates the maximum number of
> +	 * nodes that can concurrently have the private shaper enabled.
> +	 */
> +	uint32_t shaper_private_n_max;
> +
> +	/**< Maximum number of shared shapers. The value of zero indicates that
> +	  * shared shapers are not supported.
> +	  */
> +	uint32_t shaper_shared_n_max;
> +
> +	/**< Maximum number of nodes that can share the same shared shaper. Only
> +	  * valid when shared shapers are supported.
> +	  */
> +	uint32_t shaper_shared_n_nodes_max;
> +
> +	/**< Maximum number of shared shapers that can be configured with dual
> +	  * rate shaping. The value of zero indicates that dual rate shaping
> +	  * support is not available for shared shapers.
> +	  */
> +	uint32_t shaper_shared_dual_rate_n_max;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for shared
> +	  * shapers. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for shared
> +	  * shaper. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_max;
> +
> +	/**< Minimum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_min;
> +
> +	/**< Maximum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_max;
> +
> +	/**< Maximum number of WRED contexts. */
> +	uint32_t cman_wred_context_n_max;
> +
> +	/**< Maximum number of private WRED contexts. Indicates the maximum
> +	  * number of leaf nodes that can concurrently have the private WRED
> +	  * context enabled.
> +	  */
> +	uint32_t cman_wred_context_private_n_max;
> +
> +	/**< Maximum number of shared WRED contexts. The value of zero indicates
> +	  * that shared WRED contexts are not supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_max;
> +
> +	/**< Maximum number of leaf nodes that can share the same WRED context.
> +	  * Only valid when shared WRED contexts are supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_nodes_max;
> +
> +	/**< Support for VLAN DEI packet marking. */
> +	int mark_vlan_dei_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of TCP packets. */
> +	int mark_ip_ecn_tcp_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of SCTP packets. */
> +	int mark_ip_ecn_sctp_supported;
> +
> +	/**< Support for IPv4/IPv6 DSCP packet marking. */
> +	int mark_ip_dscp_supported;
> +
> +	/**< Summary of node-level capabilities across all nodes. */
> +	struct rte_scheddev_node_capabilities node;

This should be an array indexed by level, sized by the number of levels
supported in the system. A non-leaf node at level 2 can have different
capabilities than a level 3 node.

> +};
> +
> +/**
> +  * Congestion management (CMAN) mode
> +  *
> +  * This is used for controlling the admission of packets into a packet queue or
> +  * group of packet queues on congestion. On request of writing a new packet
> +  * into the current queue while the queue is full, the *tail drop* algorithm
> +  * drops the new packet while leaving the queue unmodified, as opposed to *head
> +  * drop* algorithm, which drops the packet at the head of the queue (the oldest
> +  * packet waiting in the queue) and admits the new packet at the tail of the
> +  * queue.
> +  *
> +  * The *Random Early Detection (RED)* algorithm works by proactively dropping
> +  * more and more input packets as the queue occupancy builds up. When the queue
> +  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
> +  * RED* algorithm uses a separate set of RED thresholds for each packet color.
> +  */
> +enum rte_scheddev_cman_mode {
> +	RTE_SCHEDDEV_CMAN_TAIL_DROP = 0, /**< Tail drop */
> +	RTE_SCHEDDEV_CMAN_HEAD_DROP, /**< Head drop */
> +	RTE_SCHEDDEV_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
> +};
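The two drop policies described in the comment above can be contrasted on a small ring buffer. This is a sketch only: packets are represented by integer IDs, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HYP_QSIZE 4

enum hyp_cman { HYP_TAIL_DROP, HYP_HEAD_DROP };

struct hyp_queue {
	uint32_t pkt[HYP_QSIZE];
	uint32_t head;     /* index of the oldest packet */
	uint32_t count;    /* packets currently queued */
	uint64_t dropped;
};

/* Offer one packet to the queue. When the queue is full, tail drop
 * discards the new packet and leaves the queue unmodified, while head drop
 * discards the oldest packet and admits the new one at the tail. */
static void hyp_enqueue(struct hyp_queue *q, uint32_t pkt, enum hyp_cman mode)
{
	assert(q->count <= HYP_QSIZE);
	if (q->count == HYP_QSIZE) {
		q->dropped++;
		if (mode == HYP_TAIL_DROP)
			return;
		q->head = (q->head + 1) % HYP_QSIZE;  /* drop the oldest */
		q->count--;
	}
	q->pkt[(q->head + q->count) % HYP_QSIZE] = pkt;
	q->count++;
}
```

After packets 1..6 are offered to the four-entry queue, both policies drop two packets, but tail drop keeps 1..4 while head drop keeps 3..6.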
> +
> +/**
> +  * Color
> +  */
> +enum rte_scheddev_color {
> +	e_RTE_SCHEDDEV_GREEN = 0, /**< Green */
> +	e_RTE_SCHEDDEV_YELLOW,    /**< Yellow */
> +	e_RTE_SCHEDDEV_RED,       /**< Red */
> +	e_RTE_SCHEDDEV_COLORS     /**< Number of colors */
> +};
> +
> +/**
> +  * WRED profile
> +  */
> +struct rte_scheddev_wred_params {
> +	/**< One set of RED parameters per packet color */
> +	struct rte_red_params red_params[e_RTE_SCHEDDEV_COLORS];
> +};
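A sketch of how one set of RED thresholds per color turns RED into WRED, as described in the CMAN comment above. struct hyp_red_params mirrors the min_th, max_th and maxp_inv fields of DPDK's struct rte_red_params (wq_log2, which controls the averaging of the queue length, is omitted), and the linear drop-probability ramp is the classic RED rule rather than anything this patch specifies.

```c
#include <assert.h>
#include <stdint.h>

enum hyp_color { HYP_GREEN, HYP_YELLOW, HYP_RED, HYP_COLORS };

/* Mirrors three fields of struct rte_red_params. */
struct hyp_red_params {
	uint16_t min_th;    /* no drops below this average queue length */
	uint16_t max_th;    /* everything dropped at or above this length */
	uint16_t maxp_inv;  /* 1 / (drop probability reached near max_th) */
};

/* Same shape as rte_scheddev_wred_params: one RED set per color. */
struct hyp_wred_params {
	struct hyp_red_params red_params[HYP_COLORS];
};

/* Drop probability for a packet of the given color when the average queue
 * length is avg. Only the thresholds differ between colors. */
static double hyp_wred_drop_prob(const struct hyp_wred_params *p,
	enum hyp_color color, double avg)
{
	const struct hyp_red_params *r;

	assert(color < HYP_COLORS);
	r = &p->red_params[color];
	assert(r->min_th < r->max_th && r->maxp_inv > 0);
	if (avg < r->min_th)
		return 0.0;
	if (avg >= r->max_th)
		return 1.0;
	/* Linear ramp from 0 at min_th towards 1/maxp_inv at max_th. */
	return (avg - r->min_th) / (r->max_th - r->min_th) / r->maxp_inv;
}
```

Giving yellow and red packets lower thresholds than green ones makes them drop earlier as the queue builds up.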
> +
> +/**
> +  * Token bucket
> +  */
> +struct rte_scheddev_token_bucket {
> +	/**< Token bucket rate (bytes per second) */
> +	uint64_t rate;
> +
> +	/**< Token bucket size (bytes), a.k.a. max burst size */
> +	uint64_t size;
> +};
> +
> +/**
> +  * Shaper (rate limiter) profile
> +  *
> +  * Multiple shaper instances can share the same shaper profile. Each node has
> +  * zero or one private shaper (only one node using it) and/or zero, one or
> +  * several shared shapers (multiple nodes use the same shaper instance).
> +  *
> +  * Single rate shapers use a single token bucket. A single rate shaper can be
> +  * configured by setting the rate of the committed bucket to zero, which
> +  * effectively disables this bucket. The peak bucket is used to limit the rate
> +  * and the burst size for the current shaper.
> +  *
> +  * Dual rate shapers use both the committed and the peak token buckets. The
> +  * rate of the committed bucket has to be less than or equal to the rate of the
> +  * peak bucket.
> +  */
> +struct rte_scheddev_shaper_params {
> +	/**< Committed token bucket */
> +	struct rte_scheddev_token_bucket committed;
> +
> +	/**< Peak token bucket */
> +	struct rte_scheddev_token_bucket peak;
> +
> +	/**< Signed value to be added to the length of each packet for the
> +	 * purpose of shaping. Can be used to correct the packet length with
> +	 * the framing overhead bytes that are also consumed on the wire (e.g.
> +	 * RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS).
> +	 */
> +	int32_t pkt_length_adjust;
> +};
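The single rate case described above (committed bucket disabled, peak bucket limiting rate and burst) can be sketched with one token bucket. The run-time fields tokens and last_s, the function name, and the use of double for time and credit are assumptions for illustration; how the committed and peak buckets interact in the dual rate case is not specified in this excerpt, so it is not modeled.

```c
#include <assert.h>
#include <stdint.h>

struct hyp_tb {
	uint64_t rate;   /* bytes per second (as in rte_scheddev_token_bucket) */
	uint64_t size;   /* bucket size in bytes, i.e. max burst */
	double tokens;   /* hypothetical run-time state: current credit */
	double last_s;   /* hypothetical run-time state: last update time */
};

/* Try to send one packet of pkt_len bytes at time now_s (seconds). The
 * packet is charged pkt_len + adjust bytes, where adjust plays the role of
 * pkt_length_adjust. Returns 1 and consumes tokens if the packet conforms,
 * 0 if it has to wait. */
static int hyp_tb_conform(struct hyp_tb *tb, double now_s, uint32_t pkt_len,
	int32_t adjust)
{
	double cost = (double)pkt_len + adjust;

	assert(now_s >= tb->last_s && cost > 0);
	/* Refill for the elapsed time, capped at the bucket size. */
	tb->tokens += (now_s - tb->last_s) * (double)tb->rate;
	if (tb->tokens > (double)tb->size)
		tb->tokens = (double)tb->size;
	tb->last_s = now_s;

	if (tb->tokens < cost)
		return 0;
	tb->tokens -= cost;
	return 1;
}
```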
> +
> +/**
> +  * Node parameters
> +  *
> +  * Each scheduler hierarchy node has multiple inputs (children nodes of the
> +  * current parent node) and a single output (which is input to its parent
> +  * node). The current node arbitrates its inputs using Strict Priority (SP),
> +  * Weighted Fair Queuing (WFQ) and Weighted Round Robin (WRR) algorithms to
> +  * schedule input packets on its output while observing its shaping (rate
> +  * limiting) constraints.
> +  *
> +  * Algorithms such as byte-level WRR, Deficit WRR (DWRR), etc. are considered
> +  * approximations of ideal WFQ and are treated as WFQ, although an
> +  * associated implementation-dependent trade-off on accuracy, performance
> +  * and resource usage might exist.
> +  *
> +  * Children nodes with different priorities are scheduled using the SP
> +  * algorithm, based on their priority, with zero (0) as the highest priority.
> +  * Children with same priority are scheduled using the WFQ or WRR algorithm,
> +  * based on their weight, which is relative to the sum of the weights of all
> +  * siblings with same priority, with one (1) as the lowest weight.
> +  *
> +  * Each leaf node sits on top of a TX queue of the current Ethernet port.
> +  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
> +  * where N is the number of TX queues configured for the current Ethernet port.
> +  * The non-leaf nodes have their IDs generated by the application.
> +  */


OK, that means 0 to N-1 is reserved for leaf nodes, and the application
can choose any value for non-leaf nodes?
What will be the parent node ID for the root node?
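The scheduling rule stated in the node parameters comment above (SP across priorities with 0 as the highest, then weight-proportional sharing among same-priority siblings) can be sketched as follows. This is an illustration only: struct hyp_child is hypothetical, and within a priority it uses the "smooth" weighted round robin variant, one of several WRR flavors; a real implementation would more likely account in bytes (closer to WFQ) than in packets.

```c
#include <assert.h>
#include <stdint.h>

struct hyp_child {
	uint32_t priority;  /* 0 = highest, as in the patch */
	uint32_t weight;    /* >= 1, relative to same-priority siblings */
	int backlogged;     /* non-zero if the child has a packet to send */
	int64_t current;    /* hypothetical WRR state, starts at 0 */
};

/* Return the index of the child to serve next, or -1 if none is
 * backlogged. */
static int hyp_pick_child(struct hyp_child *c, int n)
{
	uint32_t best_prio = 0;
	int have = 0, pick = -1;
	int64_t total = 0;

	/* SP: find the highest priority (lowest value) with pending packets. */
	for (int i = 0; i < n; i++)
		if (c[i].backlogged && (!have || c[i].priority < best_prio)) {
			best_prio = c[i].priority;
			have = 1;
		}
	if (!have)
		return -1;

	/* Smooth WRR among the backlogged children sharing that priority. */
	for (int i = 0; i < n; i++) {
		if (!c[i].backlogged || c[i].priority != best_prio)
			continue;
		assert(c[i].weight >= 1);
		c[i].current += c[i].weight;
		total += c[i].weight;
		if (pick < 0 || c[i].current > c[pick].current)
			pick = i;
	}
	c[pick].current -= total;
	return pick;
}
```

With two backlogged priority-1 children of weights 3 and 1, four consecutive picks serve them 3:1, and a backlogged priority-2 child is not served until both are idle.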

> +struct rte_scheddev_node_params {
> +	/**< Shaper profile for the private shaper. The absence of the private
> +	 * shaper for the current node is indicated by setting this parameter
> +	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
> +	 */
> +	uint32_t shaper_profile_id;
> +
> +	/**< User allocated array of valid shared shaper IDs. */
> +	uint32_t *shared_shaper_id;
> +
> +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> +	uint32_t n_shared_shapers;
> +
> +	union {
> +		/**< Parameters only valid for non-leaf nodes. */
> +		struct {
> +			/**< For each priority, indicates whether the children
> +			 * nodes sharing the same priority are to be scheduled
> +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> +			 * is to be used for all priorities. When non-NULL, it
> +			 * points to a pre-allocated array of *n_priorities*
> +			 * elements, with a non-zero value element indicating
> +			 * WFQ and a zero value element for WRR.
> +			 */
> +			int *scheduling_mode_per_priority;

What is the structure of the pointed-to element? Just a bool array?

> +
> +			/**< Number of priorities. */
> +			uint32_t n_priorities;
> +		} nonleaf;
> +
> +		/**< Parameters only valid for leaf nodes. */
> +		struct {
> +			/**< Congestion management mode */
> +			enum rte_scheddev_cman_mode cman;
> +
> +			/**< WRED parameters (valid when *cman* is WRED). */
> +			struct {
> +				/**< WRED profile for private WRED context. The
> +				 * absence of a private WRED context for current
> +				 * leaf node is indicated by value
> +				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
> +				 */
> +				uint32_t wred_profile_id;
> +
> +				/**< User allocated array of shared WRED context
> +				 * IDs.
> +				 */
> +				uint32_t *shared_wred_context_id;
> +
> +				/**< Number of shared WRED context IDs in the
> +				 * *shared_wred_context_id* array.
> +				 */
> +				uint32_t n_shared_wred_contexts;
> +			} wred;
> +		} leaf;

Need a bool is_leaf here to differentiate between leaf and non-leaf nodes.

> +	};
> +};
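Read literally, the *scheduling_mode_per_priority* comment describes an int array used as one boolean per priority. The sketch below decodes it under that reading; struct hyp_nonleaf_params copies the two non-leaf fields, and the enum and function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the non-leaf part of rte_scheddev_node_params. */
struct hyp_nonleaf_params {
	int *scheduling_mode_per_priority;  /* NULL, or n_priorities elements */
	uint32_t n_priorities;
};

enum hyp_sched_mode { HYP_WRR = 0, HYP_WFQ = 1 };

/* Mode used among same-priority children, following the patch comment:
 * NULL means WFQ for all priorities; otherwise non-zero selects WFQ and
 * zero selects WRR. */
static enum hyp_sched_mode hyp_mode_for_priority(
	const struct hyp_nonleaf_params *p, uint32_t prio)
{
	assert(prio < p->n_priorities);
	if (p->scheduling_mode_per_priority == NULL)
		return HYP_WFQ;
	return p->scheduling_mode_per_priority[prio] ? HYP_WFQ : HYP_WRR;
}
```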
> +
> +/**
> +  * Node statistics counter type
> +  */
> +enum rte_scheddev_stats_counter {
> +	/**< Number of packets scheduled from current node. */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS = 1 << 0,
> +
> +	/**< Number of bytes scheduled from current node. */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES = 1 << 1,
> +
> +	/**< Number of packets dropped by current node.  */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
> +
> +	/**< Number of bytes dropped by current node.  */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
> +
> +	/**< Number of packets currently waiting in the packet queue of current
> +	 * leaf node.
> +	 */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
> +
> +	/**< Number of bytes currently waiting in the packet queue of current
> +	 * leaf node.
> +	 */
> +	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,
> +};
> +
> +/**
> +  * Node statistics counters
> +  */
> +struct rte_scheddev_node_stats {
> +	/**< Number of packets scheduled from current node. */
> +	uint64_t n_pkts;
> +
> +	/**< Number of bytes scheduled from current node. */
> +	uint64_t n_bytes;
> +
> +	/**< Statistics counters for leaf nodes only. */
> +	struct {
> +		/**< Number of packets dropped by current leaf node. */
> +		uint64_t n_pkts_dropped;
> +
> +		/**< Number of bytes dropped by current leaf node. */
> +		uint64_t n_bytes_dropped;
> +
> +		/**< Number of packets currently waiting in the packet queue of
> +		 * current leaf node.
> +		 */
> +		uint64_t n_pkts_queued;
> +
> +		/**< Number of bytes currently waiting in the packet queue of
> +		 * current leaf node.
> +		 */
> +		uint64_t n_bytes_queued;
> +	} leaf;
> +};
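The counter-type enum is a bit mask, which suggests that an application reads only the fields of struct rte_scheddev_node_stats whose bits are set. The sketch below applies that idea to a leaf drop ratio; the mask values and struct layout are copied from the patch, while the function, the way the mask is obtained, and the assumption that offered packets = scheduled + dropped are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum hyp_stats_counter {  /* same bit values as rte_scheddev_stats_counter */
	HYP_N_PKTS = 1 << 0,
	HYP_N_BYTES = 1 << 1,
	HYP_N_PKTS_DROPPED = 1 << 2,
	HYP_N_BYTES_DROPPED = 1 << 3,
	HYP_N_PKTS_QUEUED = 1 << 4,
	HYP_N_BYTES_QUEUED = 1 << 5,
};

struct hyp_node_stats {  /* same layout as rte_scheddev_node_stats */
	uint64_t n_pkts;
	uint64_t n_bytes;
	struct {
		uint64_t n_pkts_dropped;
		uint64_t n_bytes_dropped;
		uint64_t n_pkts_queued;
		uint64_t n_bytes_queued;
	} leaf;
};

/* Packet drop ratio of a leaf node, computed only when the mask marks both
 * required counters as valid. Returns 0 on success, -1 otherwise. */
static int hyp_leaf_drop_ratio(const struct hyp_node_stats *s, uint64_t mask,
	double *ratio)
{
	const uint64_t need = HYP_N_PKTS | HYP_N_PKTS_DROPPED;
	uint64_t offered;

	assert(s != NULL && ratio != NULL);
	if ((mask & need) != need)
		return -1;
	offered = s->n_pkts + s->leaf.n_pkts_dropped;
	*ratio = offered ? (double)s->leaf.n_pkts_dropped / (double)offered : 0.0;
	return 0;
}
```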
> +
> +/**
> + * Verbose error types.
> + *
> + * Most of them provide the type of the object referenced by struct
> + * rte_scheddev_error::cause.
> + */
> +enum rte_scheddev_error_type {
> +	RTE_SCHEDDEV_ERROR_TYPE_NONE, /**< No error. */
> +	RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED, /**< Cause unspecified. */
> +	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE,
> +	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_GREEN,
> +	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_YELLOW,
> +	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_RED,
> +	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_SHARED_WRED_CONTEXT_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_SHAPER_PROFILE,
> +	RTE_SCHEDDEV_ERROR_TYPE_SHARED_SHAPER_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_PARENT_NODE_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_PRIORITY,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_WEIGHT,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SCHEDULING_MODE,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SHAPER_PROFILE_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SHARED_SHAPER_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_CMAN,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_WRED_PROFILE_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_SHARED_WRED_CONTEXT_ID,
> +	RTE_SCHEDDEV_ERROR_TYPE_NODE_ID,
> +};
> +
> +/**
> + * Verbose error structure definition.
> + *
> + * This object is normally allocated by applications and set by PMDs, the
> + * message points to a constant string which does not need to be freed by
> + * the application, however its pointer can be considered valid only as long
> + * as its associated DPDK port remains configured. Closing the underlying
> + * device or unloading the PMD invalidates it.
> + *
> + * Both cause and message may be NULL regardless of the error type.
> + */
> +struct rte_scheddev_error {
> +	enum rte_scheddev_error_type type; /**< Cause field and error type. */
> +	const void *cause; /**< Object responsible for the error. */
> +	const char *message; /**< Human-readable error message. */
> +};
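rte_flow, which the commit message cites as the model for these domain-specific errors, provides rte_flow_error_set() for drivers to fill its error structure. A comparable helper for this API is sketched below; hyp_error_set() and hyp_check_node_id() are hypothetical, and returning a negative errno value is an assumption, since the patch only promises a non-zero error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local subset of enum rte_scheddev_error_type. */
enum hyp_error_type {
	HYP_ERROR_TYPE_NONE,
	HYP_ERROR_TYPE_UNSPECIFIED,
	HYP_ERROR_TYPE_NODE_ID,
};

struct hyp_error {  /* same fields as struct rte_scheddev_error */
	enum hyp_error_type type;
	const void *cause;
	const char *message;
};

/* Fill the optional error object and return a negative errno value. */
static int hyp_error_set(struct hyp_error *error, int code,
	enum hyp_error_type type, const void *cause, const char *message)
{
	assert(code > 0);
	if (error != NULL) {
		error->type = type;
		error->cause = cause;
		error->message = message;  /* constant string, never freed */
	}
	return -code;
}

/* Hypothetical driver-side validation using the helper. */
static int hyp_check_node_id(uint32_t node_id, uint32_t n_nodes,
	struct hyp_error *error)
{
	if (node_id >= n_nodes)
		return hyp_error_set(error, EINVAL, HYP_ERROR_TYPE_NODE_ID,
			NULL, "node ID out of range");
	return 0;
}
```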
> +
> +/**
> + * Scheduler capabilities get
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param cap
> + *   Scheduler capabilities. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_capabilities_get(uint8_t port_id,
> +	struct rte_scheddev_capabilities *cap,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node capabilities get
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param cap
> + *   Scheduler node capabilities. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_capabilities_get(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_capabilities *cap,
> +	struct rte_scheddev_error *error);
> +

Aren't node capabilities already part of scheddev_capabilities?

What are you expecting to be different here? Unless you support different
capabilities for each level, this may not be useful.

> +/**
> + * Scheduler WRED profile add
> + *
> + * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
> + * is used to create one or several WRED contexts.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param wred_profile_id
> + *   WRED profile ID for the new profile. Needs to be unused.
> + * @param profile
> + *   WRED profile parameters. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_wred_profile_add(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_wred_params *profile,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler WRED profile delete
> + *
> + * Delete an existing WRED profile. This operation fails when there is currently
> + * at least one user (i.e. WRED context) of this WRED profile.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param wred_profile_id
> + *   WRED profile ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_wred_profile_delete(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler shared WRED context add or update
> + *
> + * When *shared_wred_context_id* is invalid, a new WRED context with this ID is
> + * created by using the WRED profile identified by *wred_profile_id*.
> + *
> + * When *shared_wred_context_id* is valid, this WRED context is no longer using
> + * the profile previously assigned to it and is updated to use the profile
> + * identified by *wred_profile_id*.
> + *
> + * A valid shared WRED context can be assigned to several scheduler hierarchy
> + * leaf nodes configured to use WRED as the congestion management mode.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shared_wred_context_id
> + *   Shared WRED context ID
> + * @param wred_profile_id
> + *   WRED profile ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shared_wred_context_add_update(uint8_t port_id,
> +	uint32_t shared_wred_context_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler shared WRED context delete
> + *
> + * Delete an existing shared WRED context. This operation fails when there is
> + * currently at least one user (i.e. scheduler hierarchy leaf node) of this
> + * shared WRED context.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shared_wred_context_id
> + *   Shared WRED context ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shared_wred_context_delete(uint8_t port_id,
> +	uint32_t shared_wred_context_id,
> +	struct rte_scheddev_error *error);
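The ID and lifetime rules for WRED profiles and shared WRED contexts described above (add requires an unused ID, add/update moves an existing context to a new profile, delete fails while users remain) can be modeled with reference counts. This is a hypothetical sketch: the table sizes, the hyp_* names and the negative errno return values are assumptions, and attaching leaf nodes to a context (which would increment its n_users) is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define HYP_N 8                  /* hypothetical table sizes */
#define HYP_ID_NONE UINT32_MAX   /* like RTE_SCHEDDEV_WRED_PROFILE_ID_NONE */

struct hyp_obj {                 /* a WRED profile or a shared WRED context */
	int used;
	uint32_t id;
	uint32_t n_users;        /* objects currently referencing this one */
	uint32_t profile_id;     /* contexts only: the profile in use */
};

static struct hyp_obj hyp_prof[HYP_N], hyp_ctx[HYP_N];

static struct hyp_obj *hyp_find(struct hyp_obj *t, uint32_t id)
{
	for (int i = 0; i < HYP_N; i++)
		if (t[i].used && t[i].id == id)
			return &t[i];
	return NULL;
}

static struct hyp_obj *hyp_alloc(struct hyp_obj *t, uint32_t id)
{
	for (int i = 0; i < HYP_N; i++)
		if (!t[i].used) {
			t[i] = (struct hyp_obj){ .used = 1, .id = id };
			return &t[i];
		}
	return NULL;
}

/* WRED profile add: the application-chosen ID must be unused. */
static int hyp_wred_profile_add(uint32_t id)
{
	if (id == HYP_ID_NONE || hyp_find(hyp_prof, id) != NULL)
		return -EINVAL;
	return hyp_alloc(hyp_prof, id) != NULL ? 0 : -ENOSPC;
}

/* WRED profile delete: fails while any WRED context still uses it. */
static int hyp_wred_profile_delete(uint32_t id)
{
	struct hyp_obj *p = hyp_find(hyp_prof, id);

	if (p == NULL)
		return -EINVAL;
	if (p->n_users > 0)
		return -EBUSY;
	p->used = 0;
	return 0;
}

/* Shared WRED context add or update: create the context if its ID is not
 * in use, otherwise move it from its current profile to the new one. */
static int hyp_shared_wred_context_add_update(uint32_t ctx_id,
	uint32_t profile_id)
{
	struct hyp_obj *p = hyp_find(hyp_prof, profile_id);
	struct hyp_obj *c = hyp_find(hyp_ctx, ctx_id);

	if (p == NULL)
		return -EINVAL;
	if (c == NULL) {
		c = hyp_alloc(hyp_ctx, ctx_id);
		if (c == NULL)
			return -ENOSPC;
	} else {
		struct hyp_obj *old = hyp_find(hyp_prof, c->profile_id);

		assert(old != NULL && old->n_users > 0);
		old->n_users--;
	}
	c->profile_id = profile_id;
	p->n_users++;
	return 0;
}

/* Shared WRED context delete: fails while any leaf node still uses it. */
static int hyp_shared_wred_context_delete(uint32_t ctx_id)
{
	struct hyp_obj *c = hyp_find(hyp_ctx, ctx_id);
	struct hyp_obj *p;

	if (c == NULL)
		return -EINVAL;
	if (c->n_users > 0)
		return -EBUSY;
	p = hyp_find(hyp_prof, c->profile_id);
	assert(p != NULL && p->n_users > 0);
	p->n_users--;
	c->used = 0;
	return 0;
}
```

The same pattern would carry over to shaper profiles and shared shapers, whose add/update and delete rules below are stated in the same terms.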
> +
> +/**
> + * Scheduler shaper profile add
> + *
> + * Create a new shaper profile with ID set to *shaper_profile_id*. The new
> + * shaper profile is used to create one or several shapers.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shaper_profile_id
> + *   Shaper profile ID for the new profile. Needs to be unused.
> + * @param profile
> + *   Shaper profile parameters. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shaper_profile_add(uint8_t port_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_shaper_params *profile,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler shaper profile delete
> + *
> + * Delete an existing shaper profile. This operation fails when there is
> + * currently at least one user (i.e. shaper) of this shaper profile.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shaper_profile_id
> + *   Shaper profile ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shaper_profile_delete(uint8_t port_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +
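The "delete fails while the profile has users" rule implies some per-profile usage count on the driver side. The sketch below is one minimal way to model it. `struct shaper_profile`, `shaper_profile_get()`, `shaper_profile_put()` and `shaper_profile_delete()` are hypothetical, and returning -EBUSY is an assumption, since the header only promises a non-zero error code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-profile bookkeeping kept by a driver. */
struct shaper_profile {
	int valid;
	uint32_t n_users;  /* number of shapers currently using this profile */
};

/* Attach one more shaper to the profile. */
int shaper_profile_get(struct shaper_profile *p)
{
	if (!p->valid)
		return -EINVAL;
	p->n_users++;
	return 0;
}

/* Detach one shaper from the profile. */
int shaper_profile_put(struct shaper_profile *p)
{
	if (!p->valid || p->n_users == 0)
		return -EINVAL;
	p->n_users--;
	return 0;
}

/* Delete is refused while at least one shaper still uses the profile. */
int shaper_profile_delete(struct shaper_profile *p)
{
	if (!p->valid)
		return -EINVAL;
	if (p->n_users > 0)
		return -EBUSY;
	p->valid = 0;
	return 0;
}
```

The same pattern would presumably apply to WRED profiles, shared shapers and shared WRED contexts, which carry the same "fails when there is at least one user" wording.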
> +/**
> + * Scheduler shared shaper add or update
> + *
> + * When *shared_shaper_id* is not a valid shared shaper ID, a new shared shaper
> + * with this ID is created using the shaper profile identified by
> + * *shaper_profile_id*.
> + *
> + * When *shared_shaper_id* is a valid shared shaper ID, this shared shaper is no
> + * longer using the shaper profile previously assigned to it and is updated to
> + * use the shaper profile identified by *shaper_profile_id*.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shared_shaper_id
> + *   Shared shaper ID
> + * @param shaper_profile_id
> + *   Shaper profile ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shared_shaper_add_update(uint8_t port_id,
> +	uint32_t shared_shaper_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler shared shaper delete
> + *
> + * Delete an existing shared shaper. This operation fails when there is
> + * currently at least one user (i.e. scheduler hierarchy node) of this shared
> + * shaper.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param shared_shaper_id
> + *   Shared shaper ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_shared_shaper_delete(uint8_t port_id,
> +	uint32_t shared_shaper_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node add
> + *
> + * When *node_id* is not a valid node ID, a new node with this ID is created and
> + * connected as child to the existing node identified by *parent_node_id*.
> + *
> + * When *node_id* is a valid node ID, this node is disconnected from its current
> + * parent and connected as child to another existing node identified by
> + * *parent_node_id*.
> + *
> + * This function can be called during port initialization phase (before the
> + * Ethernet port is started) for building the scheduler start-up hierarchy.
> + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> + * hierarchy updates, this function can also be called during run-time (after
> + * the Ethernet port is started).

This should be a capability: whether dynamic hierarchy updates are
supported or not.

> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID
> + * @param parent_node_id
> + *   Parent node ID. Needs to be valid.

What will be the parent node ID for the root node? How is the root node
created on the Ethernet port?

> + * @param priority
> + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> + *   running on the parent of the current node for scheduling this child node.
> + * @param weight
> + *   Node weight. The node weight is relative to the weight sum of all siblings
> + *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
> + *   algorithm running on the parent of the current node for scheduling this
> + *   child node.
> + * @param params
> + *   Node parameters. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_node_params *params,
> +	struct rte_scheddev_error *error);
> +
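Here is how *priority* and *weight* interact, as I read the description above. Strict priority selects which priority level among the children is served. The weight then only splits bandwidth among siblings that share that priority. This is a minimal sketch: `struct child` and `wfq_share()` are hypothetical helpers, and real WFQ/WRR implementations only approximate these ideal shares.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one child node as seen by its parent's scheduler. */
struct child {
	uint32_t priority;  /* 0 is the highest priority (served first by SP) */
	uint32_t weight;    /* >= 1, only compared with same-priority siblings */
};

/*
 * Ideal fraction of the bandwidth available to the priority level of
 * children[idx] that WFQ/WRR gives to children[idx]: its weight divided by
 * the weight sum of all siblings with the same priority.
 */
double wfq_share(const struct child *children, size_t n, size_t idx)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		if (children[i].priority == children[idx].priority)
			sum += children[i].weight;
	return sum ? (double)children[idx].weight / (double)sum : 0.0;
}
```

With children {prio 0, w 1}, {prio 0, w 3} and {prio 1, w 5}, the first two split the priority-0 bandwidth 25%/75%. The third gets all of whatever priority 0 leaves unused.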
> +/**
> + * Scheduler node delete
> + *
> + * Delete an existing node. This operation fails when this node currently has at
> + * least one user (i.e. child node).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_delete(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node suspend
> + *
> + * Suspend an existing node.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_suspend(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node resume
> + *
> + * Resume an existing node that was previously suspended.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_resume(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler hierarchy set
> + *
> + * This function is called during the port initialization phase (before the
> + * Ethernet port is started) to freeze the scheduler start-up hierarchy.
> + *
> + * This function fails when the currently configured scheduler hierarchy is not
> + * supported by the Ethernet port, in which case the user can abort or try out
> + * another hierarchy configuration (e.g. a hierarchy with fewer leaf nodes),
> + * which can be built from scratch (when *clear_on_fail* is enabled) or by
> + * modifying the existing hierarchy configuration (when *clear_on_fail* is
> + * disabled).
> + *
> + * Note that, even when the configured scheduler hierarchy is supported (so this
> + * function is successful), the Ethernet port start might still fail, e.g. due
> + * to insufficient memory being available in the system.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param clear_on_fail
> + *   On function call failure, hierarchy is cleared when this parameter is
> + *   non-zero and preserved when this parameter is equal to zero.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_hierarchy_set(uint8_t port_id,
> +	int clear_on_fail,
> +	struct rte_scheddev_error *error);
> +
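The "abort or try another hierarchy" flow could look roughly like the fallback loop below. This is a sketch under stated assumptions. `mock_hierarchy_set()` is a hypothetical stand-in for `rte_scheddev_hierarchy_set()` that fails only when the leaf count exceeds what the port supports. Halving the leaf count on each retry is an arbitrary, illustrative policy.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for rte_scheddev_hierarchy_set(): succeeds only when
 * the requested number of leaf nodes fits what the port can support.
 */
static int mock_hierarchy_set(uint32_t n_leaves, uint32_t port_max_leaves)
{
	return n_leaves <= port_max_leaves ? 0 : -1;
}

/*
 * Try progressively smaller hierarchies, halving the leaf count each time.
 * This matches rebuilding from scratch with clear_on_fail enabled.
 * Returns the accepted leaf count, or 0 when no attempt succeeded.
 */
uint32_t build_hierarchy_with_fallback(uint32_t n_leaves,
	uint32_t port_max_leaves)
{
	while (n_leaves > 0) {
		/* A real application would re-add all nodes here before calling
		 * rte_scheddev_hierarchy_set(port_id, 1, &error). */
		if (mock_hierarchy_set(n_leaves, port_max_leaves) == 0)
			return n_leaves;
		n_leaves /= 2;
	}
	return 0;
}
```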
> +/**
> + * Scheduler node parent update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param parent_node_id
> + *   Node ID for the new parent. Needs to be valid.
> + * @param priority
> + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> + *   running on the parent of the current node for scheduling this child node.
> + * @param weight
> + *   Node weight. The node weight is relative to the weight sum of all siblings
> + *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
> + *   algorithm running on the parent of the current node for scheduling this
> + *   child node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_parent_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_error *error);
> +

The usage is not clear. How is it different from the node_add API?
Is the intention to update a specific node, or to change the connection of a
specific node to an existing or new parent?


> +/**
> + * Scheduler node private shaper update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param shaper_profile_id
> + *   Shaper profile ID for the private shaper of the current node. Needs to be
> + *   either valid shaper profile ID or RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE, with
> + *   the latter disabling the private shaper of the current node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_shaper_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node shared shapers update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param shared_shaper_id
> + *   Shared shaper ID. Needs to be valid.
> + * @param add
> + *   Set to non-zero value to add this shared shaper to current node or to zero
> + *   to delete this shared shaper from current node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_shared_shaper_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t shared_shaper_id,
> +	int add,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node scheduling mode update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid non-leaf node ID.
> + * @param scheduling_mode_per_priority
> + *   For each priority, indicates whether the children nodes sharing the same
> + *   priority are to be scheduled by WFQ or by WRR. When NULL, it indicates that
> + *   WFQ is to be used for all priorities. When non-NULL, it points to a
> + *   pre-allocated array of *n_priorities* elements, with a non-zero value element
> + *   indicating WFQ and a zero value element for WRR.
> + * @param n_priorities
> + *   Number of priorities.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_scheduling_mode_update(uint8_t port_id,
> +	uint32_t node_id,
> +	int *scheduling_mode_per_priority,
> +	uint32_t n_priorities,
> +	struct rte_scheddev_error *error);
> +
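The encoding of *scheduling_mode_per_priority* is easy to get backwards, because non-zero means WFQ and zero means WRR. The decoder below spells it out. `enum sched_mode` and `sched_mode_for_priority()` are hypothetical names used only for illustration. Rejecting out-of-range priorities is an assumption on my part.

```c
#include <stddef.h>
#include <stdint.h>

enum sched_mode { SCHED_MODE_WRR = 0, SCHED_MODE_WFQ = 1 };

/*
 * Decode the scheduling_mode_per_priority convention: NULL means WFQ for
 * every priority; otherwise a non-zero element selects WFQ and a zero
 * element selects WRR. Returns -1 for an out-of-range priority.
 */
int sched_mode_for_priority(const int *scheduling_mode_per_priority,
	uint32_t n_priorities, uint32_t priority)
{
	if (priority >= n_priorities)
		return -1;
	if (scheduling_mode_per_priority == NULL)
		return SCHED_MODE_WFQ;
	return scheduling_mode_per_priority[priority] ?
		SCHED_MODE_WFQ : SCHED_MODE_WRR;
}
```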
> +/**
> + * Scheduler node congestion management mode update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid leaf node ID.
> + * @param cman
> + *   Congestion management mode.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_cman_update(uint8_t port_id,
> +	uint32_t node_id,
> +	enum rte_scheddev_cman_mode cman,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node private WRED context update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid leaf node ID.
> + * @param wred_profile_id
> + *   WRED profile ID for the private WRED context of the current node. Needs to
> + *   be either valid WRED profile ID or RTE_SCHEDDEV_WRED_PROFILE_ID_NONE, with
> + *   the latter disabling the private WRED context of the current node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_wred_context_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node shared WRED context update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid leaf node ID.
> + * @param shared_wred_context_id
> + *   Shared WRED context ID. Needs to be valid.
> + * @param add
> + *   Set to non-zero value to add this shared WRED context to current node or to
> + *   zero to delete this shared WRED context from current node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_shared_wred_context_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t shared_wred_context_id,
> +	int add,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler packet marking - VLAN DEI (IEEE 802.1Q)
> + *
> + * IEEE 802.1p maps the traffic class to the VLAN Priority Code Point (PCP)
> + * field (3 bits), while IEEE 802.1Q maps the drop priority to the VLAN Drop
> + * Eligible Indicator (DEI) field (1 bit), which was previously named Canonical
> + * Format Indicator (CFI).
> + *
> + * All VLAN frames of a given color get their DEI bit set if marking is enabled
> + * for this color; otherwise, their DEI bit is left as is (either set or not).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mark_green
> + *   Set to non-zero value to enable marking of green packets and to zero to
> + *   disable it.
> + * @param mark_yellow
> + *   Set to non-zero value to enable marking of yellow packets and to zero to
> + *   disable it.
> + * @param mark_red
> + *   Set to non-zero value to enable marking of red packets and to zero to
> + *   disable it.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_mark_vlan_dei(uint8_t port_id,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +
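The per-color DEI rule can be modelled in a few lines. In the 16-bit VLAN TCI, PCP occupies bits 15..13, DEI is bit 12 and the VLAN ID is bits 11..0. This is a minimal sketch: `enum color`, `vlan_dei_mark()` and the `mark[]` array (holding *mark_green*, *mark_yellow*, *mark_red*) are hypothetical. The TCI is assumed to be already converted to host byte order.

```c
#include <stdint.h>

enum color { COLOR_GREEN = 0, COLOR_YELLOW, COLOR_RED };

#define VLAN_DEI_MASK 0x1000u  /* bit 12 of the TCI, host byte order */

/*
 * Set DEI when marking is enabled for the packet's color; otherwise leave
 * the TCI untouched (DEI may already be set or not).
 */
uint16_t vlan_dei_mark(uint16_t tci, enum color c, const int mark[3])
{
	if (mark[c])
		tci = (uint16_t)(tci | VLAN_DEI_MASK);
	return tci;
}
```

Note that marking only ever sets the bit. A green packet that arrives with DEI already set keeps it, which matches the "left as is" wording above.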
> +/**
> + * Scheduler packet marking - IPv4 / IPv6 ECN (IETF RFC 3168)
> + *
> + * IETF RFCs 2474 and 3168 reorganize the IPv4 Type of Service (TOS) field
> + * (8 bits) and the IPv6 Traffic Class (TC) field (8 bits) into Differentiated
> + * Services Codepoint (DSCP) field (6 bits) and Explicit Congestion Notification
> + * (ECN) field (2 bits). The DSCP field is typically used to encode the traffic
> + * class and/or drop priority (RFC 2597), while the ECN field is used by RFC
> + * 3168 to implement a congestion notification mechanism to be leveraged by
> + * transport layer protocols such as TCP and SCTP that have congestion control
> + * mechanisms.
> + *
> + * When congestion is experienced, as alternative to dropping the packet,
> + * routers can change the ECN field of input packets from 2'b01 or 2'b10 (values
> + * indicating that source endpoint is ECN-capable) to 2'b11 (meaning that
> + * congestion is experienced). The destination endpoint can use the ECN-Echo
> + * (ECE) TCP flag to relay the congestion indication back to the source
> + * endpoint, which acknowledges it back to the destination endpoint with the
> + * Congestion Window Reduced (CWR) TCP flag.
> + *
> + * All IPv4/IPv6 packets of a given color with ECN set to 2'b01 or 2'b10
> + * carrying TCP or SCTP have their ECN set to 2'b11 if the marking feature is
> + * enabled for the current color, otherwise the ECN field is left as is.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mark_green
> + *   Set to non-zero value to enable marking of green packets and to zero to
> + *   disable it.
> + * @param mark_yellow
> + *   Set to non-zero value to enable marking of yellow packets and to zero to
> + *   disable it.
> + * @param mark_red
> + *   Set to non-zero value to enable marking of red packets and to zero to
> + *   disable it.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_mark_ip_ecn(uint8_t port_id,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +
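The ECN rule above touches only the two low bits of the IPv4 TOS / IPv6 Traffic Class byte. It fires only for ECN-capable packets (01 or 10) carrying TCP (protocol 6) or SCTP (protocol 132). In this sketch, `enum color`, `ip_ecn_mark()` and the `mark[]` flag array are hypothetical. Locating the TOS byte and the L4 protocol in the packet is left out.

```c
#include <stdint.h>

enum color { COLOR_GREEN = 0, COLOR_YELLOW, COLOR_RED };

#define L4_PROTO_TCP  6     /* IANA protocol number for TCP */
#define L4_PROTO_SCTP 132   /* IANA protocol number for SCTP */
#define ECN_MASK      0x03u /* low 2 bits of TOS / Traffic Class */
#define ECN_CE        0x03u /* Congestion Experienced */

/*
 * Set ECN to CE (2'b11) for ECN-capable TCP/SCTP packets whose color has
 * marking enabled; everything else is returned unchanged.
 */
uint8_t ip_ecn_mark(uint8_t tos, uint8_t l4_proto, enum color c,
	const int mark[3])
{
	unsigned int ecn = tos & ECN_MASK;
	int ect = (ecn == 0x01u || ecn == 0x02u);  /* ECT(1) or ECT(0) */
	int l4_ok = (l4_proto == L4_PROTO_TCP || l4_proto == L4_PROTO_SCTP);

	if (mark[c] && ect && l4_ok)
		tos = (uint8_t)(tos | ECN_CE);
	return tos;
}
```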
> +/**
> + * Scheduler packet marking - IPv4 / IPv6 DSCP (IETF RFC 2597)
> + *
> + * IETF RFC 2597 maps the traffic class and the drop priority to the IPv4/IPv6
> + * Differentiated Services Codepoint (DSCP) field (6 bits). Here are the DSCP
> + * values proposed by this RFC:
> + *
> + *                       Class 1    Class 2    Class 3    Class 4
> + *                     +----------+----------+----------+----------+
> + *    Low Drop Prec    |  001010  |  010010  |  011010  |  100010  |
> + *    Medium Drop Prec |  001100  |  010100  |  011100  |  100100  |
> + *    High Drop Prec   |  001110  |  010110  |  011110  |  100110  |
> + *                     +----------+----------+----------+----------+
> + *
> + * There are 4 traffic classes (classes 1 .. 4) encoded by DSCP bits 0 to 2, as
> + * well as 3 drop priorities (low/medium/high) encoded by DSCP bits 3 and 4,
> + * with the DSCP bits numbered from 0 (MSB) to 5 (LSB).
> + *
> + * All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as
> + * follows: green mapped to Low Drop Precedence (2'b01), yellow to Medium
> + * (2'b10) and red to High (2'b11). Marking needs to be explicitly enabled
> + * for each color; when not enabled for a given color, the DSCP field of all
> + * packets with that color is left as is.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mark_green
> + *   Set to non-zero value to enable marking of green packets and to zero to
> + *   disable it.
> + * @param mark_yellow
> + *   Set to non-zero value to enable marking of yellow packets and to zero to
> + *   disable it.
> + * @param mark_red
> + *   Set to non-zero value to enable marking of red packets and to zero to
> + *   disable it.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_mark_ip_dscp(uint8_t port_id,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +
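The RFC 2597 marking rewrites only the drop-precedence bits of the DSCP. It keeps the class bits and the ECN field. Counting DSCP bits from the MSB as bit 0, those are bits 3 and 4, i.e. mask 0x06 within the 6-bit DSCP value. In this sketch, `enum color`, `ip_dscp_mark()` and the `mark[]` array are hypothetical. The input is the raw TOS / Traffic Class byte, with the DSCP in its upper 6 bits.

```c
#include <stdint.h>

enum color { COLOR_GREEN = 0, COLOR_YELLOW, COLOR_RED };

/* Drop precedence bits within the 6-bit DSCP value (DSCP bits 3 and 4). */
#define DSCP_DROP_PREC_MASK 0x06u

/* Color -> drop precedence (2'b01, 2'b10, 2'b11), already shifted into place. */
static const uint8_t drop_prec_of_color[3] = { 0x02, 0x04, 0x06 };

/*
 * When marking is enabled for the color, replace the drop precedence bits
 * of the DSCP; the class bits and the 2-bit ECN field are preserved.
 */
uint8_t ip_dscp_mark(uint8_t tos, enum color c, const int mark[3])
{
	unsigned int dscp = tos >> 2;

	if (!mark[c])
		return tos;
	dscp = (dscp & ~DSCP_DROP_PREC_MASK) | drop_prec_of_color[c];
	return (uint8_t)((dscp << 2) | (tos & 0x03u));
}
```

For example, AF11 (001010) marked red becomes AF13 (001110), and AF41 (100010) marked yellow becomes AF42 (100100), matching the table above.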
> +/**
> + * Scheduler get statistics counter types enabled for all nodes
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param nonleaf_node_capability_stats_mask
> + *   Statistics counter types available per node for all non-leaf nodes. Needs
> + *   to be pre-allocated.
> + * @param nonleaf_node_enabled_stats_mask
> + *   Statistics counter types currently enabled per node for each non-leaf node.
> + *   This is a subset of *nonleaf_node_capability_stats_mask*. Needs to be
> + *   pre-allocated.
> + * @param leaf_node_capability_stats_mask
> + *   Statistics counter types available per node for all leaf nodes. Needs to
> + *   be pre-allocated.
> + * @param leaf_node_enabled_stats_mask
> + *   Statistics counter types currently enabled for each leaf node. This is
> + *   a subset of *leaf_node_capability_stats_mask*. Needs to be pre-allocated.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_stats_get_enabled(uint8_t port_id,
> +	uint64_t *nonleaf_node_capability_stats_mask,
> +	uint64_t *nonleaf_node_enabled_stats_mask,
> +	uint64_t *leaf_node_capability_stats_mask,
> +	uint64_t *leaf_node_enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler enable selected statistics counters for all nodes
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param nonleaf_node_enabled_stats_mask
> + *   Statistics counter types to be enabled per node for each non-leaf node.
> + *   This needs to be a subset of the statistics counter types available per
> + *   node for all non-leaf nodes. Any statistics counter type not included in
> + *   this set is to be disabled for all non-leaf nodes.
> + * @param leaf_node_enabled_stats_mask
> + *   Statistics counter types to be enabled per node for each leaf node. This
> + *   needs to be a subset of the statistics counter types available per node for
> + *   all leaf nodes. Any statistics counter type not included in this set is to
> + *   be disabled for all leaf nodes.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_stats_enable(uint8_t port_id,
> +	uint64_t nonleaf_node_enabled_stats_mask,
> +	uint64_t leaf_node_enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +
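The subset requirement and the "anything not in the mask gets disabled" rule reduce to two bitwise checks. The counter-type bits below (`STAT_N_PKTS`, `STAT_N_BYTES`, `STAT_N_PKTS_DROPPED`) are hypothetical placeholders, not the header's actual definitions. `stats_mask_is_subset()` and `stats_mask_disabled_by()` are illustrative helpers.

```c
#include <stdint.h>

/* Hypothetical counter-type bits, for illustration only. */
#define STAT_N_PKTS         (UINT64_C(1) << 0)
#define STAT_N_BYTES        (UINT64_C(1) << 1)
#define STAT_N_PKTS_DROPPED (UINT64_C(1) << 2)

/* Non-zero when every bit in enabled_mask is also set in capability_mask. */
int stats_mask_is_subset(uint64_t capability_mask, uint64_t enabled_mask)
{
	return (enabled_mask & ~capability_mask) == 0;
}

/* Counter types that replacing old_enabled with new_enabled switches off. */
uint64_t stats_mask_disabled_by(uint64_t old_enabled, uint64_t new_enabled)
{
	return old_enabled & ~new_enabled;
}
```

Because the enable call replaces the whole set, an application that wants to add one counter type would read the current mask first and OR the new bit into it.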
> +/**
> + * Scheduler get statistics counter types enabled for current node
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param capability_stats_mask
> + *   Statistics counter types available for the current node. Needs to be
> + *   pre-allocated.
> + * @param enabled_stats_mask
> + *   Statistics counter types currently enabled for the current node. This is
> + *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_stats_get_enabled(uint8_t port_id,
> +	uint32_t node_id,
> +	uint64_t *capability_stats_mask,
> +	uint64_t *enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler enable selected statistics counters for current node
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param enabled_stats_mask
> + *   Statistics counter types to be enabled for the current node. This needs to
> + *   be a subset of the statistics counter types available for the current node.
> + *   Any statistics counter type not included in this set is to be disabled for
> + *   the current node.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_stats_enable(uint8_t port_id,
> +	uint32_t node_id,
> +	uint64_t enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node statistics counters read
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param stats
> + *   When non-NULL, it is filled in with the current values of the statistics
> + *   counters enabled for the current node.
> + * @param clear
> + *   When this parameter has a non-zero value, the statistics counters are
> + *   cleared (i.e. set to zero) immediately after they have been read, otherwise
> + *   the statistics counters are left untouched.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_stats_read(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_stats *stats,
> +	int clear,
> +	struct rte_scheddev_error *error);
> +
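The read-then-optionally-clear contract of the node stats read call can be modelled in a single-threaded way as below. `struct node_stats` and `node_stats_read()` are hypothetical, since the actual `rte_scheddev_node_stats` layout is not part of this excerpt. Whether a real driver performs the read and the clear atomically is not specified here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical counter set; not the real rte_scheddev_node_stats layout. */
struct node_stats {
	uint64_t n_pkts;
	uint64_t n_bytes;
};

/*
 * Copy the counters out when stats is non-NULL, then zero them when clear
 * is non-zero. Passing stats == NULL with clear != 0 acts as a pure reset.
 */
int node_stats_read(struct node_stats *counters, struct node_stats *stats,
	int clear)
{
	if (stats != NULL)
		*stats = *counters;
	if (clear)
		memset(counters, 0, sizeof(*counters));
	return 0;
}
```

Polling with *clear* set yields per-interval deltas directly. With *clear* at zero, the application keeps the previous sample and subtracts.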
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* __INCLUDE_RTE_SCHEDDEV_H__ */
> diff --git a/lib/librte_ether/rte_scheddev_driver.h b/lib/librte_ether/rte_scheddev_driver.h
> new file mode 100644
> index 0000000..c0a0321
> --- /dev/null
> +++ b/lib/librte_ether/rte_scheddev_driver.h
> @@ -0,0 +1,374 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef __INCLUDE_RTE_SCHEDDEV_DRIVER_H__
> +#define __INCLUDE_RTE_SCHEDDEV_DRIVER_H__
> +
> +/**
> + * @file
> + * RTE Generic Hierarchical Scheduler API (Driver Side)
> + *
> + * This file provides implementation helpers for internal use by PMDs; they
> + * are not intended to be exposed to applications and are not subject to ABI
> + * versioning.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_errno.h>
> +#include "rte_ethdev.h"
> +#include "rte_scheddev.h"
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +typedef int (*rte_scheddev_capabilities_get_t)(struct rte_eth_dev *dev,
> +	struct rte_scheddev_capabilities *cap,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler capabilities get */
> +
> +typedef int (*rte_scheddev_node_capabilities_get_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_capabilities *cap,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node capabilities get */
> +
> +typedef int (*rte_scheddev_wred_profile_add_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_wred_params *profile,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler WRED profile add */
> +
> +typedef int (*rte_scheddev_wred_profile_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler WRED profile delete */
> +
> +typedef int (*rte_scheddev_shared_wred_context_add_update_t)(
> +	struct rte_eth_dev *dev,
> +	uint32_t shared_wred_context_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shared WRED context add/update */
> +
> +typedef int (*rte_scheddev_shared_wred_context_delete_t)(
> +	struct rte_eth_dev *dev,
> +	uint32_t shared_wred_context_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shared WRED context delete */
> +
> +typedef int (*rte_scheddev_shaper_profile_add_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_shaper_params *profile,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shaper profile add */
> +
> +typedef int (*rte_scheddev_shaper_profile_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shaper profile delete */
> +
> +typedef int (*rte_scheddev_shared_shaper_add_update_t)(struct rte_eth_dev *dev,
> +	uint32_t shared_shaper_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shared shaper add/update */
> +
> +typedef int (*rte_scheddev_shared_shaper_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t shared_shaper_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler shared shaper delete */
> +
> +typedef int (*rte_scheddev_node_add_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_node_params *params,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node add */
> +
> +typedef int (*rte_scheddev_node_delete_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node delete */
> +
> +typedef int (*rte_scheddev_node_suspend_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node suspend */
> +
> +typedef int (*rte_scheddev_node_resume_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node resume */
> +
> +typedef int (*rte_scheddev_hierarchy_set_t)(struct rte_eth_dev *dev,
> +	int clear_on_fail,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler hierarchy set */
> +
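The usual way a generic layer would consume these typedefs is to forward each call to the PMD callback and report "not supported" when the callback is NULL. The sketch below shows that for hierarchy set. All types here are hypothetical, trimmed-down stand-ins (`struct sched_ops`, `struct sched_error`, `sched_hierarchy_set()`, `example_pmd_hierarchy_set()`), not the actual ethdev structures. Returning -ENOTSUP is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the driver-side types. */
struct sched_error { int type; const char *message; };
struct eth_dev;  /* opaque in this sketch */

typedef int (*hierarchy_set_t)(struct eth_dev *dev, int clear_on_fail,
	struct sched_error *error);

struct sched_ops {
	hierarchy_set_t hierarchy_set;  /* NULL when the PMD lacks the feature */
};

/* Example PMD callback that accepts any hierarchy. */
int example_pmd_hierarchy_set(struct eth_dev *dev, int clear_on_fail,
	struct sched_error *error)
{
	(void)dev;
	(void)clear_on_fail;
	(void)error;
	return 0;
}

/* Generic layer: forward to the PMD callback, or report "not supported". */
int sched_hierarchy_set(struct eth_dev *dev, const struct sched_ops *ops,
	int clear_on_fail, struct sched_error *error)
{
	if (ops == NULL || ops->hierarchy_set == NULL) {
		if (error != NULL) {
			error->type = -1;
			error->message = "operation not supported";
		}
		return -ENOTSUP;
	}
	return ops->hierarchy_set(dev, clear_on_fail, error);
}
```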
> +typedef int (*rte_scheddev_node_parent_update_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node parent update */
> +
> +typedef int (*rte_scheddev_node_shaper_update_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t shaper_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node shaper update */
> +
> +typedef int (*rte_scheddev_node_shared_shaper_update_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t shared_shaper_id,
> +	int add,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node shared shaper update */
> +
> +typedef int (*rte_scheddev_node_scheduling_mode_update_t)(
> +	struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	int *scheduling_mode_per_priority,
> +	uint32_t n_priorities,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node scheduling mode update */
> +
> +typedef int (*rte_scheddev_node_cman_update_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	enum rte_scheddev_cman_mode cman,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node congestion management mode update */
> +
> +typedef int (*rte_scheddev_node_wred_context_update_t)(
> +	struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node WRED context update */
> +
> +typedef int (*rte_scheddev_node_shared_wred_context_update_t)(
> +	struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint32_t shared_wred_context_id,
> +	int add,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler node shared WRED context update */
> +
> +typedef int (*rte_scheddev_mark_vlan_dei_t)(struct rte_eth_dev *dev,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler packet marking - VLAN DEI */
> +
> +typedef int (*rte_scheddev_mark_ip_ecn_t)(struct rte_eth_dev *dev,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler packet marking - IPv4/IPv6 ECN */
> +
> +typedef int (*rte_scheddev_mark_ip_dscp_t)(struct rte_eth_dev *dev,
> +	int mark_green,
> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler packet marking - IPv4/IPv6 DSCP */
> +
> +typedef int (*rte_scheddev_stats_get_enabled_t)(struct rte_eth_dev *dev,
> +	uint64_t *nonleaf_node_capability_stats_mask,
> +	uint64_t *nonleaf_node_enabled_stats_mask,
> +	uint64_t *leaf_node_capability_stats_mask,
> +	uint64_t *leaf_node_enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler get set of stats counters enabled for all nodes */
> +
> +typedef int (*rte_scheddev_stats_enable_t)(struct rte_eth_dev *dev,
> +	uint64_t nonleaf_node_enabled_stats_mask,
> +	uint64_t leaf_node_enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler enable selected stats counters for all nodes */
> +
> +typedef int (*rte_scheddev_node_stats_get_enabled_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint64_t *capability_stats_mask,
> +	uint64_t *enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler get set of stats counters enabled for specific node */
> +
> +typedef int (*rte_scheddev_node_stats_enable_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	uint64_t enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler enable selected stats counters for specific node */
> +
> +typedef int (*rte_scheddev_node_stats_read_t)(struct rte_eth_dev *dev,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_stats *stats,
> +	int clear,
> +	struct rte_scheddev_error *error);
> +/**< @internal Scheduler read stats counters for specific node */
> +
> +struct rte_scheddev_ops {
> +	/** Scheduler capabilities_get */
> +	rte_scheddev_capabilities_get_t capabilities_get;
> +	/** Scheduler node capabilities get */
> +	rte_scheddev_node_capabilities_get_t node_capabilities_get;
> +
> +	/** Scheduler WRED profile add */
> +	rte_scheddev_wred_profile_add_t wred_profile_add;
> +	/** Scheduler WRED profile delete */
> +	rte_scheddev_wred_profile_delete_t wred_profile_delete;
> +	/** Scheduler shared WRED context add/update */
> +	rte_scheddev_shared_wred_context_add_update_t
> +		shared_wred_context_add_update;
> +	/** Scheduler shared WRED context delete */
> +	rte_scheddev_shared_wred_context_delete_t
> +		shared_wred_context_delete;
> +	/** Scheduler shaper profile add */
> +	rte_scheddev_shaper_profile_add_t shaper_profile_add;
> +	/** Scheduler shaper profile delete */
> +	rte_scheddev_shaper_profile_delete_t shaper_profile_delete;
> +	/** Scheduler shared shaper add/update */
> +	rte_scheddev_shared_shaper_add_update_t shared_shaper_add_update;
> +	/** Scheduler shared shaper delete */
> +	rte_scheddev_shared_shaper_delete_t shared_shaper_delete;
> +
> +	/** Scheduler node add */
> +	rte_scheddev_node_add_t node_add;
> +	/** Scheduler node delete */
> +	rte_scheddev_node_delete_t node_delete;
> +	/** Scheduler node suspend */
> +	rte_scheddev_node_suspend_t node_suspend;
> +	/** Scheduler node resume */
> +	rte_scheddev_node_resume_t node_resume;
> +	/** Scheduler hierarchy set */
> +	rte_scheddev_hierarchy_set_t hierarchy_set;
> +
> +	/** Scheduler node parent update */
> +	rte_scheddev_node_parent_update_t node_parent_update;
> +	/** Scheduler node shaper update */
> +	rte_scheddev_node_shaper_update_t node_shaper_update;
> +	/** Scheduler node shared shaper update */
> +	rte_scheddev_node_shared_shaper_update_t node_shared_shaper_update;
> +	/** Scheduler node scheduling mode update */
> +	rte_scheddev_node_scheduling_mode_update_t node_scheduling_mode_update;
> +	/** Scheduler node congestion management mode update */
> +	rte_scheddev_node_cman_update_t node_cman_update;
> +	/** Scheduler node WRED context update */
> +	rte_scheddev_node_wred_context_update_t node_wred_context_update;
> +	/** Scheduler node shared WRED context update */
> +	rte_scheddev_node_shared_wred_context_update_t
> +		node_shared_wred_context_update;
> +
> +	/** Scheduler packet marking - VLAN DEI */
> +	rte_scheddev_mark_vlan_dei_t mark_vlan_dei;
> +	/** Scheduler packet marking - IPv4/IPv6 ECN */
> +	rte_scheddev_mark_ip_ecn_t mark_ip_ecn;
> +	/** Scheduler packet marking - IPv4/IPv6 DSCP */
> +	rte_scheddev_mark_ip_dscp_t mark_ip_dscp;
> +
> +	/** Scheduler get statistics counter type enabled for all nodes */
> +	rte_scheddev_stats_get_enabled_t stats_get_enabled;
> +	/** Scheduler enable selected statistics counters for all nodes */
> +	rte_scheddev_stats_enable_t stats_enable;
> +	/** Scheduler get statistics counter type enabled for current node */
> +	rte_scheddev_node_stats_get_enabled_t node_stats_get_enabled;
> +	/** Scheduler enable selected statistics counters for current node */
> +	rte_scheddev_node_stats_enable_t node_stats_enable;
> +	/** Scheduler read statistics counters for current node */
> +	rte_scheddev_node_stats_read_t node_stats_read;
> +};
> +
> +/**
> + * Initialize generic error structure.
> + *
> + * This function also sets rte_errno to a given value.
> + *
> + * @param error
> + *   Pointer to error structure (may be NULL).
> + * @param code
> + *   Related error code (rte_errno).
> + * @param type
> + *   Cause field and error type.
> + * @param cause
> + *   Object responsible for the error.
> + * @param message
> + *   Human-readable error message.
> + *
> + * @return
> + *   Error code.
> + */
> +static inline int
> +rte_scheddev_error_set(struct rte_scheddev_error *error,
> +		   int code,
> +		   enum rte_scheddev_error_type type,
> +		   const void *cause,
> +		   const char *message)
> +{
> +	if (error) {
> +		*error = (struct rte_scheddev_error){
> +			.type = type,
> +			.cause = cause,
> +			.message = message,
> +		};
> +	}
> +	rte_errno = code;
> +	return code;
> +}
> +
> +/**
> + * Get generic hierarchical scheduler operations structure from a port
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param error
> + *   Error details
> + *
> + * @return
> + *   The hierarchical scheduler operations structure associated with port_id on
> + *   success, NULL otherwise.
> + */
> +const struct rte_scheddev_ops *
> +rte_scheddev_ops_get(uint8_t port_id, struct rte_scheddev_error *error);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* __INCLUDE_RTE_SCHEDDEV_DRIVER_H__ */
>
  
Cristian Dumitrescu Feb. 21, 2017, 1:44 p.m. UTC | #2
Hi Hemant,

> > +  * Scheduler node capabilities
> > +  */
> > +struct rte_scheddev_node_capabilities {

...<snip>

> > +		/**< Lowest priority supported. The value of 1 indicates that
> > +		 * only priority 0 is supported, which essentially means that
> > +		 * Strict Priority (SP) algorithm is not supported.
> > +		 */
> > +		uint32_t sp_priority_min;
> > +
> This can be  simply sp_priority_level, with 0 indicating no support
> 1 indicates '0' and '1' priority.  or 7 indicates '0' to '7' i.e. total
> 8 priorities.


Yes, will pick a better name, as you suggested. 

>
> > +		/**< Maximum number of sibling nodes that can have the same
> > +		 * priority at any given time. When equal to *n_children_max*,
> > +		 * it indicates that WFQ/WRR algorithms are not supported.
> > +		 */
> > +		uint32_t sp_n_children_max;
> not clear to me.
> OK, more than 1 children can have same priority, than you apply WRR/WFQ
> among them.
>
> However, there can be different sets,  e.g prio '0' and '1' has only 1
> children. while prio '2' has 6 children, than you apply WRR/WFQ among them.
>


Yes, the parameter description seems wrong to me as well. The correct statement should
be: "When equal to 1, it indicates that WFQ/WRR algorithms are not supported", right?
I will fix the description.

Also, for the sake of clarity, it makes sense to mention in the
description that the sp_n_children_max value range is 1 .. n_children_max.
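To make the corrected semantics concrete, here is a minimal C sketch that checks a proposed set of child priorities against the non-leaf capability fields discussed above. It assumes the renamed priority field counts priority levels (so valid priorities are 0 .. levels - 1, with 0 the highest) and that sp_n_children_max ranges over 1 .. n_children_max; the struct and function names are hypothetical and not part of the proposed API.

```c
#include <stdint.h>

/* Hypothetical subset of the non-leaf capability fields quoted above. */
struct nonleaf_sp_caps {
	uint32_t n_children_max;    /* max children of this node */
	uint32_t n_priority_levels; /* assumed meaning of the renamed field */
	uint32_t sp_n_children_max; /* 1 .. n_children_max */
};

/* With the corrected description, a value of 1 means no two siblings can
 * share a priority, so WFQ/WRR among siblings is not available. */
static int wfq_wrr_possible(const struct nonleaf_sp_caps *cap)
{
	return cap->sp_n_children_max > 1;
}

/* Return 0 if the children's priorities fit the capabilities, -1 if not. */
static int check_child_priorities(const struct nonleaf_sp_caps *cap,
				  const uint32_t *prio, uint32_t n_children)
{
	uint32_t p, i, count;

	if (n_children > cap->n_children_max)
		return -1;
	for (i = 0; i < n_children; i++)
		if (prio[i] >= cap->n_priority_levels)
			return -1; /* priority level not supported */
	for (p = 0; p < cap->n_priority_levels; p++) {
		count = 0;
		for (i = 0; i < n_children; i++)
			if (prio[i] == p)
				count++;
		if (count > cap->sp_n_children_max)
			return -1; /* too many siblings share priority p */
	}
	return 0;
}
```

Under these assumptions, Hemant's example (priorities 0 and 1 with one child each, priority 2 with six children) is accepted only when sp_n_children_max is at least 6.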

> > +
> > +		/**< WFQ algorithm support. */
> > +		int scheduling_wfq_supported;
> > +
> > +		/**< WRR algorithm support. */
> > +		int scheduling_wrr_supported;
> > +
> > +		/**< Maximum WFQ/WRR weight. */
> > +		uint32_t scheduling_wfq_wrr_weight_max;
> > +	} nonleaf;
> > +
> > +	/**< Items valid only for leaf nodes. */
> > +	struct {
> > +		/**< Head drop algorithm support. */
> > +		int cman_head_drop_supported;
> > +
> > +		/**< Private WRED context support. */
> > +		int cman_wred_context_private_supported;
> > +
>
> The context part is not clear to me.
>


The WRED context is the WRED object/instance initialized based on a WRED profile.

A leaf node may or may not support a WRED context that it owns (private WRED context),
which is what this parameter refers to, and zero or several shared WRED contexts (shared
with other nodes), which is what the parameter cman_wred_context_shared_n_max below
refers to. Makes sense?

> > +		/**< Maximum number of shared WRED contexts supported. The value
> > +		 * of zero indicates that shared WRED contexts are not
> > +		 * supported.
> > +		 */
> > +		uint32_t cman_wred_context_shared_n_max;
> > +	} leaf;
>
> non-leaf nodes may have different capabilities.
>
> your leaf node is like a QoS Queue, are you supporting shapper on leaf
> node as well?
>


Yes, we do support shapers for leaf nodes as well. Probably a misunderstanding:
if you look closer at struct rte_scheddev_node_capabilities, the shaper-related
parameters are in the common part of the structure, not in the inner non-leaf/leaf
specific parameter area, agree?

>
> I will still prefer if you separate QoS Queue from a standard Sched
> node, the capabilities are different and it will be cleaner at the cost
> of increased structure and number of APIs.
>


I seriously considered this suggestion (which you also made in your previous email); I even went down
this path, only to revert eventually, as my conclusion was that it is not the best approach. What I
found is that this tends to create a lot of confusion about which API functions and data structures
apply to: (a) both leaf and non-leaf nodes, (b) leaf nodes only, (c) non-leaf nodes only.

Approach 1: Have separate APIs for leaf nodes and non-leaf nodes. We now effectively have two different
API objects: leaf_node and nonleaf_node.
My conclusions against it:
- We end up duplicating 90% of the API;
- We also need to document and enforce clear rules for the interaction between leaf nodes and
non-leaf nodes, which adds a lot of complexity, as it doubles the number of variables the user needs
to consider.

Approach 2: Have most of the API common to leaf nodes and non-leaf nodes (for a common object called node),
but have some API applicable only to leaf nodes (named using leaf_node). This still creates the impression
that there are two different API objects (leaf_node and nonleaf_node), which is false in this case.
My conclusions against it:
- The user is likely to look for leaf-node specific API for the most basic operations, and will not find it;
- Even if the user is aware that the "node" API applies to both leaf nodes and non-leaf nodes, it still
requires documenting and enforcing clear rules for the interaction between leaf nodes and non-leaf nodes,
which increases the complexity.

Approach 3: Have all API refer to a single object consistently called node.
This is the approach that I ended up picking:
- It really simplifies the API and adds a lot of clarity;
- Differences between leaf and non-leaf nodes are clearly enforced in the API data structures using unions
or structs explicitly called leaf and nonleaf;
- The only functions applicable to leaf nodes only / non-leaf nodes only are the node_xyz_update()
functions, called to update node parameter xyz after the node_add() operation. It is easy to tell which type
of node they apply to, as the parameter itself leaves no room for confusion.

One more note: it is straightforward to determine whether an existing node is a leaf or not, as the API predefines
leaf node IDs as 0 ... (N_TXQs - 1), which cannot be used for non-leaf nodes. This was done as a result of some of
your previous suggestions. I will actually add a new API (inline) function rte_scheddev_node_is_leaf() to further
clarify and enforce this rule.
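The leaf-ID convention makes the proposed rte_scheddev_node_is_leaf() a one-line test. The sketch below is hypothetical: the real function would look up the number of TX queues configured on the port, whereas here it is passed in explicitly, and the function name is made up for illustration. Only RTE_SCHEDDEV_ROOT_NODE_ID is taken from the patch.

```c
#include <stdint.h>

/* Value as defined in the patch under review. */
#define RTE_SCHEDDEV_ROOT_NODE_ID UINT32_MAX

/* Hypothetical sketch: leaf node IDs are reserved as 0 .. (n_tx_queues - 1),
 * so any ID at or above the TX queue count (including the root parent ID)
 * denotes a non-leaf node. */
static inline int
scheddev_node_is_leaf_sketch(uint32_t node_id, uint16_t n_tx_queues)
{
	return node_id < n_tx_queues;
}
```

Because the check depends only on the node ID, it also explains why node_add() does not need an explicit is_leaf flag in its parameter structure.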

> > +};
> > +
> > +/**
> > +  * Scheduler capabilities
> > +  */
> > +struct rte_scheddev_capabilities {
...<snip>
> > +	/**< Summary of node-level capabilities across all nodes. */
> > +	struct rte_scheddev_node_capabilities node;
>
> This should be array of numbers of levels supported in the system.
> Non-leaf node at level 2 can have different capabilities than level 3 node.
>


My initial thinking: To get the per-level capabilities, simply call the node-level
capability API for the first node (or any node) on that level. But now I see your point,
as this approach can be used only after the nodes have been added (as the ID of an existing node is required).

On the other hand, some implementations can be very flexible: have a pool of nodes
that can be added in any topology to create an arbitrary number of hierarchy levels
(i.e. distance from leaf to root), so the level notion is not that relevant anymore.

Therefore, I propose we add a new API function for level-specific capability queries (item 2 below), so
we end up with the following set of capability-related API functions:
1. rte_scheddev_capabilities_get(sched_params) = summary of capabilities for entire hierarchy
2. rte_scheddev_level_capabilities_get(level_id, node_params) = summary of node capabilities for all nodes on same level
3. rte_scheddev_node_capabilities_get(node_id, node_params) = summary of node capabilities for an existing node

Any concerns?

...<snip>

> > +  * Each leaf node sits on on top of a TX queue of the current Ethernet port.
> > +  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
> > +  * where N is the number of TX queues configured for the current Ethernet port.
> > +  * The non-leaf nodes have their IDs generated by the application.
> > +  */
>
>
> Ok, that means 0 to N-1 is reserved for leaf nodes. the application will
> choose any value for non-leaf nodes?
> What will be the parent node id for the root node?
>


Yes!

For the parent node ID, see the definition at the top of the file:
	/**< Scheduler hierarchy root node ID */
	#define RTE_SCHEDDEV_ROOT_NODE_ID  UINT32_MAX
I will add a comment to the node_add() function as well to further clarify.

> > +struct rte_scheddev_node_params {
> > +	/**< Shaper profile for the private shaper. The absence of the private
> > +	 * shaper for the current node is indicated by setting this parameter
> > +	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
> > +	 */
> > +	uint32_t shaper_profile_id;
> > +
> > +	/**< User allocated array of valid shared shaper IDs. */
> > +	uint32_t *shared_shaper_id;
> > +
> > +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> > +	uint32_t n_shared_shapers;
> > +
> > +	union {
> > +		/**< Parameters only valid for non-leaf nodes. */
> > +		struct {
> > +			/**< For each priority, indicates whether the children
> > +			 * nodes sharing the same priority are to be scheduled
> > +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> > +			 * is to be used for all priorities. When non-NULL, it
> > +			 * points to a pre-allocated array of *n_priority*
> > +			 * elements, with a non-zero value element indicating
> > +			 * WFQ and a zero value element for WRR.
> > +			 */
> > +			int *scheduling_mode_per_priority;
>
> what is the structure of the pointer element? Just a bool array?
>


Yes, we choose between WFQ and WRR for each group of children with the same priority.
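A minimal sketch of how an implementation might read scheduling_mode_per_priority, following the field description quoted above: a NULL array means WFQ for every priority, otherwise a non-zero element selects WFQ and a zero element selects WRR. The enum and function names are hypothetical, and rejecting an out-of-range priority is an assumption of this sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names used only for this illustration. */
enum sched_mode {
	SCHED_MODE_INVALID = -1,
	SCHED_MODE_WRR = 0,
	SCHED_MODE_WFQ = 1,
};

/* Scheduling mode for the group of children sharing `priority`. */
static enum sched_mode
mode_for_priority(const int *scheduling_mode_per_priority,
		  uint32_t n_priorities, uint32_t priority)
{
	if (priority >= n_priorities)
		return SCHED_MODE_INVALID; /* no such priority group */
	if (scheduling_mode_per_priority == NULL)
		return SCHED_MODE_WFQ; /* NULL: WFQ for all priorities */
	return scheduling_mode_per_priority[priority] ? SCHED_MODE_WFQ
						      : SCHED_MODE_WRR;
}
```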

> > +
> > +			/**< Number of priorities. */
> > +			uint32_t n_priorities;
> > +		} nonleaf;
> > +
> > +		/**< Parameters only valid for leaf nodes. */
> > +		struct {
> > +			/**< Congestion management mode */
> > +			enum rte_scheddev_cman_mode cman;
> > +
> > +			/**< WRED parameters (valid when *cman* is WRED). */
> > +			struct {
> > +				/**< WRED profile for private WRED context. */
> > +				uint32_t wred_profile_id;
> > +
> > +				/**< User allocated array of shared WRED context
> > +				 * IDs. The absence of a private WRED context
> > +				 * for current leaf node is indicated by value
> > +				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
> > +				 */
> > +				uint32_t *shared_wred_context_id;
> > +
> > +				/**< Number of shared WRED context IDs in the
> > +				 * *shared_wred_context_id* array.
> > +				 */
> > +				uint32_t n_shared_wred_contexts;
> > +			} wred;
> > +		} leaf;
>
> need a bool is_leaf here to differentiate between leaf and non-leaf node.


This data structure is only used by the node_add() function. Whether the current
node is a leaf or a non-leaf node is decided by the node_id parameter of the
node_add() function (as per the leaf node ID convention enforced by the API,
discussed above), so IMO adding is_leaf to the parameter structure is
redundant.

> > +/**
> > + * Scheduler node capabilities get
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID. Needs to be valid.
> > + * @param cap
> > + *   Scheduler node capabilities. Needs to be pre-allocated and valid.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_capabilities_get(uint8_t port_id,
> > +	uint32_t node_id,
> > +	struct rte_scheddev_node_capabilities *cap,
> > +	struct rte_scheddev_error *error);
> > +
>
> Node capabilities is already part of scheddev_capabilities?
>
> What are you expecting different here. Unless you support different
> capability for each level, this may not be useful.
>


Yes, you're right, I will merge stats capability query into the capability API.

> > +/**
> > + * Scheduler node add
> > + *
> > + * When *node_id* is not a valid node ID, a new node with this ID is created and
> > + * connected as child to the existing node identified by *parent_node_id*.
> > + *
> > + * When *node_id* is a valid node ID, this node is disconnected from its current
> > + * parent and connected as child to another existing node identified by
> > + * *parent_node_id *.
> > + *
> > + * This function can be called during port initialization phase (before the
> > + * Ethernet port is started) for building the scheduler start-up hierarchy.
> > + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> > + * hierarchy updates, this function can also be called during run-time (after
> > + * the Ethernet port is started).
>
> This should  a capability, whether dynamic_hierarchy_updates are
> supported or not.
>


Agreed, will add.

> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID
> > + * @param parent_node_id
> > + *   Parent node ID. Needs to be the valid.
>
> What will be the parent node id for the root node?  how the root node is
> created on the ethernet port?
>


The first node added to the hierarchy should have its parent set to
RTE_SCHEDDEV_ROOT_NODE_ID and becomes the _real_root_ node; all
the other nodes should be added as children of the _real_root_ node or its
descendants.

Will add a comment to the node_add() function to further clarify this.
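The root rule described above can be modelled with a toy in-memory hierarchy. This is a sketch under stated assumptions, not the driver data model: the toy_* names and the fixed table size are hypothetical, duplicate-ID checks are omitted, and only RTE_SCHEDDEV_ROOT_NODE_ID comes from the patch. The first node must name RTE_SCHEDDEV_ROOT_NODE_ID as its parent; any later node must hang off a node that already exists, which also rules out a second root.

```c
#include <stdint.h>
#include <errno.h>

/* Value as defined in the patch under review. */
#define RTE_SCHEDDEV_ROOT_NODE_ID UINT32_MAX
#define MAX_NODES 16 /* hypothetical table size for this sketch */

/* Toy hierarchy used only to illustrate the rule. */
struct toy_hierarchy {
	uint32_t id[MAX_NODES];
	uint32_t parent[MAX_NODES];
	uint32_t n;
};

static int toy_node_exists(const struct toy_hierarchy *h, uint32_t id)
{
	for (uint32_t i = 0; i < h->n; i++)
		if (h->id[i] == id)
			return 1;
	return 0;
}

/* Add a node, enforcing the "first node is the real root" rule. */
static int toy_node_add(struct toy_hierarchy *h, uint32_t node_id,
			uint32_t parent_node_id)
{
	if (h->n == MAX_NODES)
		return -ENOSPC;
	if (node_id == RTE_SCHEDDEV_ROOT_NODE_ID)
		return -EINVAL; /* reserved as "no parent" marker */
	if (h->n == 0) {
		if (parent_node_id != RTE_SCHEDDEV_ROOT_NODE_ID)
			return -EINVAL; /* first node must be the root */
	} else if (!toy_node_exists(h, parent_node_id)) {
		return -EINVAL; /* parent must already exist */
	}
	h->id[h->n] = node_id;
	h->parent[h->n] = parent_node_id;
	h->n++;
	return 0;
}
```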

> > + * @param priority
> > + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> > + *   running on the parent of the current node for scheduling this child node.
> > + * @param weight
> > + *   Node weight. The node weight is relative to the weight sum of all siblings
> > + *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
> > + *   algorithm running on the parent of the current node for scheduling this
> > + *   child node.
> > + * @param params
> > + *   Node parameters. Needs to be pre-allocated and valid.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_add(uint8_t port_id,
> > +	uint32_t node_id,
> > +	uint32_t parent_node_id,
> > +	uint32_t priority,
> > +	uint32_t weight,
> > +	struct rte_scheddev_node_params *params,
> > +	struct rte_scheddev_error *error);
> > +


...<snip>
> > +/**
> > + * Scheduler node parent update
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID. Needs to be valid.
> > + * @param parent_node_id
> > + *   Node ID for the new parent. Needs to be valid.
> > + * @param priority
> > + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> > + *   running on the parent of the current node for scheduling this child node.
> > + * @param weight
> > + *   Node weight. The node weight is relative to the weight sum of all siblings
> > + *   that have the same priority. The lowest weight is zero. Used by the WFQ/WRR
> > + *   algorithm running on the parent of the current node for scheduling this
> > + *   child node.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_parent_update(uint8_t port_id,
> > +	uint32_t node_id,
> > +	uint32_t parent_node_id,
> > +	uint32_t priority,
> > +	uint32_t weight,
> > +	struct rte_scheddev_error *error);
> > +
>
> The usages are not clear. How it is different from node_add API.
> is the intention to update a specific node or change the connection of a
> specific node to a existing or new parent.
>


Yes.

The node_add() API function should be called only to create nodes that do
not exist yet. Basically, the provided node_id should not be in use when
node_add() is called. Will update the function description (sorry, I thought I
had already documented this, but it looks like it slipped through).

All the node_xyz_update() API functions should be called only on nodes
that already exist, typically for run-time updates.

IMO this convention provides a clear way to differentiate between the creation
and post-creation update mechanisms.
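The create-vs-update convention can be illustrated with a toy node table: creation fails if the ID is already in use, while a parent update fails if either the node or its new parent does not exist. Everything here is hypothetical (toy_* names, fixed table size, the choice of EEXIST/ENOENT as return codes); parent validation on add and cycle detection on update are omitted for brevity.

```c
#include <stdint.h>
#include <stddef.h>
#include <errno.h>

#define MAX_NODES 16 /* hypothetical table size for this sketch */

struct toy_node { uint32_t id, parent, priority, weight; };
struct toy_tree { struct toy_node node[MAX_NODES]; uint32_t n; };

static struct toy_node *toy_find(struct toy_tree *t, uint32_t id)
{
	for (uint32_t i = 0; i < t->n; i++)
		if (t->node[i].id == id)
			return &t->node[i];
	return NULL;
}

/* Creation: the node ID must not be in use yet. */
static int toy_node_add(struct toy_tree *t, uint32_t id, uint32_t parent,
			uint32_t priority, uint32_t weight)
{
	if (toy_find(t, id) != NULL)
		return -EEXIST;
	if (t->n == MAX_NODES)
		return -ENOSPC;
	t->node[t->n++] = (struct toy_node){ id, parent, priority, weight };
	return 0;
}

/* Post-creation update: the node and its new parent must already exist. */
static int toy_node_parent_update(struct toy_tree *t, uint32_t id,
				  uint32_t parent, uint32_t priority,
				  uint32_t weight)
{
	struct toy_node *node = toy_find(t, id);

	if (node == NULL || toy_find(t, parent) == NULL)
		return -ENOENT;
	node->parent = parent;
	node->priority = priority;
	node->weight = weight;
	return 0;
}
```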

...<snip>

Regards,
Cristian
  
Jerin Jacob March 2, 2017, 11:47 a.m. UTC | #3
On Fri, Feb 10, 2017 at 02:05:50PM +0000, Cristian Dumitrescu wrote:
> - SW fallback based on librte_sched library (to be later introduced by
>   standalone patch set)
> 
> [1] RFC: http://dpdk.org/ml/archives/dev/2016-November/050956.html
> [2] Jerin’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054484.html
> [3] Hemants’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054866.html
> 
> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
> ---
>  MAINTAINERS                            |    4 +
> +	rte_scheddev_node_parent_update;
> +	rte_scheddev_node_shaper_update;
> +	rte_scheddev_node_shared_shaper_update;
> +	rte_scheddev_node_scheduling_mode_update;
> +	rte_scheddev_node_cman_update;


Since the scope is beyond the scheduler (i.e. CMAN, marking and shaping),
should we call it traffic manager or something similar? How about tmdev
instead of scheddev? No strong opinion here, but I think it is worth
considering another name for this (crypto and eventdev have schedulers too).


> +	rte_scheddev_node_wred_context_update;
> +	rte_scheddev_node_shared_wred_context_update;
> +	rte_scheddev_mark_vlan_dei;
> +	rte_scheddev_mark_ip_ecn;
> +	rte_scheddev_mark_ip_dscp;
> +	rte_scheddev_stats_get_enabled;
> +	rte_scheddev_stats_enable;
> +	rte_scheddev_node_stats_get_enabled;
> +	rte_scheddev_node_stats_enable;
> +	rte_scheddev_node_stats_read;
>  
>  } DPDK_17.02;
> +int rte_scheddev_wred_profile_add(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_wred_params *profile,
> +	struct rte_scheddev_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_scheddev_ops *ops =
> +		rte_scheddev_ops_get(port_id, error);
> +
> +	if (ops == NULL)
> +		return -rte_errno;
> +
> +	if (ops->wred_profile_add == NULL)
> +		return -rte_scheddev_error_set(error,
> +			ENOSYS,
> +			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
> +			NULL,
> +			rte_strerror(ENOSYS));

IMO, the above piece of code is duplicated in all the functions; it may
be a candidate for a macro or an inline function.
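One way to factor out the repeated check is a macro that returns from the calling function, similar in spirit to the RTE_FUNC_PTR_OR_ERR_RET() macro that ethdev already uses. The sketch below is self-contained, so it replaces the patch's types with minimal stand-ins; the scheddev_* names, SCHEDDEV_OP_OR_FAIL and last_errno (standing in for rte_errno) are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins for the patch's types so the sketch compiles alone. */
struct scheddev_error { int type; const char *message; };
struct scheddev_ops { int (*wred_profile_add)(unsigned int profile_id); };

static int last_errno; /* stand-in for rte_errno */

static inline int
scheddev_error_set(struct scheddev_error *error, int code, const char *msg)
{
	if (error != NULL)
		*error = (struct scheddev_error){ .type = 0, .message = msg };
	last_errno = code;
	return code;
}

/* Return from the caller if the ops table is missing (the ops lookup is
 * assumed to have set last_errno already) or the op is not implemented. */
#define SCHEDDEV_OP_OR_FAIL(ops, name, error)				\
	do {								\
		if ((ops) == NULL)					\
			return -last_errno;				\
		if ((ops)->name == NULL)				\
			return -scheddev_error_set((error), ENOSYS,	\
				"function not supported");		\
	} while (0)

static int
scheddev_wred_profile_add(const struct scheddev_ops *ops, unsigned int id,
			  struct scheddev_error *error)
{
	SCHEDDEV_OP_OR_FAIL(ops, wred_profile_add, error);
	return ops->wred_profile_add(id);
}
```

Each public wrapper then shrinks to the macro invocation plus the driver call; a macro is used rather than an inline function because the check has to name a different ops member in each wrapper.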

> + * This interface provides the ability to configure the hierarchical scheduler
> + * feature in a generic way.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_red.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/** Ethernet framing overhead
> +  *
> +  * Overhead fields per Ethernet frame:
> +  * 1. Preamble:                                            7 bytes;
> +  * 2. Start of Frame Delimiter (SFD):                      1 byte;
> +  * 3. Inter-Frame Gap (IFG):                              12 bytes.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD                  20
> +
> +/**
> +  * Ethernet framing overhead plus Frame Check Sequence (FCS). Useful when FCS
> +  * is generated and added at the end of the Ethernet frame on TX side without
> +  * any SW intervention.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS              24
> +
> +/**< Invalid WRED profile ID */
> +#define RTE_SCHEDDEV_WRED_PROFILE_ID_NONE                  UINT32_MAX
> +
> +/**< Invalid shaper profile ID */
> +#define RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE                UINT32_MAX
> +
> +/**< Scheduler hierarchy root node ID */
> +#define RTE_SCHEDDEV_ROOT_NODE_ID                          UINT32_MAX
> +
> +
> +/**
> +  * Scheduler node capabilities
> +  */
> +struct rte_scheddev_node_capabilities {
> +	/**< Private shaper support. */
> +	int shaper_private_supported;
> +
> +	/**< Dual rate shaping support for private shaper. Valid only when
> +	 * private shaper is supported.
> +	 */
> +	int shaper_private_dual_rate_supported;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_max;
> +
> +	/**< Maximum number of supported shared shapers. The value of zero
> +	 * indicates that shared shapers are not supported.
> +	 */
> +	uint32_t shaper_shared_n_max;


Private vs shared: can we hide this detail inside the implementation?
What are the semantics of a shared shaper? We have the profile concept in
the spec; can that replace the shared shapers? I think it would help the
application if we could hide the private vs shared shaper detail in the
implementation.
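Since the private and shared shaper rates quoted above are expressed in bytes per second, the framing overhead constants defined earlier in the header decide how many packets a given rate admits. The helper below is an illustrative sketch, not part of the proposed API: it divides a byte rate by the per-frame on-wire size. With 10 Gbit/s (1,250,000,000 bytes/s), a 64-byte frame that already includes FCS plus the 20-byte overhead, or a 60-byte frame without FCS plus the 24-byte overhead, both occupy 84 bytes, giving the familiar 14,880,952 packets per second.

```c
#include <stdint.h>

/* Values as defined in the patch under review. */
#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD      20 /* preamble 7 + SFD 1 + IFG 12 */
#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS  24 /* the above + 4-byte FCS */

/* Hypothetical helper: packets per second a byte-rate shaper admits when
 * each packet occupies pkt_len + overhead bytes on the wire (rounded down). */
static uint64_t
shaper_max_pps(uint64_t rate_bytes_per_sec, uint32_t pkt_len, uint32_t overhead)
{
	return rate_bytes_per_sec / ((uint64_t)pkt_len + overhead);
}
```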

> +
> +	/**< Items valid only for non-leaf nodes. */
> +	struct {
> +		/**< Maximum number of children nodes. */
> +		uint32_t n_children_max;
> +
> +		/**< Lowest priority supported. The value of 1 indicates that
> +		 * only priority 0 is supported, which essentially means that
> +		 * Strict Priority (SP) algorithm is not supported.
> +		 */
> +		uint32_t sp_priority_min;

As Hemant suggested, _level_ may be the right name here.

> +
> +		/**< Maximum number of sibling nodes that can have the same
> +		 * priority at any given time. When equal to *n_children_max*,
> +		 * it indicates that WFQ/WRR algorithms are not supported.
> +		 */
> +		uint32_t sp_n_children_max;

In our HW, a node can have 10 separate SP priority siblings, or a WRR with N downstream nodes.
Valid configurations are:
- <=10 siblings with <=10 static priorities;
- We can choose a sibling as WRR and connect N WRR siblings + <=9
  static priority nodes.

Not sure how that constraint maps here.
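One reading of this constraint, sketched below under explicit assumptions: at most 10 distinct priorities among the children, and at most one priority group may contain more than one child (that group being the WRR set). A single sp_n_children_max value cannot express "only one group may be larger than 1"; setting it to N would also allow several WRR groups. The function name, the 10-level constant as a macro, and the 0..31 priority bound are hypothetical choices for this illustration.

```c
#include <stdint.h>

#define HW_SP_LEVELS 10   /* from the constraint described above */
#define SKETCH_MAX_PRIO 32 /* assumed priority bound for this sketch */

/* Return 1 if the children's priorities fit the described HW, else 0. */
static int hw_children_ok(const uint32_t *prio, uint32_t n_children)
{
	uint32_t count[SKETCH_MAX_PRIO] = { 0 };
	uint32_t i, distinct = 0, wrr_groups = 0;

	for (i = 0; i < n_children; i++) {
		if (prio[i] >= SKETCH_MAX_PRIO)
			return 0;
		count[prio[i]]++;
	}
	for (i = 0; i < SKETCH_MAX_PRIO; i++) {
		if (count[i] > 0)
			distinct++;   /* one SP slot used */
		if (count[i] > 1)
			wrr_groups++; /* slot shared by several children */
	}
	return distinct <= HW_SP_LEVELS && wrr_groups <= 1;
}
```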

> +
> +		/**< WFQ algorithm support. */
> +		int scheduling_wfq_supported;
> +
> +		/**< WRR algorithm support. */
> +		int scheduling_wrr_supported;
> +
> +		/**< Maximum WFQ/WRR weight. */
> +		uint32_t scheduling_wfq_wrr_weight_max;
> +	} nonleaf;
> +
> +	/**< Items valid only for leaf nodes. */
> +	struct {
> +		/**< Head drop algorithm support. */
> +		int cman_head_drop_supported;
> +
> +		/**< Private WRED context support. */
> +		int cman_wred_context_private_supported;
> +
> +		/**< Maximum number of shared WRED contexts supported. The value
> +		 * of zero indicates that shared WRED contexts are not
> +		 * supported.
> +		 */
> +		uint32_t cman_wred_context_shared_n_max;
> +	} leaf;
> +};
> +
> +/**
> +  * Scheduler capabilities
> +  */
> +struct rte_scheddev_capabilities {
> +	/**< Maximum number of nodes. */
> +	uint32_t n_nodes_max;
> +
> +	/**< Maximum number of levels (i.e. number of nodes connecting the root
> +	 * node with any leaf node, including the root and the leaf).
> +	 */
> +	uint32_t n_levels_max;
> +
> +	/**< Maximum number of shapers, either private or shared. In case the
> +	 * implementation does not share any resource between private and
> +	 * shared shapers, it is typically equal to the sum between
> +	 * *shaper_private_n_max* and *shaper_shared_n_max*.
> +	 */
> +	uint32_t shaper_n_max;
> +
> +	/**< Maximum number of private shapers. Indicates the maximum number of
> +	 * nodes that can concurrently have the private shaper enabled.
> +	 */
> +	uint32_t shaper_private_n_max;
> +
> +	/**< Maximum number of shared shapers. The value of zero indicates that
> +	  * shared shapers are not supported.
> +	  */
> +	uint32_t shaper_shared_n_max;
> +
> +	/**< Maximum number of nodes that can share the same shared shaper. Only
> +	  * valid when shared shapers are supported.
> +	  */
> +	uint32_t shaper_shared_n_nodes_max;
> +
> +	/**< Maximum number of shared shapers that can be configured with dual
> +	  * rate shaping. The value of zero indicates that dual rate shaping
> +	  * support is not available for shared shapers.
> +	  */
> +	uint32_t shaper_shared_dual_rate_n_max;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for shared
> +	  * shapers. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for shared
> +	  * shaper. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_max;

As Hemant suggested, we need additional per-level capabilities. The number of
shared shapers and the number of nodes are limited in our HW, and we cannot move
a node from one level to another.

> +
> +	/**< Minimum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_min;
> +
> +	/**< Maximum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_max;
> +
> +	/**< Maximum number of WRED contexts. */
> +	uint32_t cman_wred_context_n_max;
> +
> +	/**< Maximum number of private WRED contexts. Indicates the maximum
> +	  * number of leaf nodes that can concurrently have the private WRED
> +	  * context enabled.
> +	  */
> +	uint32_t cman_wred_context_private_n_max;
> +
> +	/**< Maximum number of shared WRED contexts. The value of zero indicates
> +	  * that shared WRED contexts are not supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_max;
> +
> +	/**< Maximum number of leaf nodes that can share the same WRED context.
> +	  * Only valid when shared WRED contexts are supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_nodes_max;
> +
> +	/**< Support for VLAN DEI packet marking. */
> +	int mark_vlan_dei_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of TCP packets. */
> +	int mark_ip_ecn_tcp_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of SCTP packets. */
> +	int mark_ip_ecn_sctp_supported;
> +
> +	/**< Support for IPv4/IPv6 DSCP packet marking. */
> +	int mark_ip_dscp_supported;
> +
> +	/**< Summary of node-level capabilities across all nodes. */
> +	struct rte_scheddev_node_capabilities node;
> +};
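For illustration only, here is a minimal sketch of how an application might check a planned shared-shaper setup against the port-level limits above before calling into the API. `struct sched_caps_subset`, `struct shared_shaper_plan` and `shared_shaper_plan_check()` are hypothetical names. The struct mirrors only four of the quoted `rte_scheddev_capabilities` fields and is not the DPDK type.

```c
#include <stdint.h>

/* Hypothetical subset of the capability fields quoted above. */
struct sched_caps_subset {
	uint32_t shaper_shared_n_max;       /* 0 = shared shapers unsupported */
	uint32_t shaper_shared_n_nodes_max; /* max nodes per shared shaper */
	uint64_t shaper_shared_rate_min;    /* bytes per second */
	uint64_t shaper_shared_rate_max;    /* bytes per second */
};

/* Hypothetical description of what the application wants to set up. */
struct shared_shaper_plan {
	uint32_t n_shared_shapers;
	uint32_t max_nodes_per_shaper;
	uint64_t rate; /* bytes per second */
};

/* Returns 0 if the plan fits the advertised capabilities, -1 otherwise. */
int shared_shaper_plan_check(const struct sched_caps_subset *cap,
	const struct shared_shaper_plan *plan)
{
	if (plan->n_shared_shapers == 0)
		return 0; /* nothing to check */
	if (cap->shaper_shared_n_max == 0)
		return -1; /* shared shapers not supported at all */
	if (plan->n_shared_shapers > cap->shaper_shared_n_max)
		return -1;
	if (plan->max_nodes_per_shaper > cap->shaper_shared_n_nodes_max)
		return -1;
	if (plan->rate < cap->shaper_shared_rate_min ||
		plan->rate > cap->shaper_shared_rate_max)
		return -1;
	return 0;
}
```

A port-wide check like this cannot express per-level limits such as the ones raised in the review, which is one argument for per-level capabilities.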
> +
> +/**
> +  * Congestion management (CMAN) mode
> +  *
> +  * This is used for controlling the admission of packets into a packet queue or
> +  * group of packet queues on congestion. When a new packet arrives while the
> +  * queue is full, the *tail drop* algorithm drops the new packet and leaves the
> +  * queue unmodified, whereas the *head drop* algorithm drops the packet at the
> +  * head of the queue (the oldest packet waiting in the queue) and admits the
> +  * new packet at the tail of the queue.
> +  *
> +  * The *Random Early Detection (RED)* algorithm works by proactively dropping
> +  * more and more input packets as the queue occupancy builds up. When the queue
> +  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
> +  * RED* algorithm uses a separate set of RED thresholds for each packet color.
> +  */
> +enum rte_scheddev_cman_mode {
> +	RTE_SCHEDDEV_CMAN_TAIL_DROP = 0, /**< Tail drop */
> +	RTE_SCHEDDEV_CMAN_HEAD_DROP, /**< Head drop */
> +	RTE_SCHEDDEV_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
> +};
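Here is a minimal sketch contrasting the *tail drop* and *head drop* admission behavior described in the CMAN comment. It uses a hypothetical fixed-size FIFO of packet IDs. `struct pkt_queue`, `queue_admit()` and `QSIZE` are illustrative and are not part of the proposed API.

```c
#include <stdint.h>

#define QSIZE 4u

/* Hypothetical fixed-size FIFO of packet identifiers. */
struct pkt_queue {
	uint32_t pkts[QSIZE];
	uint32_t head;  /* index of the oldest packet */
	uint32_t count; /* number of packets currently queued */
};

enum drop_mode { DROP_TAIL, DROP_HEAD };

/* Enqueue pkt. Returns the ID of the dropped packet, or UINT32_MAX if
 * nothing was dropped. */
uint32_t queue_admit(struct pkt_queue *q, uint32_t pkt, enum drop_mode mode)
{
	uint32_t dropped;

	if (q->count < QSIZE) {
		q->pkts[(q->head + q->count) % QSIZE] = pkt;
		q->count++;
		return UINT32_MAX;
	}

	if (mode == DROP_TAIL)
		return pkt; /* queue left unmodified, new packet dropped */

	/* Head drop: discard the oldest packet, then admit the new one at
	 * the tail. The queue stays full. */
	dropped = q->pkts[q->head];
	q->head = (q->head + 1) % QSIZE;
	q->pkts[(q->head + q->count - 1) % QSIZE] = pkt;
	return dropped;
}
```

Both modes keep the queue full. They differ only in which packet is discarded.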
> +
> +/**
> +  * Color
> +  */
> +enum rte_scheddev_color {
> +	e_RTE_SCHEDDEV_GREEN = 0, /**< Green */
> +	e_RTE_SCHEDDEV_YELLOW,    /**< Yellow */
> +	e_RTE_SCHEDDEV_RED,       /**< Red */
> +	e_RTE_SCHEDDEV_COLORS     /**< Number of colors */

Maybe we can remove the e_ prefix here to be in line with other DPDK enums.

> +};
> +
> +/**
> +  * WRED profile
> +  */
> +struct rte_scheddev_wred_params {
> +	/**< One set of RED parameters per packet color */
> +	struct rte_red_params red_params[e_RTE_SCHEDDEV_COLORS];
> +};
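The *Weighted RED* idea (one set of RED thresholds per color, as in `struct rte_scheddev_wred_params`) can be sketched as shown below. This is a simplified classic-RED probability curve, not the `librte_sched` implementation. `struct red_thresholds` only assumes fields modeled on `struct rte_red_params` (`min_th`, `max_th`, `maxp_inv`). The average queue size is taken as an input rather than computed.

```c
#include <stdint.h>

enum color { GREEN = 0, YELLOW, RED, N_COLORS };

/* Simplified per-color thresholds (assumed layout; the max drop
 * probability is 1 / maxp_inv). */
struct red_thresholds {
	uint32_t min_th;   /* packets */
	uint32_t max_th;   /* packets, must be > min_th */
	uint32_t maxp_inv; /* inverse of the max drop probability */
};

struct wred_profile {
	struct red_thresholds red[N_COLORS]; /* one set per packet color */
};

/* Classic RED drop probability for an average queue size avg. */
double red_drop_prob(const struct red_thresholds *t, double avg)
{
	if (avg < t->min_th)
		return 0.0;
	if (avg >= t->max_th)
		return 1.0; /* behaves like tail drop */
	return ((avg - t->min_th) / (t->max_th - t->min_th)) / t->maxp_inv;
}

/* WRED: select the threshold set matching the packet color. */
double wred_drop_prob(const struct wred_profile *p, enum color c, double avg)
{
	return red_drop_prob(&p->red[c], avg);
}
```

Giving red packets lower thresholds than green ones makes them the first to be dropped as occupancy builds up.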
> +
> +/**
> +  * Token bucket
> +  */
> +struct rte_scheddev_token_bucket {
> +	/**< Token bucket rate (bytes per second) */
> +	uint64_t rate;
> +
> +	/**< Token bucket size (bytes), a.k.a. max burst size */
> +	uint64_t size;
> +};
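Here is a sketch of a single-rate token bucket using the units of `struct rte_scheddev_token_bucket` (rate in bytes per second, size in bytes). It also applies a packet length adjustment in the spirit of the `shaper_pkt_length_adjust_min`/`_max` capabilities, for example +24 bytes for Ethernet preamble, SFD, inter-frame gap and FCS. `struct tb`, `tb_refill()` and `tb_conform()` are hypothetical. A dual-rate shaper would presumably run two such buckets (committed and peak).

```c
#include <stdint.h>

#define NS_PER_S 1000000000ull

/* Hypothetical token bucket state. */
struct tb {
	uint64_t rate;    /* bytes per second */
	uint64_t size;    /* bucket depth (max burst), bytes */
	uint64_t tokens;  /* current credit, bytes */
	uint64_t last_ns; /* time of the last credited refill, nanoseconds */
};

/* Add credit for the elapsed time, capped at the bucket size. If less than
 * one byte accrued, last_ns is kept so the elapsed time is not lost.
 * Sub-byte remainders are dropped; no overflow guard for huge intervals. */
void tb_refill(struct tb *b, uint64_t now_ns)
{
	uint64_t add = (now_ns - b->last_ns) * b->rate / NS_PER_S;

	if (add == 0)
		return;
	b->tokens = (b->tokens + add > b->size) ? b->size : b->tokens + add;
	b->last_ns = now_ns;
}

/* Returns 1 and consumes credit if the packet conforms, 0 otherwise.
 * The shaper charges pkt_len + pkt_len_adjust bytes (never negative). */
int tb_conform(struct tb *b, uint32_t pkt_len, int32_t pkt_len_adjust,
	uint64_t now_ns)
{
	int64_t len = (int64_t)pkt_len + pkt_len_adjust;

	if (len < 0)
		len = 0;
	tb_refill(b, now_ns);
	if ((uint64_t)len > b->tokens)
		return 0;
	b->tokens -= (uint64_t)len;
	return 1;
}
```

With rate 1000 B/s and size 1500 B, a 1000-byte packet adjusted by +24 consumes 1024 bytes of credit. A second one only conforms after enough time has passed to refill the bucket.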
> +
> +  * Each leaf node sits on top of a TX queue of the current Ethernet port.
> +  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
> +  * where N is the number of TX queues configured for the current Ethernet port.
> +  * The non-leaf nodes have their IDs generated by the application.
> +  */
> +struct rte_scheddev_node_params {
> +	/**< Shaper profile for the private shaper. The absence of the private
> +	 * shaper for the current node is indicated by setting this parameter
> +	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
> +	 */

I think we need to add node attributes such as priority, weight and level to
the node params, and let rte_scheddev_node_add take only node_id and
parent_node_id.


> +	uint32_t shaper_profile_id;
> +
> +	/**< User allocated array of valid shared shaper IDs. */
> +	uint32_t *shared_shaper_id;
> +
> +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> +	uint32_t n_shared_shapers;
> +
> +	union {
> +		/**< Parameters only valid for non-leaf nodes. */
> +		struct {
> +			/**< For each priority, indicates whether the children
> +			 * nodes sharing the same priority are to be scheduled
> +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> +			 * is to be used for all priorities. When non-NULL, it
> +			 * points to a pre-allocated array of *n_priorities*
> +			 * elements, with a non-zero value element indicating
> +			 * WFQ and a zero value element for WRR.
> +			 */
> +			int *scheduling_mode_per_priority;
> +
> +			/**< Number of priorities. */
> +			uint32_t n_priorities;
> +		} nonleaf;
> +
> +		/**< Parameters only valid for leaf nodes. */
> +		struct {
> +			/**< Congestion management mode */
> +			enum rte_scheddev_cman_mode cman;
> +
> +			/**< WRED parameters (valid when *cman* is WRED). */
> +			struct {
> +				/**< WRED profile for private WRED context. The
> +				 * absence of a private WRED context for the
> +				 * current leaf node is indicated by value
> +				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
> +				 */
> +				uint32_t wred_profile_id;
> +
> +				/**< User allocated array of shared WRED context
> +				 * IDs.
> +				 */
> +				uint32_t *shared_wred_context_id;
> +
> +				/**< Number of shared WRED context IDs in the
> +				 * *shared_wred_context_id* array.
> +				 */
> +				uint32_t n_shared_wred_contexts;
> +			} wred;
> +		} leaf;
> +	};
> +};
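Because leaf node IDs are fixed to 0 .. (N-1), one per TX queue, the application only has to keep its own non-leaf IDs out of that range. A minimal sketch of one possible convention follows. `node_is_leaf()`, `nonleaf_node_id()` and the per-level `stride` layout are hypothetical and are not mandated by the API.

```c
#include <stdint.h>

/* Leaf nodes are predefined as 0 .. (n_tx_queues - 1). */
int node_is_leaf(uint32_t node_id, uint32_t n_tx_queues)
{
	return node_id < n_tx_queues;
}

/* Hypothetical convention: non-leaf node `index` on hierarchy `level`
 * gets ID n_tx_queues + level * stride + index, so it never collides
 * with a leaf ID. Returns UINT32_MAX if index does not fit in the
 * stride. No overflow check for very large level * stride. */
uint32_t nonleaf_node_id(uint32_t n_tx_queues, uint32_t level,
	uint32_t index, uint32_t stride)
{
	if (index >= stride)
		return UINT32_MAX;
	return n_tx_queues + level * stride + index;
}
```

A node ID then also tells the application which member of the `leaf`/`nonleaf` union in `struct rte_scheddev_node_params` it should fill in.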
> +
> +	const char *message; /**< Human-readable error message. */
> +};
> +
> +/**
> + * Scheduler capabilities get
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param cap
> + *   Scheduler capabilities. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_capabilities_get(uint8_t port_id,
> +	struct rte_scheddev_capabilities *cap,
> +	struct rte_scheddev_error *error);

As mentioned above, IMO it is better to have per-level capabilities too.

> +
> +/**
> + * Scheduler node add
> + *
> + * When *node_id* is not a valid node ID, a new node with this ID is created and
> + * connected as child to the existing node identified by *parent_node_id*.
> + *
> + * When *node_id* is a valid node ID, this node is disconnected from its current
> + * parent and connected as child to another existing node identified by
> + * *parent_node_id*.
> + *
> + * This function can be called during port initialization phase (before the
> + * Ethernet port is started) for building the scheduler start-up hierarchy.
> + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> + * hierarchy updates, this function can also be called during run-time (after
> + * the Ethernet port is started).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID
> + * @param parent_node_id
> + *   Parent node ID. Needs to be valid.
> + * @param priority
> + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> + *   running on the parent of the current node for scheduling this child node.
> + * @param weight
> + *   Node weight. The node weight is relative to the weight sum of all siblings
> + *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
> + *   algorithm running on the parent of the current node for scheduling this
> + *   child node.
> + * @param params
> + *   Node parameters. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,

See the comment on struct rte_scheddev_node_params.

> +	struct rte_scheddev_node_params *params,
> +	struct rte_scheddev_error *error);
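The priority and weight semantics documented for `rte_scheddev_node_add()` can be illustrated with a small selector: SP across priorities with 0 as the highest, and weights relative to same-priority siblings. This is a sketch, not a driver algorithm. It uses smooth weighted round robin, which counts packets rather than bytes, so it behaves as WRR rather than WFQ. `struct child` and `select_child()` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-child scheduling state. */
struct child {
	uint32_t priority; /* 0 is the highest priority */
	uint32_t weight;   /* >= 1, relative to same-priority siblings */
	int backlogged;    /* non-zero if the child has packets to send */
	int64_t current;   /* smooth WRR running credit */
};

/* Returns the index of the next child to serve, or -1 if no child is
 * backlogged. */
int select_child(struct child *c, int n)
{
	uint32_t best_prio = 0;
	int found = 0;
	int64_t total = 0;
	int pick = -1;
	int i;

	/* Strict priority: find the numerically lowest backlogged priority. */
	for (i = 0; i < n; i++) {
		if (c[i].backlogged && (!found || c[i].priority < best_prio)) {
			best_prio = c[i].priority;
			found = 1;
		}
	}
	if (!found)
		return -1;

	/* Smooth WRR among backlogged siblings at that priority: each gains
	 * its weight, the largest credit wins and pays back the total. */
	for (i = 0; i < n; i++) {
		if (!c[i].backlogged || c[i].priority != best_prio)
			continue;
		c[i].current += c[i].weight;
		total += c[i].weight;
		if (pick < 0 || c[i].current > c[pick].current)
			pick = i;
	}
	c[pick].current -= total;
	return pick;
}
```

Consider two backlogged siblings at the same priority with weights 2 and 1. The selector serves them in a 2:1 ratio (A, B, A, ...), and a backlogged priority-0 sibling always preempts both.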
> +
> +/**
> + * Scheduler node delete
> + *
> + * Delete an existing node. This operation fails when this node currently has at
> + * least one user (i.e. child node).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_delete(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node suspend
> + *
> + * Suspend an existing node.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_suspend(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);

What is the use case for this? Is it the same as setting CIR and PIR to zero
to drop the packets, or is it related to dynamic topology change?
IMO, dynamic topology change should be based on capability.


> +
> +/**
> + * Scheduler node resume
> + *
> + * Resume an existing node that was previously suspended.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_resume(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_parent_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_error *error);
> +

IMO, dynamic topology change should be based on capability.

> + * Scheduler node scheduling mode update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid non-leaf node ID.
> + * @param scheduling_mode_per_priority
> + *   For each priority, indicates whether the children nodes sharing the same
> + *   priority are to be scheduled by WFQ or by WRR. When NULL, it indicates that
> + *   WFQ is to be used for all priorities. When non-NULL, it points to a
> + *   pre-allocated array of *n_priorities* elements, with a non-zero value element
> + *   indicating WFQ and a zero value element for WRR.
> + * @param n_priorities
> + *   Number of priorities.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_scheduling_mode_update(uint8_t port_id,
> +	uint32_t node_id,
> +	int *scheduling_mode_per_priority,
> +	uint32_t n_priorities,
> +	struct rte_scheddev_error *error);

Do we really need to expose whether the driver implements WFQ or WRR? Any
weighted scheme should be fine, right? No strong opinion though.
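Here is a sketch of how a driver might interpret `scheduling_mode_per_priority`, following the convention in the comment above: NULL means WFQ for every priority, a non-zero element means WFQ, and a zero element means WRR. `sched_mode_for_priority()` and `enum wmode` are hypothetical names. In practice WRR typically apportions packets while WFQ apportions bytes, which is the distinction behind the question above.

```c
#include <stddef.h>
#include <stdint.h>

enum wmode { MODE_WRR = 0, MODE_WFQ = 1 };

/* Resolve the weighted scheduling mode for one priority of a non-leaf
 * node. Returns -1 for an out-of-range priority. */
int sched_mode_for_priority(const int *scheduling_mode_per_priority,
	uint32_t n_priorities, uint32_t priority)
{
	if (priority >= n_priorities)
		return -1;
	if (scheduling_mode_per_priority == NULL)
		return MODE_WFQ; /* NULL array: WFQ for all priorities */
	return scheduling_mode_per_priority[priority] ? MODE_WFQ : MODE_WRR;
}
```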

> +/**
> + * Scheduler packet marking - VLAN DEI (IEEE 802.1Q)
> + *
> + * IEEE 802.1p maps the traffic class to the VLAN Priority Code Point (PCP)
> + * field (3 bits), while IEEE 802.1q maps the drop priority to the VLAN Drop
> + * Eligible Indicator (DEI) field (1 bit), which was previously named Canonical
> + * Format Indicator (CFI).
> + *
> + * All VLAN frames of a given color get their DEI bit set if marking is enabled
> + * for this color; otherwise, their DEI bit is left as is (either set or not).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mark_green
> + *   Set to non-zero value to enable marking of green packets and to zero to
> + *   disable it.
> + * @param mark_yellow
> + *   Set to non-zero value to enable marking of yellow packets and to zero to
> + *   disable it.
> + * @param mark_red
> + *   Set to non-zero value to enable marking of red packets and to zero to
> + *   disable it.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_mark_vlan_dei(uint8_t port_id,
> +	int mark_green,

We think we don't need marking for the green color in any of the marking APIs.

> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
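Here is a minimal sketch of the per-color DEI rule above, applied to a 16-bit VLAN TCI (PCP in bits 15-13, DEI in bit 12, VID in bits 11-0). `vlan_tci_mark_dei()` and `VLAN_DEI_MASK` are hypothetical names. The TCI is assumed to be in host byte order (convert from the wire with `ntohs()` first).

```c
#include <stdint.h>

#define VLAN_DEI_MASK 0x1000u /* TCI layout: PCP(3) | DEI(1) | VID(12) */

enum color { GREEN = 0, YELLOW, RED, N_COLORS };

/* Set DEI when marking is enabled for the packet's color; otherwise
 * leave the bit as is (set or not). */
uint16_t vlan_tci_mark_dei(uint16_t tci, enum color c,
	const int mark[N_COLORS])
{
	if (mark[c])
		tci = (uint16_t)(tci | VLAN_DEI_MASK);
	return tci;
}
```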
> +
> +/**
> + * Scheduler packet marking - IPv4 / IPv6 ECN (IETF RFC 3168)
> + *
> + * IETF RFCs 2474 and 3168 reorganize the IPv4 Type of Service (TOS) field
> +
> + * @param capability_stats_mask
> + *   Statistics counter types available for the current node. Needs to be
> + *   pre-allocated.
> + * @param enabled_stats_mask
> + *   Statistics counter types currently enabled for the current node. This is
> + *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_stats_get_enabled(uint8_t port_id,
> +	uint32_t node_id,
> +	uint64_t *capability_stats_mask,
> +	uint64_t *enabled_stats_mask,
> +	struct rte_scheddev_error *error);
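*enabled_stats_mask* must be a subset of *capability_stats_mask*, so an application can validate or trim a requested mask with plain bit operations. The following sketch uses hypothetical helper names and makes no assumption about the meaning of individual counter bits.

```c
#include <stdint.h>

/* Non-zero if every bit set in enabled is also set in capability. */
int stats_mask_is_subset(uint64_t capability, uint64_t enabled)
{
	return (enabled & ~capability) == 0;
}

/* Restrict a requested mask to the counters the node supports. */
uint64_t stats_mask_clamp(uint64_t capability, uint64_t requested)
{
	return requested & capability;
}
```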
> +
> +/**
> +int rte_scheddev_node_stats_read(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_stats *stats,
> +	int clear,
> +	struct rte_scheddev_error *error);

We need to add a stats reset too, right?

Jerin and Bala

> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* __INCLUDE_RTE_SCHEDDEV_H__ */
  

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index cc3bf98..666931d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -247,6 +247,10 @@  Flow API
 M: Adrien Mazarguil <adrien.mazarguil@6wind.com>
 F: lib/librte_ether/rte_flow*
 
+SchedDev API
+M: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
+F: lib/librte_ether/rte_scheddev*
+
 Crypto API
 M: Declan Doherty <declan.doherty@intel.com>
 F: lib/librte_cryptodev/
diff --git a/lib/librte_ether/Makefile b/lib/librte_ether/Makefile
index 1d095a9..7e0527f 100644
--- a/lib/librte_ether/Makefile
+++ b/lib/librte_ether/Makefile
@@ -1,6 +1,6 @@ 
 #   BSD LICENSE
 #
-#   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+#   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
 #   All rights reserved.
 #
 #   Redistribution and use in source and binary forms, with or without
@@ -45,6 +45,7 @@  LIBABIVER := 6
 
 SRCS-y += rte_ethdev.c
 SRCS-y += rte_flow.c
+SRCS-y += rte_scheddev.c
 
 #
 # Export include files
@@ -54,6 +55,8 @@  SYMLINK-y-include += rte_eth_ctrl.h
 SYMLINK-y-include += rte_dev_info.h
 SYMLINK-y-include += rte_flow.h
 SYMLINK-y-include += rte_flow_driver.h
+SYMLINK-y-include += rte_scheddev.h
+SYMLINK-y-include += rte_scheddev_driver.h
 
 # this lib depends upon:
 DEPDIRS-y += lib/librte_net lib/librte_eal lib/librte_mempool lib/librte_ring lib/librte_mbuf
diff --git a/lib/librte_ether/rte_ether_version.map b/lib/librte_ether/rte_ether_version.map
index d00cb5c..6b3c84f 100644
--- a/lib/librte_ether/rte_ether_version.map
+++ b/lib/librte_ether/rte_ether_version.map
@@ -159,5 +159,35 @@  DPDK_17.05 {
 	global:
 
 	rte_eth_dev_capability_control;
+	rte_scheddev_capabilities_get;
+	rte_scheddev_node_capabilities_get;
+	rte_scheddev_wred_profile_add;
+	rte_scheddev_wred_profile_delete;
+	rte_scheddev_shared_wred_context_add_update;
+	rte_scheddev_shared_wred_context_delete;
+	rte_scheddev_shaper_profile_add;
+	rte_scheddev_shaper_profile_delete;
+	rte_scheddev_shared_shaper_add_update;
+	rte_scheddev_shared_shaper_delete;
+	rte_scheddev_node_add;
+	rte_scheddev_node_delete;
+	rte_scheddev_node_suspend;
+	rte_scheddev_node_resume;
+	rte_scheddev_hierarchy_set;
+	rte_scheddev_node_parent_update;
+	rte_scheddev_node_shaper_update;
+	rte_scheddev_node_shared_shaper_update;
+	rte_scheddev_node_scheduling_mode_update;
+	rte_scheddev_node_cman_update;
+	rte_scheddev_node_wred_context_update;
+	rte_scheddev_node_shared_wred_context_update;
+	rte_scheddev_mark_vlan_dei;
+	rte_scheddev_mark_ip_ecn;
+	rte_scheddev_mark_ip_dscp;
+	rte_scheddev_stats_get_enabled;
+	rte_scheddev_stats_enable;
+	rte_scheddev_node_stats_get_enabled;
+	rte_scheddev_node_stats_enable;
+	rte_scheddev_node_stats_read;
 
 } DPDK_17.02;
diff --git a/lib/librte_ether/rte_scheddev.c b/lib/librte_ether/rte_scheddev.c
new file mode 100644
index 0000000..679a22d
--- /dev/null
+++ b/lib/librte_ether/rte_scheddev.c
@@ -0,0 +1,790 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+
+#include <rte_errno.h>
+#include <rte_branch_prediction.h>
+#include "rte_ethdev.h"
+#include "rte_scheddev_driver.h"
+#include "rte_scheddev.h"
+
+/* Get generic scheduler operations structure from a port. */
+const struct rte_scheddev_ops *
+rte_scheddev_ops_get(uint8_t port_id, struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops;
+
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		rte_scheddev_error_set(error,
+			ENODEV,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENODEV));
+		return NULL;
+	}
+
+	if ((dev->dev_ops->cap_ctrl == NULL) ||
+		dev->dev_ops->cap_ctrl(dev, RTE_ETH_CAPABILITY_SCHED, &ops) ||
+		(ops == NULL)) {
+		rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+		return NULL;
+	}
+
+	return ops;
+}
+
+/* Get capabilities */
+int rte_scheddev_capabilities_get(uint8_t port_id,
+	struct rte_scheddev_capabilities *cap,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->capabilities_get == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->capabilities_get(dev, cap, error);
+}
+
+/* Get node capabilities */
+int rte_scheddev_node_capabilities_get(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_node_capabilities *cap,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_capabilities_get == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_capabilities_get(dev, node_id, cap, error);
+}
+
+/* Add WRED profile */
+int rte_scheddev_wred_profile_add(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_wred_params *profile,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->wred_profile_add == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->wred_profile_add(dev, wred_profile_id, profile, error);
+}
+
+/* Delete WRED profile */
+int rte_scheddev_wred_profile_delete(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->wred_profile_delete == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->wred_profile_delete(dev, wred_profile_id, error);
+}
+
+/* Add/update shared WRED context */
+int rte_scheddev_shared_wred_context_add_update(uint8_t port_id,
+	uint32_t shared_wred_context_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shared_wred_context_add_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shared_wred_context_add_update(dev, shared_wred_context_id,
+		wred_profile_id, error);
+}
+
+/* Delete shared WRED context */
+int rte_scheddev_shared_wred_context_delete(uint8_t port_id,
+	uint32_t shared_wred_context_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shared_wred_context_delete == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shared_wred_context_delete(dev, shared_wred_context_id,
+		error);
+}
+
+/* Add shaper profile */
+int rte_scheddev_shaper_profile_add(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_shaper_params *profile,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shaper_profile_add == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shaper_profile_add(dev, shaper_profile_id, profile, error);
+}
+
+/* Delete shaper profile */
+int rte_scheddev_shaper_profile_delete(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shaper_profile_delete == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shaper_profile_delete(dev, shaper_profile_id, error);
+}
+
+/* Add/update shared shaper */
+int rte_scheddev_shared_shaper_add_update(uint8_t port_id,
+	uint32_t shared_shaper_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shared_shaper_add_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shared_shaper_add_update(dev, shared_shaper_id,
+		shaper_profile_id, error);
+}
+
+/* Delete shared shaper */
+int rte_scheddev_shared_shaper_delete(uint8_t port_id,
+	uint32_t shared_shaper_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->shared_shaper_delete == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->shared_shaper_delete(dev, shared_shaper_id, error);
+}
+
+/* Add node to port scheduler hierarchy */
+int rte_scheddev_node_add(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_node_params *params,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_add == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_add(dev, node_id, parent_node_id, priority, weight,
+		params, error);
+}
+
+/* Delete node from scheduler hierarchy */
+int rte_scheddev_node_delete(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_delete == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_delete(dev, node_id, error);
+}
+
+/* Suspend node */
+int rte_scheddev_node_suspend(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_suspend == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_suspend(dev, node_id, error);
+}
+
+/* Resume node */
+int rte_scheddev_node_resume(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_resume == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_resume(dev, node_id, error);
+}
+
+/* Set the initial port scheduler hierarchy */
+int rte_scheddev_hierarchy_set(uint8_t port_id,
+	int clear_on_fail,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->hierarchy_set == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->hierarchy_set(dev, clear_on_fail, error);
+}
+
+/* Update node parent */
+int rte_scheddev_node_parent_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_parent_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_parent_update(dev, node_id, parent_node_id, priority,
+		weight, error);
+}
+
+/* Update node private shaper */
+int rte_scheddev_node_shaper_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_shaper_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_shaper_update(dev, node_id, shaper_profile_id,
+		error);
+}
+
+/* Update node shared shapers */
+int rte_scheddev_node_shared_shaper_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shared_shaper_id,
+	int add,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_shared_shaper_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_shared_shaper_update(dev, node_id, shared_shaper_id,
+		add, error);
+}
+
+/* Update scheduling mode */
+int rte_scheddev_node_scheduling_mode_update(uint8_t port_id,
+	uint32_t node_id,
+	int *scheduling_mode_per_priority,
+	uint32_t n_priorities,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_scheduling_mode_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_scheduling_mode_update(dev, node_id,
+		scheduling_mode_per_priority, n_priorities, error);
+}
+
+/* Update node congestion management mode */
+int rte_scheddev_node_cman_update(uint8_t port_id,
+	uint32_t node_id,
+	enum rte_scheddev_cman_mode cman,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_cman_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_cman_update(dev, node_id, cman, error);
+}
+
+/* Update node private WRED context */
+int rte_scheddev_node_wred_context_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_wred_context_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_wred_context_update(dev, node_id, wred_profile_id,
+		error);
+}
+
+/* Update node shared WRED context */
+int rte_scheddev_node_shared_wred_context_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shared_wred_context_id,
+	int add,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_shared_wred_context_update == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_shared_wred_context_update(dev, node_id,
+		shared_wred_context_id, add, error);
+}
+
+/* Packet marking - VLAN DEI */
+int rte_scheddev_mark_vlan_dei(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->mark_vlan_dei == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->mark_vlan_dei(dev, mark_green, mark_yellow, mark_red,
+		error);
+}
+
+/* Packet marking - IPv4/IPv6 ECN */
+int rte_scheddev_mark_ip_ecn(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->mark_ip_ecn == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->mark_ip_ecn(dev, mark_green, mark_yellow, mark_red, error);
+}
+
+/* Packet marking - IPv4/IPv6 DSCP */
+int rte_scheddev_mark_ip_dscp(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->mark_ip_dscp == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->mark_ip_dscp(dev, mark_green, mark_yellow, mark_red,
+		error);
+}
+
+/* Get set of stats counter types currently enabled for all nodes */
+int rte_scheddev_stats_get_enabled(uint8_t port_id,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->stats_get_enabled == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->stats_get_enabled(dev,
+		nonleaf_node_capability_stats_mask,
+		nonleaf_node_enabled_stats_mask,
+		leaf_node_capability_stats_mask,
+		leaf_node_enabled_stats_mask,
+		error);
+}
+
+/* Enable specified set of stats counter types for all nodes */
+int rte_scheddev_stats_enable(uint8_t port_id,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->stats_enable == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->stats_enable(dev,
+		nonleaf_node_enabled_stats_mask,
+		leaf_node_enabled_stats_mask,
+		error);
+}
+
+/* Get set of stats counter types currently enabled for specific node */
+int rte_scheddev_node_stats_get_enabled(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_stats_get_enabled == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_stats_get_enabled(dev,
+		node_id,
+		capability_stats_mask,
+		enabled_stats_mask,
+		error);
+}
+
+/* Enable specified set of stats counter types for specific node */
+int rte_scheddev_node_stats_enable(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_stats_enable == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_stats_enable(dev, node_id, enabled_stats_mask, error);
+}
+
+/* Read and/or clear stats counters for specific node */
+int rte_scheddev_node_stats_read(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_node_stats *stats,
+	int clear,
+	struct rte_scheddev_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_scheddev_ops *ops =
+		rte_scheddev_ops_get(port_id, error);
+
+	if (ops == NULL)
+		return -rte_errno;
+
+	if (ops->node_stats_read == NULL)
+		return -rte_scheddev_error_set(error,
+			ENOSYS,
+			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
+			NULL,
+			rte_strerror(ENOSYS));
+
+	return ops->node_stats_read(dev, node_id, stats, clear, error);
+}
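Every wrapper above follows the same pattern. It looks up the port's ops table and reports ENOSYS through the error structure when the driver leaves the callback NULL. Otherwise it forwards the call unchanged. The standalone sketch below reproduces that pattern so it compiles without DPDK. It uses hypothetical, cut-down types (`struct sched_dev`, `struct sched_ops`, `sched_error_set()`). Two details are assumptions, because neither `rte_scheddev_ops_get()` nor `rte_scheddev_error_set()` is shown in this excerpt:

- ENODEV is returned for a port without an ops table.
- The error helper returns the positive code, as the negation in the wrappers suggests.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-ins for the rte_scheddev error and ops types;
 * only one callback is modelled. */
enum sched_error_type { SCHED_ERROR_TYPE_NONE, SCHED_ERROR_TYPE_UNSPECIFIED };

struct sched_error {
	enum sched_error_type type;
	const void *cause;
	const char *message;
};

struct sched_dev;

struct sched_ops {
	int (*node_stats_enable)(struct sched_dev *dev, unsigned int node_id,
		unsigned long long mask, struct sched_error *error);
};

struct sched_dev {
	const struct sched_ops *ops; /* NULL: no scheduler support on port */
};

/* Record error details (only if the caller passed a structure) and return
 * the positive code, so callers can write "return -sched_error_set(...)". */
static int sched_error_set(struct sched_error *error, int code,
	enum sched_error_type type, const void *cause, const char *message)
{
	if (error != NULL) {
		error->type = type;
		error->cause = cause;
		error->message = message;
	}
	return code;
}

/* Generic wrapper: validate the ops table, then the callback, then forward. */
static int sched_node_stats_enable(struct sched_dev *dev, unsigned int node_id,
	unsigned long long mask, struct sched_error *error)
{
	/* Assumed error code for a port without an ops table. */
	if (dev == NULL || dev->ops == NULL)
		return -sched_error_set(error, ENODEV,
			SCHED_ERROR_TYPE_UNSPECIFIED, NULL, strerror(ENODEV));

	/* Optional operation not implemented by this driver. */
	if (dev->ops->node_stats_enable == NULL)
		return -sched_error_set(error, ENOSYS,
			SCHED_ERROR_TYPE_UNSPECIFIED, NULL, strerror(ENOSYS));

	return dev->ops->node_stats_enable(dev, node_id, mask, error);
}

/* Example driver callback that accepts any mask. */
static int demo_node_stats_enable(struct sched_dev *dev, unsigned int node_id,
	unsigned long long mask, struct sched_error *error)
{
	(void)dev;
	(void)node_id;
	(void)mask;
	(void)error;
	return 0;
}
```

Success is 0, and every failure comes back as a negative errno value. The error structure stays optional for the caller.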
diff --git a/lib/librte_ether/rte_scheddev.h b/lib/librte_ether/rte_scheddev.h
new file mode 100644
index 0000000..fed3df2
--- /dev/null
+++ b/lib/librte_ether/rte_scheddev.h
@@ -0,0 +1,1273 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __INCLUDE_RTE_SCHEDDEV_H__
+#define __INCLUDE_RTE_SCHEDDEV_H__
+
+/**
+ * @file
+ * RTE Generic Hierarchical Scheduler API
+ *
+ * This interface provides the ability to configure the hierarchical scheduler
+ * feature in a generic way.
+ */
+
+#include <stdint.h>
+
+#include <rte_red.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/** Ethernet framing overhead
+  *
+  * Overhead fields per Ethernet frame:
+  * 1. Preamble:                                            7 bytes;
+  * 2. Start of Frame Delimiter (SFD):                      1 byte;
+  * 3. Inter-Frame Gap (IFG):                              12 bytes.
+  */
+#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD                  20
+
+/**
+  * Ethernet framing overhead plus Frame Check Sequence (FCS). Useful when FCS
+  * is generated and added at the end of the Ethernet frame on TX side without
+  * any SW intervention.
+  */
+#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS              24
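The two overhead constants matter when converting between packet lengths and line rate. This minimal sketch assumes the shaper sees the frame length without FCS, which is 60 bytes for a minimum-size frame. Adding the 24 bytes of RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS gives the 84 bytes that frame occupies on the wire. At 10 GbE (1.25e9 bytes per second), that caps the port at roughly 14.88 million packets per second. The macro and function names in the sketch are local, hypothetical copies, not part of the API.

```c
#include <stdint.h>

/* Local copies of the two values defined above. */
#define ETH_FRAMING_OVERHEAD      20u /* preamble 7 + SFD 1 + IFG 12 */
#define ETH_FRAMING_OVERHEAD_FCS  24u /* the above plus the 4-byte FCS */

/* Bytes a frame occupies on the wire, given the length seen by the shaper
 * and the overhead that the shaper has to add back. */
static uint64_t wire_bytes(uint64_t pkt_len, uint64_t overhead)
{
	return pkt_len + overhead;
}

/* Maximum packet rate for a line rate expressed in bytes per second. */
static uint64_t max_pps(uint64_t line_rate_bytes_per_sec, uint64_t wire_len)
{
	return line_rate_bytes_per_sec / wire_len;
}
```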
+
+/**< Invalid WRED profile ID */
+#define RTE_SCHEDDEV_WRED_PROFILE_ID_NONE                  UINT32_MAX
+
+/**< Invalid shaper profile ID */
+#define RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE                UINT32_MAX
+
+/**< Scheduler hierarchy root node ID */
+#define RTE_SCHEDDEV_ROOT_NODE_ID                          UINT32_MAX
+
+
+/**
+  * Scheduler node capabilities
+  */
+struct rte_scheddev_node_capabilities {
+	/**< Private shaper support. */
+	int shaper_private_supported;
+
+	/**< Dual rate shaping support for private shaper. Valid only when
+	 * private shaper is supported.
+	 */
+	int shaper_private_dual_rate_supported;
+
+	/**< Minimum committed/peak rate (bytes per second) for private
+	 * shaper. Valid only when private shaper is supported.
+	 */
+	uint64_t shaper_private_rate_min;
+
+	/**< Maximum committed/peak rate (bytes per second) for private
+	 * shaper. Valid only when private shaper is supported.
+	 */
+	uint64_t shaper_private_rate_max;
+
+	/**< Maximum number of supported shared shapers. The value of zero
+	 * indicates that shared shapers are not supported.
+	 */
+	uint32_t shaper_shared_n_max;
+
+	/**< Items valid only for non-leaf nodes. */
+	struct {
+		/**< Maximum number of children nodes. */
+		uint32_t n_children_max;
+
+		/**< Lowest priority supported. The value of 1 indicates that
+		 * only priority 0 is supported, which essentially means that
+		 * Strict Priority (SP) algorithm is not supported.
+		 */
+		uint32_t sp_priority_min;
+
+		/**< Maximum number of sibling nodes that can have the same
+		 * priority at any given time. The value of 1 indicates that
+		 * WFQ/WRR algorithms are not supported, as no two siblings can
+		 * then share the same priority.
+		 */
+		uint32_t sp_n_children_max;
+
+		/**< WFQ algorithm support. */
+		int scheduling_wfq_supported;
+
+		/**< WRR algorithm support. */
+		int scheduling_wrr_supported;
+
+		/**< Maximum WFQ/WRR weight. */
+		uint32_t scheduling_wfq_wrr_weight_max;
+	} nonleaf;
+
+	/**< Items valid only for leaf nodes. */
+	struct {
+		/**< Head drop algorithm support. */
+		int cman_head_drop_supported;
+
+		/**< Private WRED context support. */
+		int cman_wred_context_private_supported;
+
+		/**< Maximum number of shared WRED contexts supported. The value
+		 * of zero indicates that shared WRED contexts are not
+		 * supported.
+		 */
+		uint32_t cman_wred_context_shared_n_max;
+	} leaf;
+};
+
+/**
+  * Scheduler capabilities
+  */
+struct rte_scheddev_capabilities {
+	/**< Maximum number of nodes. */
+	uint32_t n_nodes_max;
+
+	/**< Maximum number of levels (i.e. number of nodes connecting the root
+	 * node with any leaf node, including the root and the leaf).
+	 */
+	uint32_t n_levels_max;
+
+	/**< Maximum number of shapers, either private or shared. In case the
+	 * implementation does not share any resource between private and
+	 * shared shapers, it is typically equal to the sum of
+	 * *shaper_private_n_max* and *shaper_shared_n_max*.
+	 */
+	uint32_t shaper_n_max;
+
+	/**< Maximum number of private shapers. Indicates the maximum number of
+	 * nodes that can concurrently have the private shaper enabled.
+	 */
+	uint32_t shaper_private_n_max;
+
+	/**< Maximum number of shared shapers. The value of zero indicates that
+	 * shared shapers are not supported.
+	 */
+	uint32_t shaper_shared_n_max;
+
+	/**< Maximum number of nodes that can share the same shared shaper. Only
+	 * valid when shared shapers are supported.
+	 */
+	uint32_t shaper_shared_n_nodes_max;
+
+	/**< Maximum number of shared shapers that can be configured with dual
+	 * rate shaping. The value of zero indicates that dual rate shaping
+	 * support is not available for shared shapers.
+	 */
+	uint32_t shaper_shared_dual_rate_n_max;
+
+	/**< Minimum committed/peak rate (bytes per second) for shared
+	 * shapers. Only valid when shared shapers are supported.
+	 */
+	uint64_t shaper_shared_rate_min;
+
+	/**< Maximum committed/peak rate (bytes per second) for shared
+	 * shapers. Only valid when shared shapers are supported.
+	 */
+	uint64_t shaper_shared_rate_max;
+
+	/**< Minimum value allowed for packet length adjustment for
+	 * private/shared shapers.
+	 */
+	int shaper_pkt_length_adjust_min;
+
+	/**< Maximum value allowed for packet length adjustment for
+	 * private/shared shapers.
+	 */
+	int shaper_pkt_length_adjust_max;
+
+	/**< Maximum number of WRED contexts. */
+	uint32_t cman_wred_context_n_max;
+
+	/**< Maximum number of private WRED contexts. Indicates the maximum
+	 * number of leaf nodes that can concurrently have the private WRED
+	 * context enabled.
+	 */
+	uint32_t cman_wred_context_private_n_max;
+
+	/**< Maximum number of shared WRED contexts. The value of zero indicates
+	 * that shared WRED contexts are not supported.
+	 */
+	uint32_t cman_wred_context_shared_n_max;
+
+	/**< Maximum number of leaf nodes that can share the same WRED context.
+	 * Only valid when shared WRED contexts are supported.
+	 */
+	uint32_t cman_wred_context_shared_n_nodes_max;
+
+	/**< Support for VLAN DEI packet marking. */
+	int mark_vlan_dei_supported;
+
+	/**< Support for IPv4/IPv6 ECN marking of TCP packets. */
+	int mark_ip_ecn_tcp_supported;
+
+	/**< Support for IPv4/IPv6 ECN marking of SCTP packets. */
+	int mark_ip_ecn_sctp_supported;
+
+	/**< Support for IPv4/IPv6 DSCP packet marking. */
+	int mark_ip_dscp_supported;
+
+	/**< Summary of node-level capabilities across all nodes. */
+	struct rte_scheddev_node_capabilities node;
+};
+
+/**
+  * Congestion management (CMAN) mode
+  *
+  * This is used for controlling the admission of packets into a packet queue or
+  * group of packet queues on congestion. When a new packet arrives at a queue
+  * that is already full, the *tail drop* algorithm drops the new packet and
+  * leaves the queue unmodified, whereas the *head drop* algorithm drops the
+  * packet at the head of the queue (the oldest packet waiting in the queue) and
+  * admits the new packet at the tail of the queue.
+  *
+  * The *Random Early Detection (RED)* algorithm works by proactively dropping
+  * more and more input packets as the queue occupancy builds up. When the queue
+  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
+  * RED* algorithm uses a separate set of RED thresholds for each packet color.
+  */
+enum rte_scheddev_cman_mode {
+	RTE_SCHEDDEV_CMAN_TAIL_DROP = 0, /**< Tail drop */
+	RTE_SCHEDDEV_CMAN_HEAD_DROP, /**< Head drop */
+	RTE_SCHEDDEV_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
+};
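Tail drop and head drop only differ once the queue is full. The sketch below is a hypothetical fixed-capacity FIFO of packet IDs (`struct pkt_queue`, `enqueue()`, `peek()`), not driver code. It returns the ID of whichever packet was dropped so the two policies can be compared side by side. WRED is not modelled.

```c
#include <stddef.h>

#define QUEUE_CAP 4

enum drop_mode { DROP_TAIL, DROP_HEAD };

/* Fixed-capacity FIFO of packet IDs, stored as a ring. */
struct pkt_queue {
	int slot[QUEUE_CAP];
	size_t head;  /* index of the oldest packet */
	size_t count; /* number of packets currently queued */
};

/* Admit packet *pkt* into *q*. Returns the ID of the dropped packet, or -1
 * when nothing was dropped (packet IDs are assumed to be non-negative). */
static int enqueue(struct pkt_queue *q, int pkt, enum drop_mode mode)
{
	int dropped = -1;

	if (q->count == QUEUE_CAP) {
		if (mode == DROP_TAIL)
			return pkt; /* new packet dropped, queue unchanged */

		/* Head drop: discard the oldest packet to make room. */
		dropped = q->slot[q->head];
		q->head = (q->head + 1) % QUEUE_CAP;
		q->count--;
	}

	q->slot[(q->head + q->count) % QUEUE_CAP] = pkt;
	q->count++;
	return dropped;
}

/* Return the i-th oldest packet (0 = head of the queue). */
static int peek(const struct pkt_queue *q, size_t i)
{
	return q->slot[(q->head + i) % QUEUE_CAP];
}
```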
+
+/**
+  * Color
+  */
+enum rte_scheddev_color {
+	e_RTE_SCHEDDEV_GREEN = 0, /**< Green */
+	e_RTE_SCHEDDEV_YELLOW,    /**< Yellow */
+	e_RTE_SCHEDDEV_RED,       /**< Red */
+	e_RTE_SCHEDDEV_COLORS     /**< Number of colors */
+};
+
+/**
+  * WRED profile
+  */
+struct rte_scheddev_wred_params {
+	/**< One set of RED parameters per packet color */
+	struct rte_red_params red_params[e_RTE_SCHEDDEV_COLORS];
+};
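Each color has its own set of RED thresholds. As a result, a yellow or red packet can be dropped earlier than a green one at the same average queue size. The sketch applies the textbook RED drop probability to a per-color parameter set:

- 0 below `min_th`.
- A linear rise towards 1/`maxp_inv` between `min_th` and `max_th`.
- 1 from `max_th` upwards.

To compile standalone, it mirrors the `min_th`, `max_th`, `maxp_inv` and `wq_log2` fields of `struct rte_red_params` in a local struct. It does not reproduce the exact arithmetic of the librte_sched RED implementation. The average queue size is passed in directly rather than computed, and the color and type names are hypothetical.

```c
#include <stdint.h>

enum color { COLOR_GREEN, COLOR_YELLOW, COLOR_RED, N_COLORS };

/* Local mirror of the rte_red_params fields; wq_log2 (the averaging filter
 * weight) is unused because the average is supplied by the caller. */
struct red_params {
	uint16_t min_th;
	uint16_t max_th;
	uint16_t maxp_inv;
	uint16_t wq_log2;
};

/* One set of RED parameters per packet color, as in the WRED profile. */
struct wred_params {
	struct red_params red_params[N_COLORS];
};

/* Drop probability for a packet of color *c* at average queue size *avg*.
 * Assumes max_th > min_th and maxp_inv >= 1 for every color. */
static double wred_drop_prob(const struct wred_params *w, enum color c,
	double avg)
{
	const struct red_params *p = &w->red_params[c];

	if (avg < p->min_th)
		return 0.0;
	if (avg >= p->max_th)
		return 1.0;
	return (avg - p->min_th) / (double)(p->max_th - p->min_th) /
		p->maxp_inv;
}
```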
+
+/**
+  * Token bucket
+  */
+struct rte_scheddev_token_bucket {
+	/**< Token bucket rate (bytes per second) */
+	uint64_t rate;
+
+	/**< Token bucket size (bytes), a.k.a. max burst size */
+	uint64_t size;
+};
+
+/**
+  * Shaper (rate limiter) profile
+  *
+  * Multiple shaper instances can share the same shaper profile. Each node has
+  * zero or one private shaper (only one node using it) and/or zero, one or
+  * several shared shapers (multiple nodes use the same shaper instance).
+  *
+  * Single rate shapers use a single token bucket. A single rate shaper can be
+  * configured by setting the rate of the committed bucket to zero, which
+  * effectively disables this bucket. The peak bucket is used to limit the rate
+  * and the burst size for the current shaper.
+  *
+  * Dual rate shapers use both the committed and the peak token buckets. The
+  * rate of the committed bucket has to be less than or equal to the rate of the
+  * peak bucket.
+  */
+struct rte_scheddev_shaper_params {
+	/**< Committed token bucket */
+	struct rte_scheddev_token_bucket committed;
+
+	/**< Peak token bucket */
+	struct rte_scheddev_token_bucket peak;
+
+	/**< Signed value to be added to the length of each packet for the
+	 * purpose of shaping. Can be used to correct the packet length with
+	 * the framing overhead bytes that are also consumed on the wire (e.g.
+	 * RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS).
+	 */
+	int32_t pkt_length_adjust;
+};
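A token bucket holds credit: it refills at *rate* bytes per second up to *size* bytes, and a packet may leave once enough credit is available to pay for its length. The sketch below models one such bucket, for example the peak bucket of a single rate shaper, and corrects each packet length by *pkt_length_adjust*. A dual rate shaper would keep a second, committed bucket alongside it, which is not shown. The names `struct token_bucket`, `tb_init()` and `tb_try_send()` are hypothetical, as is the use of nanosecond timestamps; none of them come from this API.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000ull

/* One token bucket: rate in bytes per second, size in bytes (as in
 * rte_scheddev_token_bucket). Credit is stored in byte-nanoseconds
 * (bytes * 1e9) so that fractional refills are not lost to rounding. */
struct token_bucket {
	uint64_t rate;
	uint64_t size;
	uint64_t credit;  /* bytes * NS_PER_SEC */
	uint64_t last_ns; /* time of the last refill */
};

static void tb_init(struct token_bucket *tb, uint64_t rate, uint64_t size,
	uint64_t now_ns)
{
	tb->rate = rate;
	tb->size = size;
	tb->credit = size * NS_PER_SEC; /* start with a full bucket */
	tb->last_ns = now_ns;
}

/* Refill for the time elapsed since the last call, then try to send one
 * packet. Returns 1 and consumes credit when the packet conforms, or 0 when
 * it has to wait. */
static int tb_try_send(struct token_bucket *tb, uint64_t now_ns,
	uint32_t pkt_len, int32_t pkt_length_adjust)
{
	uint64_t cap = tb->size * NS_PER_SEC;
	uint64_t elapsed = now_ns - tb->last_ns;
	int64_t len = (int64_t)pkt_len + pkt_length_adjust;
	uint64_t cost = (uint64_t)(len > 0 ? len : 0) * NS_PER_SEC;

	if (tb->rate != 0 && tb->credit < cap) {
		/* Idle time beyond what fills the bucket adds nothing;
		 * clamping it also keeps elapsed * rate from overflowing. */
		uint64_t fill_ns = (cap - tb->credit) / tb->rate + 1;

		if (elapsed > fill_ns)
			elapsed = fill_ns;
		tb->credit += elapsed * tb->rate;
		if (tb->credit > cap)
			tb->credit = cap;
	}
	tb->last_ns = now_ns;

	if (tb->credit < cost)
		return 0;
	tb->credit -= cost;
	return 1;
}
```

Storing credit in byte-nanoseconds trades range for precision. The product `size * 1e9` must fit in 64 bits, which holds for bucket sizes well below about 18 GB.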
+
+/**
+  * Node parameters
+  *
+  * Each scheduler hierarchy node has multiple inputs (children nodes of the
+  * current parent node) and a single output (which is input to its parent
+  * node). The current node arbitrates its inputs using Strict Priority (SP),
+  * Weighted Fair Queuing (WFQ) and Weighted Round Robin (WRR) algorithms to
+  * schedule input packets on its output while observing its shaping (rate
+  * limiting) constraints.
+  *
+  * Algorithms such as byte-level WRR, Deficit WRR (DWRR), etc. are considered
+  * approximations of ideal WFQ and are therefore treated as WFQ, although
+  * they may involve an implementation-dependent trade-off between accuracy,
+  * performance and resource usage.
+  *
+  * Children nodes with different priorities are scheduled using the SP
+  * algorithm, based on their priority, with zero (0) as the highest priority.
+  * Children with the same priority are scheduled using the WFQ or WRR
+  * algorithm, based on their weight, which is relative to the sum of the
+  * weights of all siblings with the same priority, with one (1) as the lowest
+  * weight.
+  *
+  * Each leaf node sits on top of a TX queue of the current Ethernet port.
+  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
+  * where N is the number of TX queues configured for the current Ethernet port.
+  * The non-leaf nodes have their IDs generated by the application.
+  */
+struct rte_scheddev_node_params {
+	/**< Shaper profile for the private shaper. The absence of the private
+	 * shaper for the current node is indicated by setting this parameter
+	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
+	 */
+	uint32_t shaper_profile_id;
+
+	/**< User allocated array of valid shared shaper IDs. */
+	uint32_t *shared_shaper_id;
+
+	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
+	uint32_t n_shared_shapers;
+
+	union {
+		/**< Parameters only valid for non-leaf nodes. */
+		struct {
+			/**< For each priority, indicates whether the children
+			 * nodes sharing the same priority are to be scheduled
+			 * by WFQ or by WRR. When NULL, it indicates that WFQ
+			 * is to be used for all priorities. When non-NULL, it
+			 * points to a pre-allocated array of *n_priorities*
+			 * elements, with a non-zero value element indicating
+			 * WFQ and a zero value element for WRR.
+			 */
+			int *scheduling_mode_per_priority;
+
+			/**< Number of priorities. */
+			uint32_t n_priorities;
+		} nonleaf;
+
+		/**< Parameters only valid for leaf nodes. */
+		struct {
+			/**< Congestion management mode */
+			enum rte_scheddev_cman_mode cman;
+
+			/**< WRED parameters (valid when *cman* is WRED). */
+			struct {
+				/**< WRED profile for private WRED context. The
+				 * absence of a private WRED context for the
+				 * current leaf node is indicated by the value
+				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
+				 */
+				uint32_t wred_profile_id;
+
+				/**< User-allocated array of shared WRED context
+				 * IDs.
+				 */
+
+				/**< Number of shared WRED context IDs in the
+				 * *shared_wred_context_id* array.
+				 */
+				uint32_t n_shared_wred_contexts;
+			} wred;
+		} leaf;
+	};
+};
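The arbitration rules described before this structure can be sketched as follows. Strict priority applies across priority levels, and weight-proportional sharing applies among siblings of the same priority. The example serves the same-priority group with smooth weighted round robin at packet granularity. It therefore corresponds to the WRR mode rather than byte-accurate WFQ. `struct child` and `sched_pick()` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-child state as seen by the parent node. */
struct child {
	uint32_t priority; /* 0 is the highest priority */
	uint32_t weight;   /* >= 1, relative to same-priority siblings */
	uint32_t backlog;  /* packets waiting in this child */
	int64_t current;   /* smooth WRR running credit, starts at 0 */
};

/* Pick the child to serve next: strict priority first, then smooth WRR
 * among the backlogged siblings of that priority. Returns the child index,
 * or -1 when no child has packets waiting. */
static int sched_pick(struct child *c, size_t n)
{
	uint32_t best_prio = 0;
	int found = 0;
	int64_t total = 0;
	int pick = -1;
	size_t i;

	/* Strict priority: highest priority level with pending packets. */
	for (i = 0; i < n; i++) {
		if (c[i].backlog == 0)
			continue;
		if (!found || c[i].priority < best_prio) {
			best_prio = c[i].priority;
			found = 1;
		}
	}
	if (!found)
		return -1;

	/* Smooth WRR: every eligible child earns its weight, the richest one
	 * is served and pays back the total weight of the group. */
	for (i = 0; i < n; i++) {
		if (c[i].backlog == 0 || c[i].priority != best_prio)
			continue;
		c[i].current += c[i].weight;
		total += c[i].weight;
		if (pick < 0 || c[i].current > c[pick].current)
			pick = (int)i;
	}

	c[pick].current -= total;
	c[pick].backlog--;
	return pick;
}
```

Consider two backlogged children at priority 1 with weights 2 and 1. They are served in the order 1, 2, 1, ... until a priority-0 sibling becomes backlogged, at which point that sibling is served first.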
+
+/**
+  * Node statistics counter type
+  */
+enum rte_scheddev_stats_counter {
+	/**< Number of packets scheduled from current node. */
+	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS = 1 << 0,
+
+	/**< Number of bytes scheduled from current node. */
+	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES = 1 << 1,
+
+	/**< Number of packets dropped by current node. */
+	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS_DROPPED = 1 << 2,
+
+	/**< Number of bytes dropped by current node. */
+	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES_DROPPED = 1 << 3,
+
+	/**< Number of packets currently waiting in the packet queue of current
+	 * leaf node.
+	 */
+	RTE_SCHEDDEV_STATS_COUNTER_N_PKTS_QUEUED = 1 << 4,
+
+	/**< Number of bytes currently waiting in the packet queue of current
+	 * leaf node.
+	 */
+	RTE_SCHEDDEV_STATS_COUNTER_N_BYTES_QUEUED = 1 << 5,
+};
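Each counter type is a single bit, so capability masks and enabled masks combine with plain bitwise operations. This covers the masks reported by `rte_scheddev_stats_get_enabled()` and `rte_scheddev_node_stats_get_enabled()`. The sketch below checks a requested mask against a capability mask before it is passed to an enable function. The bit values are copied from the enum above, while the helper names are hypothetical. This header does not say whether a driver rejects unsupported bits or silently ignores them, so the sketch leaves that policy to the caller.

```c
#include <stdint.h>

/* Bit values copied from enum rte_scheddev_stats_counter above. */
#define STATS_N_PKTS          (1u << 0)
#define STATS_N_BYTES         (1u << 1)
#define STATS_N_PKTS_DROPPED  (1u << 2)
#define STATS_N_BYTES_DROPPED (1u << 3)
#define STATS_N_PKTS_QUEUED   (1u << 4)
#define STATS_N_BYTES_QUEUED  (1u << 5)

/* Counters that were requested but are not in the capability mask
 * (0 means every requested counter is supported). */
static uint64_t stats_unsupported(uint64_t requested, uint64_t capability)
{
	return requested & ~capability;
}

/* Requested counters restricted to those the node actually supports. */
static uint64_t stats_clamp(uint64_t requested, uint64_t capability)
{
	return requested & capability;
}
```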
+
+/**
+  * Node statistics counters
+  */
+struct rte_scheddev_node_stats {
+	/**< Number of packets scheduled from current node. */
+	uint64_t n_pkts;
+
+	/**< Number of bytes scheduled from current node. */
+	uint64_t n_bytes;
+
+	/**< Statistics counters for leaf nodes only. */
+	struct {
+		/**< Number of packets dropped by current leaf node. */
+		uint64_t n_pkts_dropped;
+
+		/**< Number of bytes dropped by current leaf node. */
+		uint64_t n_bytes_dropped;
+
+		/**< Number of packets currently waiting in the packet queue of
+		 * current leaf node.
+		 */
+		uint64_t n_pkts_queued;
+
+		/**< Number of bytes currently waiting in the packet queue of
+		 * current leaf node.
+		 */
+		uint64_t n_bytes_queued;
+	} leaf;
+};
+
+/**
+ * Verbose error types.
+ *
+ * Most of them provide the type of the object referenced by struct
+ * rte_scheddev_error::cause.
+ */
+enum rte_scheddev_error_type {
+	RTE_SCHEDDEV_ERROR_TYPE_NONE, /**< No error. */
+	RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED, /**< Cause unspecified. */
+	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE,
+	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_GREEN,
+	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_YELLOW,
+	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_RED,
+	RTE_SCHEDDEV_ERROR_TYPE_WRED_PROFILE_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_SHARED_WRED_CONTEXT_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_SHAPER_PROFILE,
+	RTE_SCHEDDEV_ERROR_TYPE_SHARED_SHAPER_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_PARENT_NODE_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_PRIORITY,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_WEIGHT,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SCHEDULING_MODE,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SHAPER_PROFILE_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_SHARED_SHAPER_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_CMAN,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_WRED_PROFILE_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_PARAMS_LEAF_SHARED_WRED_CONTEXT_ID,
+	RTE_SCHEDDEV_ERROR_TYPE_NODE_ID,
+};
+
+/**
+ * Verbose error structure definition.
+ *
+ * This object is normally allocated by applications and set by PMDs. The
+ * message points to a constant string which does not need to be freed by
+ * the application; however, its pointer can be considered valid only as long
+ * as its associated DPDK port remains configured. Closing the underlying
+ * device or unloading the PMD invalidates it.
+ *
+ * Both cause and message may be NULL regardless of the error type.
+ */
+struct rte_scheddev_error {
+	enum rte_scheddev_error_type type; /**< Cause field and error type. */
+	const void *cause; /**< Object responsible for the error. */
+	const char *message; /**< Human-readable error message. */
+};
+
+/**
+ * Scheduler capabilities get
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param cap
+ *   Scheduler capabilities. Needs to be pre-allocated and valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_capabilities_get(uint8_t port_id,
+	struct rte_scheddev_capabilities *cap,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node capabilities get
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param cap
+ *   Scheduler node capabilities. Needs to be pre-allocated and valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_capabilities_get(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_node_capabilities *cap,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler WRED profile add
+ *
+ * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
+ * is used to create one or several WRED contexts.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   WRED profile parameters. Needs to be pre-allocated and valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_wred_profile_add(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_wred_params *profile,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler WRED profile delete
+ *
+ * Delete an existing WRED profile. This operation fails when there is currently
+ * at least one user (i.e. WRED context) of this WRED profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_wred_profile_delete(uint8_t port_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shared WRED context add or update
+ *
+ * When *shared_wred_context_id* is not a valid shared WRED context ID, a new
+ * shared WRED context with this ID is created using the WRED profile
+ * identified by *wred_profile_id*.
+ *
+ * When *shared_wred_context_id* is a valid shared WRED context ID, this WRED
+ * context stops using the profile previously assigned to it and is updated
+ * to use the profile identified by *wred_profile_id*.
+ *
+ * A valid shared WRED context can be assigned to several scheduler hierarchy
+ * leaf nodes configured to use WRED as the congestion management mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shared_wred_context_id
+ *   Shared WRED context ID
+ * @param wred_profile_id
+ *   WRED profile ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shared_wred_context_add_update(uint8_t port_id,
+	uint32_t shared_wred_context_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shared WRED context delete
+ *
+ * Delete an existing shared WRED context. This operation fails when there is
+ * currently at least one user (i.e. scheduler hierarchy leaf node) of this
+ * shared WRED context.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shared_wred_context_id
+ *   Shared WRED context ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shared_wred_context_delete(uint8_t port_id,
+	uint32_t shared_wred_context_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shaper profile add
+ *
+ * Create a new shaper profile with ID set to *shaper_profile_id*. The new
+ * shaper profile is used to create one or several shapers.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID for the new profile. Needs to be unused.
+ * @param profile
+ *   Shaper profile parameters. Needs to be pre-allocated and valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shaper_profile_add(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_shaper_params *profile,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shaper profile delete
+ *
+ * Delete an existing shaper profile. This operation fails when there is
+ * currently at least one user (i.e. shaper) of this shaper profile.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shaper_profile_delete(uint8_t port_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shared shaper add or update
+ *
+ * When *shared_shaper_id* is not a valid shared shaper ID, a new shared shaper
+ * with this ID is created using the shaper profile identified by
+ * *shaper_profile_id*.
+ *
+ * When *shared_shaper_id* is a valid shared shaper ID, this shared shaper is no
+ * longer using the shaper profile previously assigned to it and is updated to
+ * use the shaper profile identified by *shaper_profile_id*.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shared_shaper_id
+ *   Shared shaper ID
+ * @param shaper_profile_id
+ *   Shaper profile ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shared_shaper_add_update(uint8_t port_id,
+	uint32_t shared_shaper_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler shared shaper delete
+ *
+ * Delete an existing shared shaper. This operation fails when there is
+ * currently at least one user (i.e. scheduler hierarchy node) of this shared
+ * shaper.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param shared_shaper_id
+ *   Shared shaper ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_shared_shaper_delete(uint8_t port_id,
+	uint32_t shared_shaper_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node add
+ *
+ * When *node_id* is not a valid node ID, a new node with this ID is created and
+ * connected as a child of the existing node identified by *parent_node_id*.
+ *
+ * When *node_id* is a valid node ID, this node is disconnected from its current
+ * parent and connected as a child of another existing node identified by
+ * *parent_node_id*.
+ *
+ * This function can be called during the port initialization phase (before
+ * the Ethernet port is started) to build the scheduler start-up hierarchy.
+ * Subject to the specific Ethernet port supporting on-the-fly scheduler
+ * hierarchy updates, this function can also be called at run-time (after the
+ * Ethernet port is started).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID
+ * @param parent_node_id
+ *   Parent node ID. Needs to be valid.
+ * @param priority
+ *   Node priority. The highest node priority is zero. Used by the SP algorithm
+ *   running on the parent of the current node for scheduling this child node.
+ * @param weight
+ *   Node weight. The node weight is relative to the weight sum of all siblings
+ *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
+ *   algorithm running on the parent of the current node for scheduling this
+ *   child node.
+ * @param params
+ *   Node parameters. Needs to be pre-allocated and valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_add(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_node_params *params,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node delete
+ *
+ * Delete an existing node. This operation fails when this node currently has at
+ * least one user (i.e. child node).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_delete(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node suspend
+ *
+ * Suspend an existing node.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_suspend(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node resume
+ *
+ * Resume an existing node that was previously suspended.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_resume(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler hierarchy set
+ *
+ * This function is called during the port initialization phase (before the
+ * Ethernet port is started) to freeze the scheduler start-up hierarchy.
+ *
+ * This function fails when the currently configured scheduler hierarchy is not
+ * supported by the Ethernet port, in which case the user can abort or try out
+ * another hierarchy configuration (e.g. a hierarchy with fewer leaf nodes),
+ * which can be built from scratch (when *clear_on_fail* is enabled) or by
+ * modifying the existing hierarchy configuration (when *clear_on_fail* is
+ * disabled).
+ *
+ * Note that, even when the configured scheduler hierarchy is supported (so this
+ * function is successful), the Ethernet port start might still fail, e.g. due
+ * to insufficient memory being available in the system.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param clear_on_fail
+ *   On function call failure, hierarchy is cleared when this parameter is
+ *   non-zero and preserved when this parameter is equal to zero.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_hierarchy_set(uint8_t port_id,
+	int clear_on_fail,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node parent update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param parent_node_id
+ *   Node ID for the new parent. Needs to be valid.
+ * @param priority
+ *   Node priority. The highest node priority is zero. Used by the SP algorithm
+ *   running on the parent of the current node for scheduling this child node.
+ * @param weight
+ *   Node weight. The node weight is relative to the weight sum of all siblings
+ *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
+ *   algorithm running on the parent of the current node for scheduling this
+ *   child node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_parent_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node private shaper update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param shaper_profile_id
+ *   Shaper profile ID for the private shaper of the current node. Needs to be
+ *   either valid shaper profile ID or RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE, with
+ *   the latter disabling the private shaper of the current node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_shaper_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node shared shapers update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param shared_shaper_id
+ *   Shared shaper ID. Needs to be valid.
+ * @param add
+ *   Set to non-zero value to add this shared shaper to the current node or to
+ *   zero to delete this shared shaper from the current node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_shared_shaper_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shared_shaper_id,
+	int add,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node scheduling mode update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid non-leaf node ID.
+ * @param scheduling_mode_per_priority
+ *   For each priority, indicates whether the child nodes sharing the same
+ *   priority are to be scheduled by WFQ or by WRR. When NULL, it indicates that
+ *   WFQ is to be used for all priorities. When non-NULL, it points to a
+ *   pre-allocated array of *n_priorities* elements, with a non-zero value
+ *   element indicating WFQ and a zero value element indicating WRR.
+ * @param n_priorities
+ *   Number of priorities.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_scheduling_mode_update(uint8_t port_id,
+	uint32_t node_id,
+	int *scheduling_mode_per_priority,
+	uint32_t n_priorities,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node congestion management mode update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID.
+ * @param cman
+ *   Congestion management mode.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_cman_update(uint8_t port_id,
+	uint32_t node_id,
+	enum rte_scheddev_cman_mode cman,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node private WRED context update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID.
+ * @param wred_profile_id
+ *   WRED profile ID for the private WRED context of the current node. Needs to
+ *   be either valid WRED profile ID or RTE_SCHEDDEV_WRED_PROFILE_ID_NONE, with
+ *   the latter disabling the private WRED context of the current node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_wred_context_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node shared WRED context update
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid leaf node ID.
+ * @param shared_wred_context_id
+ *   Shared WRED context ID. Needs to be valid.
+ * @param add
+ *   Set to non-zero value to add this shared WRED context to the current node
+ *   or to zero to delete this shared WRED context from the current node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_shared_wred_context_update(uint8_t port_id,
+	uint32_t node_id,
+	uint32_t shared_wred_context_id,
+	int add,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler packet marking - VLAN DEI (IEEE 802.1Q)
+ *
+ * IEEE 802.1p maps the traffic class to the VLAN Priority Code Point (PCP)
+ * field (3 bits), while IEEE 802.1Q maps the drop priority to the VLAN Drop
+ * Eligible Indicator (DEI) field (1 bit), which was previously named Canonical
+ * Format Indicator (CFI).
+ *
+ * All VLAN frames of a given color get their DEI bit set if marking is enabled
+ * for this color; otherwise, their DEI bit is left as is (either set or not).
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mark_green
+ *   Set to non-zero value to enable marking of green packets and to zero to
+ *   disable it.
+ * @param mark_yellow
+ *   Set to non-zero value to enable marking of yellow packets and to zero to
+ *   disable it.
+ * @param mark_red
+ *   Set to non-zero value to enable marking of red packets and to zero to
+ *   disable it.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_mark_vlan_dei(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler packet marking - IPv4 / IPv6 ECN (IETF RFC 3168)
+ *
+ * IETF RFCs 2474 and 3168 reorganize the IPv4 Type of Service (TOS) field
+ * (8 bits) and the IPv6 Traffic Class (TC) field (8 bits) into Differentiated
+ * Services Codepoint (DSCP) field (6 bits) and Explicit Congestion Notification
+ * (ECN) field (2 bits). The DSCP field is typically used to encode the traffic
+ * class and/or drop priority (RFC 2597), while the ECN field is used by RFC
+ * 3168 to implement a congestion notification mechanism to be leveraged by
+ * transport layer protocols such as TCP and SCTP that have congestion control
+ * mechanisms.
+ *
+ * When congestion is experienced, as an alternative to dropping the packet,
+ * routers can change the ECN field of input packets from 2'b01 or 2'b10 (values
+ * indicating that the source endpoint is ECN-capable) to 2'b11 (meaning that
+ * congestion is experienced). The destination endpoint can use the ECN-Echo
+ * (ECE) TCP flag to relay the congestion indication back to the source
+ * endpoint, which acknowledges it back to the destination endpoint with the
+ * Congestion Window Reduced (CWR) TCP flag.
+ *
+ * All IPv4/IPv6 packets of a given color with ECN set to 2'b01 or 2'b10
+ * carrying TCP or SCTP have their ECN set to 2'b11 if the marking feature is
+ * enabled for the current color; otherwise, the ECN field is left as is.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mark_green
+ *   Set to non-zero value to enable marking of green packets and to zero to
+ *   disable it.
+ * @param mark_yellow
+ *   Set to non-zero value to enable marking of yellow packets and to zero to
+ *   disable it.
+ * @param mark_red
+ *   Set to non-zero value to enable marking of red packets and to zero to
+ *   disable it.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_mark_ip_ecn(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler packet marking - IPv4 / IPv6 DSCP (IETF RFC 2597)
+ *
+ * IETF RFC 2597 maps the traffic class and the drop priority to the IPv4/IPv6
+ * Differentiated Services Codepoint (DSCP) field (6 bits). Here are the DSCP
+ * values proposed by this RFC:
+ *
+ *                       Class 1    Class 2    Class 3    Class 4
+ *                     +----------+----------+----------+----------+
+ *    Low Drop Prec    |  001010  |  010010  |  011010  |  100010  |
+ *    Medium Drop Prec |  001100  |  010100  |  011100  |  100100  |
+ *    High Drop Prec   |  001110  |  010110  |  011110  |  100110  |
+ *                     +----------+----------+----------+----------+
+ *
+ * There are 4 traffic classes (classes 1 .. 4) encoded by DSCP bits 0 to 2, as
+ * well as 3 drop priorities (low/medium/high) encoded by DSCP bits 3 and 4,
+ * with bit 0 being the most significant (leftmost) DSCP bit.
+ *
+ * All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as
+ * follows: green mapped to Low Drop Precedence (2'b01), yellow to Medium
+ * (2'b10) and red to High (2'b11). Marking needs to be explicitly enabled
+ * for each color; when not enabled for a given color, the DSCP field of all
+ * packets with that color is left as is.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param mark_green
+ *   Set to non-zero value to enable marking of green packets and to zero to
+ *   disable it.
+ * @param mark_yellow
+ *   Set to non-zero value to enable marking of yellow packets and to zero to
+ *   disable it.
+ * @param mark_red
+ *   Set to non-zero value to enable marking of red packets and to zero to
+ *   disable it.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_mark_ip_dscp(uint8_t port_id,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler get statistics counter types enabled for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all non-leaf nodes. Needs
+ *   to be pre-allocated.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled per node for each non-leaf node.
+ *   This is a subset of *nonleaf_node_capability_stats_mask*. Needs to be
+ *   pre-allocated.
+ * @param leaf_node_capability_stats_mask
+ *   Statistics counter types available per node for all leaf nodes. Needs to
+ *   be pre-allocated.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types currently enabled for each leaf node. This is
+ *   a subset of *leaf_node_capability_stats_mask*. Needs to be pre-allocated.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_stats_get_enabled(uint8_t port_id,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler enable selected statistics counters for all nodes
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param nonleaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each non-leaf node.
+ *   This needs to be a subset of the statistics counter types available per
+ *   node for all non-leaf nodes. Any statistics counter type not included in
+ *   this set is to be disabled for all non-leaf nodes.
+ * @param leaf_node_enabled_stats_mask
+ *   Statistics counter types to be enabled per node for each leaf node. This
+ *   needs to be a subset of the statistics counter types available per node for
+ *   all leaf nodes. Any statistics counter type not included in this set is to
+ *   be disabled for all leaf nodes.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_stats_enable(uint8_t port_id,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler get statistics counter types enabled for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param capability_stats_mask
+ *   Statistics counter types available for the current node. Needs to be
+ *   pre-allocated.
+ * @param enabled_stats_mask
+ *   Statistics counter types currently enabled for the current node. This is
+ *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_stats_get_enabled(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler enable selected statistics counters for current node
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param enabled_stats_mask
+ *   Statistics counter types to be enabled for the current node. This needs to
+ *   be a subset of the statistics counter types available for the current node.
+ *   Any statistics counter type not included in this set is to be disabled for
+ *   the current node.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_stats_enable(uint8_t port_id,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask,
+	struct rte_scheddev_error *error);
+
+/**
+ * Scheduler node statistics counters read
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param node_id
+ *   Node ID. Needs to be valid.
+ * @param stats
+ *   When non-NULL, it is filled in with the current values of the statistics
+ *   counters enabled for the current node.
+ * @param clear
+ *   When this parameter has a non-zero value, the statistics counters are
+ *   cleared (i.e. set to zero) immediately after they have been read, otherwise
+ *   the statistics counters are left untouched.
+ * @param error
+ *   Error details. Filled in only on error, when not NULL.
+ * @return
+ *   0 on success, non-zero error code otherwise.
+ */
+int rte_scheddev_node_stats_read(uint8_t port_id,
+	uint32_t node_id,
+	struct rte_scheddev_node_stats *stats,
+	int clear,
+	struct rte_scheddev_error *error);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* __INCLUDE_RTE_SCHEDDEV_H__ */
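As a rough illustration of the *priority* and *weight* parameters of rte_scheddev_node_add() and rte_scheddev_node_parent_update(), the sketch below computes the idealized long-run bandwidth split a parent node would give its children. Strict priority picks the numerically lowest priority level that currently has backlog, and the children at that level share the bandwidth in proportion to their weights, which WFQ and WRR approximate. The demo_child type, the backlog flag and the parts-per-million output are assumptions of this sketch, not part of the API; a real device schedules packet by packet and only approximates these ratios.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical child descriptor: the priority/weight pair that would be
 * passed to rte_scheddev_node_add(), plus a backlog flag for one snapshot. */
struct demo_child {
	uint32_t priority; /* 0 = highest, served first by strict priority */
	uint32_t weight;   /* >= 1, relative only to same-priority siblings */
	int backlogged;    /* non-zero when the child has packets queued */
};

/* Fill share_ppm[i] with child i's share of the parent's rate, in parts per
 * million, under an idealized SP + weighted-sharing model: only the
 * highest-priority (numerically lowest) backlogged level is served, and its
 * members split the rate in proportion to their weights.
 * Returns the served priority level, or UINT32_MAX when nothing is
 * backlogged (priorities are assumed to be small integers). */
static uint32_t demo_sp_wfq_shares(const struct demo_child *c, size_t n,
				   uint32_t *share_ppm)
{
	uint32_t top = UINT32_MAX;
	uint64_t wsum = 0;
	size_t i;

	/* Strict priority: find the best level with pending traffic. */
	for (i = 0; i < n; i++)
		if (c[i].backlogged && c[i].priority < top)
			top = c[i].priority;

	/* Sum the weights of the siblings competing at that level. */
	for (i = 0; i < n; i++)
		if (c[i].backlogged && c[i].priority == top)
			wsum += c[i].weight;

	/* Weighted split inside the served level; everyone else gets 0. */
	for (i = 0; i < n; i++)
		share_ppm[i] = (c[i].backlogged && c[i].priority == top &&
				wsum != 0) ?
			(uint32_t)((uint64_t)c[i].weight * 1000000u / wsum) : 0;

	return top;
}
```

For instance, if the priority-0 child is idle and two priority-1 children have weights 1 and 3, they receive 25% and 75% of the parent's rate, while a backlogged priority-2 child gets nothing until both priority-1 children drain.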
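The per-packet rewrite described in the rte_scheddev_mark_ip_dscp() and rte_scheddev_mark_ip_ecn() comments can be modelled in a few lines. This is a minimal, self-contained sketch that operates on a single IPv4 TOS / IPv6 Traffic Class byte (DSCP in the upper 6 bits, ECN in the lower 2 bits). The demo_* names, the color enum and the per-color flag array are hypothetical stand-ins for the mark_green / mark_yellow / mark_red arguments; on a real port this rewrite would be done by the device, not by application code.

```c
#include <stdint.h>

/* Hypothetical color codes for this sketch (not part of rte_scheddev.h). */
enum demo_color { DEMO_GREEN = 0, DEMO_YELLOW = 1, DEMO_RED = 2 };

/* IANA protocol numbers of the transports covered by RFC 3168 marking. */
#define DEMO_PROTO_TCP  6
#define DEMO_PROTO_SCTP 132

/* TOS / Traffic Class byte layout: DSCP in bits 7..2, ECN in bits 1..0. */
#define DEMO_ECN_MASK        0x03u
#define DEMO_ECN_CE          0x03u /* 2'b11: congestion experienced */
#define DEMO_DROP_PREC_MASK  0x18u /* DSCP bits 3 and 4 (DSCP MSB = bit 0) */
#define DEMO_DROP_PREC_SHIFT 3

/* Per-color enable flags, indexed by enum demo_color; non-zero = mark. */
struct demo_mark_cfg {
	int enabled[3];
};

/* RFC 2597 drop precedence per color: green 01, yellow 10, red 11. */
static const uint8_t demo_drop_prec[3] = { 0x1, 0x2, 0x3 };

/* DSCP marking: overwrite only the drop precedence bits, keep the class. */
static uint8_t demo_mark_dscp(uint8_t tos, enum demo_color c,
			      const struct demo_mark_cfg *cfg)
{
	if (!cfg->enabled[c])
		return tos; /* marking disabled for this color: leave as is */
	return (uint8_t)((tos & ~DEMO_DROP_PREC_MASK) |
			 ((unsigned)demo_drop_prec[c] << DEMO_DROP_PREC_SHIFT));
}

/* ECN marking: 2'b01 / 2'b10 become 2'b11, for TCP/SCTP and enabled colors. */
static uint8_t demo_mark_ecn(uint8_t tos, uint8_t l4_proto, enum demo_color c,
			     const struct demo_mark_cfg *cfg)
{
	unsigned ecn = tos & DEMO_ECN_MASK;

	if (!cfg->enabled[c])
		return tos;
	if (l4_proto != DEMO_PROTO_TCP && l4_proto != DEMO_PROTO_SCTP)
		return tos;
	if (ecn != 0x1u && ecn != 0x2u)
		return tos; /* not ECN-capable (00) or already CE (11) */
	return (uint8_t)(tos | DEMO_ECN_CE);
}
```

For example, with DSCP marking enabled for all colors, an AF11 packet (DSCP 001010, TOS byte 0x28) colored red comes out as AF13 (TOS byte 0x38): the class bits are untouched and only the drop precedence changes.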
diff --git a/lib/librte_ether/rte_scheddev_driver.h b/lib/librte_ether/rte_scheddev_driver.h
new file mode 100644
index 0000000..c0a0321
--- /dev/null
+++ b/lib/librte_ether/rte_scheddev_driver.h
@@ -0,0 +1,374 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef __INCLUDE_RTE_SCHEDDEV_DRIVER_H__
+#define __INCLUDE_RTE_SCHEDDEV_DRIVER_H__
+
+/**
+ * @file
+ * RTE Generic Hierarchical Scheduler API (Driver Side)
+ *
+ * This file provides implementation helpers for internal use by PMDs; they
+ * are not intended to be exposed to applications and are not subject to ABI
+ * versioning.
+ */
+
+#include <stdint.h>
+
+#include <rte_errno.h>
+#include "rte_ethdev.h"
+#include "rte_scheddev.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+typedef int (*rte_scheddev_capabilities_get_t)(struct rte_eth_dev *dev,
+	struct rte_scheddev_capabilities *cap,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler capabilities get */
+
+typedef int (*rte_scheddev_node_capabilities_get_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_scheddev_node_capabilities *cap,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node capabilities get */
+
+typedef int (*rte_scheddev_wred_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_wred_params *profile,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler WRED profile add */
+
+typedef int (*rte_scheddev_wred_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler WRED profile delete */
+
+typedef int (*rte_scheddev_shared_wred_context_add_update_t)(
+	struct rte_eth_dev *dev,
+	uint32_t shared_wred_context_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shared WRED context add/update */
+
+typedef int (*rte_scheddev_shared_wred_context_delete_t)(
+	struct rte_eth_dev *dev,
+	uint32_t shared_wred_context_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shared WRED context delete */
+
+typedef int (*rte_scheddev_shaper_profile_add_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_shaper_params *profile,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shaper profile add */
+
+typedef int (*rte_scheddev_shaper_profile_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shaper profile delete */
+
+typedef int (*rte_scheddev_shared_shaper_add_update_t)(struct rte_eth_dev *dev,
+	uint32_t shared_shaper_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shared shaper add/update */
+
+typedef int (*rte_scheddev_shared_shaper_delete_t)(struct rte_eth_dev *dev,
+	uint32_t shared_shaper_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler shared shaper delete */
+
+typedef int (*rte_scheddev_node_add_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_node_params *params,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node add */
+
+typedef int (*rte_scheddev_node_delete_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node delete */
+
+typedef int (*rte_scheddev_node_suspend_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node suspend */
+
+typedef int (*rte_scheddev_node_resume_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node resume */
+
+typedef int (*rte_scheddev_hierarchy_set_t)(struct rte_eth_dev *dev,
+	int clear_on_fail,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler hierarchy set */
+
+typedef int (*rte_scheddev_node_parent_update_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t parent_node_id,
+	uint32_t priority,
+	uint32_t weight,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node parent update */
+
+typedef int (*rte_scheddev_node_shaper_update_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t shaper_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node shaper update */
+
+typedef int (*rte_scheddev_node_shared_shaper_update_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t shared_shaper_id,
+	int add,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node shared shaper update */
+
+typedef int (*rte_scheddev_node_scheduling_mode_update_t)(
+	struct rte_eth_dev *dev,
+	uint32_t node_id,
+	int *scheduling_mode_per_priority,
+	uint32_t n_priorities,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node scheduling mode update */
+
+typedef int (*rte_scheddev_node_cman_update_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	enum rte_scheddev_cman_mode cman,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node congestion management mode update */
+
+typedef int (*rte_scheddev_node_wred_context_update_t)(
+	struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t wred_profile_id,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node WRED context update */
+
+typedef int (*rte_scheddev_node_shared_wred_context_update_t)(
+	struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint32_t shared_wred_context_id,
+	int add,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler node shared WRED context update */
+
+typedef int (*rte_scheddev_mark_vlan_dei_t)(struct rte_eth_dev *dev,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler packet marking - VLAN DEI */
+
+typedef int (*rte_scheddev_mark_ip_ecn_t)(struct rte_eth_dev *dev,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler packet marking - IPv4/IPv6 ECN */
+
+typedef int (*rte_scheddev_mark_ip_dscp_t)(struct rte_eth_dev *dev,
+	int mark_green,
+	int mark_yellow,
+	int mark_red,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler packet marking - IPv4/IPv6 DSCP */
+
+typedef int (*rte_scheddev_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint64_t *nonleaf_node_capability_stats_mask,
+	uint64_t *nonleaf_node_enabled_stats_mask,
+	uint64_t *leaf_node_capability_stats_mask,
+	uint64_t *leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler get set of stats counters enabled for all nodes */
+
+typedef int (*rte_scheddev_stats_enable_t)(struct rte_eth_dev *dev,
+	uint64_t nonleaf_node_enabled_stats_mask,
+	uint64_t leaf_node_enabled_stats_mask,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler enable selected stats counters for all nodes */
+
+typedef int (*rte_scheddev_node_stats_get_enabled_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t *capability_stats_mask,
+	uint64_t *enabled_stats_mask,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler get set of stats counters enabled for specific node */
+
+typedef int (*rte_scheddev_node_stats_enable_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	uint64_t enabled_stats_mask,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler enable selected stats counters for specific node */
+
+typedef int (*rte_scheddev_node_stats_read_t)(struct rte_eth_dev *dev,
+	uint32_t node_id,
+	struct rte_scheddev_node_stats *stats,
+	int clear,
+	struct rte_scheddev_error *error);
+/**< @internal Scheduler read stats counters for specific node */
+
+struct rte_scheddev_ops {
+	/** Scheduler capabilities get */
+	rte_scheddev_capabilities_get_t capabilities_get;
+	/** Scheduler node capabilities get */
+	rte_scheddev_node_capabilities_get_t node_capabilities_get;
+
+	/** Scheduler WRED profile add */
+	rte_scheddev_wred_profile_add_t wred_profile_add;
+	/** Scheduler WRED profile delete */
+	rte_scheddev_wred_profile_delete_t wred_profile_delete;
+	/** Scheduler shared WRED context add/update */
+	rte_scheddev_shared_wred_context_add_update_t
+		shared_wred_context_add_update;
+	/** Scheduler shared WRED context delete */
+	rte_scheddev_shared_wred_context_delete_t
+		shared_wred_context_delete;
+	/** Scheduler shaper profile add */
+	rte_scheddev_shaper_profile_add_t shaper_profile_add;
+	/** Scheduler shaper profile delete */
+	rte_scheddev_shaper_profile_delete_t shaper_profile_delete;
+	/** Scheduler shared shaper add/update */
+	rte_scheddev_shared_shaper_add_update_t shared_shaper_add_update;
+	/** Scheduler shared shaper delete */
+	rte_scheddev_shared_shaper_delete_t shared_shaper_delete;
+
+	/** Scheduler node add */
+	rte_scheddev_node_add_t node_add;
+	/** Scheduler node delete */
+	rte_scheddev_node_delete_t node_delete;
+	/** Scheduler node suspend */
+	rte_scheddev_node_suspend_t node_suspend;
+	/** Scheduler node resume */
+	rte_scheddev_node_resume_t node_resume;
+	/** Scheduler hierarchy set */
+	rte_scheddev_hierarchy_set_t hierarchy_set;
+
+	/** Scheduler node parent update */
+	rte_scheddev_node_parent_update_t node_parent_update;
+	/** Scheduler node shaper update */
+	rte_scheddev_node_shaper_update_t node_shaper_update;
+	/** Scheduler node shared shaper update */
+	rte_scheddev_node_shared_shaper_update_t node_shared_shaper_update;
+	/** Scheduler node scheduling mode update */
+	rte_scheddev_node_scheduling_mode_update_t node_scheduling_mode_update;
+	/** Scheduler node congestion management mode update */
+	rte_scheddev_node_cman_update_t node_cman_update;
+	/** Scheduler node WRED context update */
+	rte_scheddev_node_wred_context_update_t node_wred_context_update;
+	/** Scheduler node shared WRED context update */
+	rte_scheddev_node_shared_wred_context_update_t
+		node_shared_wred_context_update;
+
+	/** Scheduler packet marking - VLAN DEI */
+	rte_scheddev_mark_vlan_dei_t mark_vlan_dei;
+	/** Scheduler packet marking - IPv4/IPv6 ECN */
+	rte_scheddev_mark_ip_ecn_t mark_ip_ecn;
+	/** Scheduler packet marking - IPv4/IPv6 DSCP */
+	rte_scheddev_mark_ip_dscp_t mark_ip_dscp;
+
+	/** Scheduler get statistics counter types enabled for all nodes */
+	rte_scheddev_stats_get_enabled_t stats_get_enabled;
+	/** Scheduler enable selected statistics counters for all nodes */
+	rte_scheddev_stats_enable_t stats_enable;
+	/** Scheduler get statistics counter types enabled for current node */
+	rte_scheddev_node_stats_get_enabled_t node_stats_get_enabled;
+	/** Scheduler enable selected statistics counters for current node */
+	rte_scheddev_node_stats_enable_t node_stats_enable;
+	/** Scheduler read statistics counters for current node */
+	rte_scheddev_node_stats_read_t node_stats_read;
+};
+
+/**
+ * Initialize generic error structure.
+ *
+ * This function also sets rte_errno to a given value.
+ *
+ * @param error
+ *   Pointer to error structure (may be NULL).
+ * @param code
+ *   Related error code (rte_errno).
+ * @param type
+ *   Cause field and error type.
+ * @param cause
+ *   Object responsible for the error.
+ * @param message
+ *   Human-readable error message.
+ *
+ * @return
+ *   Error code.
+ */
+static inline int
+rte_scheddev_error_set(struct rte_scheddev_error *error,
+		   int code,
+		   enum rte_scheddev_error_type type,
+		   const void *cause,
+		   const char *message)
+{
+	if (error) {
+		*error = (struct rte_scheddev_error){
+			.type = type,
+			.cause = cause,
+			.message = message,
+		};
+	}
+	rte_errno = code;
+	return code;
+}
+
+/**
+ * Get generic hierarchical scheduler operations structure from a port
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param error
+ *   Error details
+ *
+ * @return
+ *   The hierarchical scheduler operations structure associated with port_id on
+ *   success, NULL otherwise.
+ */
+const struct rte_scheddev_ops *
+rte_scheddev_ops_get(uint8_t port_id, struct rte_scheddev_error *error);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* __INCLUDE_RTE_SCHEDDEV_DRIVER_H__ */
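The driver header suggests that each public rte_scheddev_*() call looks up the port's operations table (rte_scheddev_ops_get()) and forwards to the matching callback, with rte_scheddev_error_set() used to report failures. The wrapper bodies are not part of this header, so the following is only a self-contained sketch under stated assumptions: the demo_* types are hypothetical stand-ins for rte_eth_dev, rte_scheddev_ops and rte_scheddev_error, demo_errno stands in for rte_errno, and the negative-errno return convention (ENODEV for a port without scheduler ops, ENOSYS for an unimplemented callback) is modelled on rte_flow rather than specified by this patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the error types declared in rte_scheddev.h. */
enum demo_error_type { DEMO_ERROR_TYPE_NONE, DEMO_ERROR_TYPE_UNSPECIFIED };

struct demo_error {
	enum demo_error_type type;
	const void *cause;
	const char *message;
};

static int demo_errno; /* stand-in for rte_errno */

/* Same shape as rte_scheddev_error_set(): fill *error when non-NULL, set the
 * errno-like variable and return the (positive) error code. */
static int demo_error_set(struct demo_error *error, int code,
			  enum demo_error_type type, const void *cause,
			  const char *message)
{
	if (error) {
		*error = (struct demo_error){
			.type = type, .cause = cause, .message = message,
		};
	}
	demo_errno = code;
	return code;
}

struct demo_dev;

typedef int (*demo_node_suspend_t)(struct demo_dev *dev, uint32_t node_id,
				   struct demo_error *error);

/* One-entry ops table; a PMD leaves a field NULL when it lacks support. */
struct demo_ops {
	demo_node_suspend_t node_suspend;
};

struct demo_dev {
	const struct demo_ops *ops; /* NULL: port has no scheduler support */
};

/* Generic wrapper: validate the ops table, reject missing callbacks, then
 * forward the call unchanged to the driver. */
static int demo_node_suspend(struct demo_dev *dev, uint32_t node_id,
			     struct demo_error *error)
{
	if (dev == NULL || dev->ops == NULL)
		return -demo_error_set(error, ENODEV,
				       DEMO_ERROR_TYPE_UNSPECIFIED, NULL,
				       "no scheduler ops for this port");
	if (dev->ops->node_suspend == NULL)
		return -demo_error_set(error, ENOSYS,
				       DEMO_ERROR_TYPE_UNSPECIFIED, NULL,
				       "function not supported");
	return dev->ops->node_suspend(dev, node_id, error);
}

/* Example PMD callback that (arbitrarily) only knows node IDs 0..7. */
static int demo_pmd_node_suspend(struct demo_dev *dev, uint32_t node_id,
				 struct demo_error *error)
{
	(void)dev;
	if (node_id >= 8)
		return -demo_error_set(error, EINVAL,
				       DEMO_ERROR_TYPE_UNSPECIFIED, NULL,
				       "invalid node ID");
	return 0;
}
```

With this layout, a PMD that does not support, for example, node suspend simply leaves that field NULL in its ops table, and the generic layer reports a clean error instead of calling through a NULL pointer.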