[dpdk-dev,RFC] ethdev: abstraction layer for QoS hierarchical scheduler

From: Dumitrescu, Cristian [mailto:cristian.dumitrescu@intel.com]

  This RFC proposes an ethdev-based abstraction layer for Quality of Service (QoS)
hierarchical scheduler. The goal of the abstraction layer is to provide a simple
generic API that is agnostic of the underlying HW, SW or mixed HW-SW complex
implementation.

Q1: What is the benefit for having an abstraction layer for QoS hierarchical
layer?
A1: There is growing interest in the industry for handling various HW-based,
SW-based or mixed hierarchical scheduler implementations using a unified DPDK
API.

Q2: Which devices are targeted by this abstraction layer?
A2: All current and future devices that expose a hierarchical scheduler feature
under DPDK, including NICs, FPGAs, ASICs, SOCs, SW libraries.

Q3: Which scheduler hierarchies are supported by the API?
A3: Hopefully any scheduler hierarchy can be described and covered by the
current API. Of course, functional correctness, accuracy and performance levels
depend on the specific implementations of this API.

Q4: Why have this abstraction layer into ethdev as opposed to a new type of
device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
A4: Packets are sent to the Ethernet device using the ethdev API
rte_eth_tx_burst() function, with the hierarchical scheduling taking place
automatically (i.e. no SW intervention) in HW implementations. Basically, the
hierarchical scheduler is done as part of packet TX operation.
The hierarchical scheduler is typically the last stage before packet TX and it
is tightly integrated with the TX stage. The hierarchical scheduler is just
another offload feature of the Ethernet device, which needs to be accommodated
by the ethdev API similar to any other offload feature (such as RSS, DCB,
flow director, etc).
Once the decision to schedule a specific packet has been taken, this packet
cannot be dropped and it has to be sent over the wire as is, otherwise what
takes place on the wire is not what was planned at scheduling time, so the
scheduling is not accurate (Note: there are some devices which allow prepending
headers to the packet after the scheduling stage at the expense of sending
correction requests back to the scheduler, but this only strengthens the bond
between scheduling and TX).

Q5: Given that the packet scheduling takes place automatically for pure HW
implementations, how does packet scheduling take place for poll-mode SW
implementations?
A5: The API provided function rte_sched_run() is designed to take care of this.
For HW implementations, this function typically does nothing. For SW
implementations, this function is typically expected to perform dequeue of
packets from the hierarchical scheduler and their write to Ethernet device TX
queue, periodic flush of any buffers on enqueue-side into the hierarchical
scheduler for burst-oriented implementations, etc.

Q6: Which are the scheduling algorithms supported?
A6: The fundamental scheduling algorithms that are supported are Strict Priority
(SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
the level of each node of the scheduling hierarchy, regardless of the node
level/position in the tree. The SP algorithm is used to schedule between sibling
nodes with different priority, while WFQ is used to schedule between groups of
siblings that have the same priority.
Algorithms such as Weighed Round Robin (WRR), byte-level WRR, Deficit WRR
(DWRR), etc are considered approximations of the ideal WFQ and are therefore
assimilated to WFQ, although an associated implementation-dependent accuracy,
performance and resource usage trade-off might exist.

Q7: Which are the supported congestion management algorithms?
A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
available for every leaf node in the hierarchy, subject to the specific
implementation supporting them.

Q8: Is traffic shaping supported?
A8: Yes, there are a number of shapers (rate limiters) that can be supported for
each node in the hierarchy (built-in limit is currently set to 4 per node). Each
shaper can be private to a node (used only by that node) or shared between
multiple nodes.

Q9: What is the purpose of having shaper profiles and WRED profiles?
A9: In most implementations, many shapers typically share the same configuration
parameters, so defining shaper profiles simplifies the configuration task. Same
considerations apply to WRED contexts and profiles.

Q10: How is the scheduling hierarchy defined and created?
A10: Scheduler hierarchy tree is set up by creating new nodes and connecting
them to other existing nodes, which thus become parent nodes. The unique ID that
is assigned to each node when the node is created is further used to update the
node configuration or to connect children nodes to it. The leaf nodes of the
scheduler hierarchy are each attached to one of the Ethernet device TX queues.

Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
A11: Yes. The actual changes take place subject to the specific implementation
supporting them, otherwise error code is returned.

Q12: What is the typical function call sequence to set up and run the Ethernet
device scheduler?
A12: The typical simplified function call sequence is listed below:
i) Configure the Ethernet device and its TX queues: rte_eth_dev_configure(),
rte_eth_tx_queue_setup()
ii) Create WRED profiles and WRED contexts, shaper profiles and shapers:
rte_eth_sched_wred_profile_add(), rte_eth_sched_wred_context_add(),
rte_eth_sched_shaper_profile_add(), rte_eth_sched_shaper_add()
iii) Create the scheduler hierarchy nodes and tree: rte_eth_sched_node_add()
iv) Freeze the start-up hierarchy and ask the device whether it supports it:
rte_eth_sched_node_add()
v) Start the Ethernet port: rte_eth_dev_start()
vi) Run-time scheduler hierarchy updates: rte_eth_sched_node_add(),
rte_eth_sched_node_<attribute>_set()
vii) Run-time packet enqueue into the hierarchical scheduler: rte_eth_tx_burst()
viii) Run-time support for SW poll-mode implementations (see previous answer):
rte_sched_run()

Q13: Which are the possible options for the user when the Ethernet port does not
support the scheduling hierarchy required by the user?
A13: The following options are available to the user:
i) abort
ii) try out a new hierarchy (e.g. with less leaf nodes), if acceptable
iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
front-end implementing the hierarchical scheduler (e.g. existing DPDK library
librte_sched); instantiate the new device type on-the-fly and check if the
hierarchy requirements can be met by the new device.

Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
---
 lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 794 insertions(+)
 mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h

Message ID	1480529810-95280-1-git-send-email-cristian.dumitrescu@intel.com (mailing list archive)
State	Superseded, archived
Delegated to:	Thomas Monjalon
Headers	Return-Path: <dev-bounces@dpdk.org> X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [IPv6:::1]) by dpdk.org (Postfix) with ESMTP id B90C15591; Wed, 30 Nov 2016 19:16:56 +0100 (CET) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 40127558B for <dev@dpdk.org>; Wed, 30 Nov 2016 19:16:53 +0100 (CET) Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP; 30 Nov 2016 10:16:52 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos; i="5.31,574,1473145200"; d="scan'208"; a="1066267047" Received: from sie-lab-214-193.ir.intel.com (HELO silpixa00382659.ir.intel.com) ([10.237.214.193]) by orsmga001.jf.intel.com with ESMTP; 30 Nov 2016 10:16:51 -0800 From: Cristian Dumitrescu <cristian.dumitrescu@intel.com> To: dev@dpdk.org Cc: cristian.dumitrescu@intel.com Date: Wed, 30 Nov 2016 18:16:50 +0000 Message-Id: <1480529810-95280-1-git-send-email-cristian.dumitrescu@intel.com> X-Mailer: git-send-email 2.5.0 Subject: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK <dev.dpdk.org> List-Unsubscribe: <http://dpdk.org/ml/options/dev>, <mailto:dev-request@dpdk.org?subject=unsubscribe> List-Archive: <http://dpdk.org/ml/archives/dev/> List-Post: <mailto:dev@dpdk.org> List-Help: <mailto:dev-request@dpdk.org?subject=help> List-Subscribe: <http://dpdk.org/ml/listinfo/dev>, <mailto:dev-request@dpdk.org?subject=subscribe> Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>

[dpdk-dev,RFC] ethdev: abstraction layer for QoS hierarchical scheduler

Checks

Commit Message

Comments

Patch