Show a cover letter.

GET /api/covers/42891/ HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json
Vary: Accept

{
    "id": 42891,
    "url": "",
    "web_url": "",
    "project": {
        "id": 1,
        "url": "",
        "name": "DPDK",
        "link_name": "dpdk",
        "list_id": "",
        "list_email": "",
        "web_url": "",
        "scm_url": "git://",
        "webscm_url": "",
        "list_archive_url": "",
        "list_archive_url_format": "{}",
        "commit_url_format": ""
    },
    "msgid": "<>",
    "list_archive_url": "",
    "date": "2018-07-11T21:21:53",
    "name": "[RFC,0/1] A Distributed Software Event Device",
    "submitter": {
        "id": 1077,
        "url": "",
        "name": "Mattias Rönnblom",
        "email": ""
    },
    "mbox": "",
    "series": [
        {
            "id": 532,
            "url": "",
            "web_url": "",
            "date": "2018-07-11T21:21:53",
            "name": "A Distributed Software Event Device",
            "version": 1,
            "mbox": ""
        }
    ],
    "comments": "",
    "headers": {
        "Subject": "[dpdk-dev] [RFC 0/1] A Distributed Software Event Device",
        "List-Post": "<>",
        "X-Mailer": "git-send-email 2.17.1",
        "X-Original-To": "",
        "Cc": ",,\n\t=?utf-8?q?Mattias_R=C3=B6nnblom?= <>",
        "X-Spam-Score": "0.7",
        "X-Spam-Checker-Version": "SpamAssassin 3.4.1 (2015-04-28) on\n\t",
        "Return-Path": "<>",
        "X-BeenThere": "",
        "List-Archive": "<>",
        "Date": "Wed, 11 Jul 2018 23:21:53 +0200",
        "X-Virus-Scanned": "ClamAV using ClamSMTP",
        "Sender": "\"dev\" <>",
        "Message-Id": "<>",
        "X-Mailman-Version": "2.1.15",
        "Errors-To": "",
        "List-Subscribe": "<>,\n\t<>",
        "Delivered-To": "",
        "Precedence": "list",
        "Received": [
            "from [] (localhost [])\n\tby (Postfix) with ESMTP id 1B6FE1B43F;\n\tWed, 11 Jul 2018 23:22:08 +0200 (CEST)",
            "from ( [])\n\tby (Postfix) with ESMTP id 0A8B21B434\n\tfor <>; Wed, 11 Jul 2018 23:22:05 +0200 (CEST)",
            "from (localhost [])\n\tby (Postfix) with ESMTP id 84A094001A\n\tfor <>; Wed, 11 Jul 2018 23:22:05 +0200 (CEST)",
            "by (Postfix, from userid 1004)\n\tid 712D940017; Wed, 11 Jul 2018 23:22:05 +0200 (CEST)",
            "from\n\t( [])\n\t(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256\n\tbits)) (No client certificate requested)\n\tby (Postfix) with ESMTPSA id 269E84000F;\n\tWed, 11 Jul 2018 23:22:03 +0200 (CEST)"
        ],
        "X-Spam-Status": "No, score=0.7 required=5.0 tests=ALL_TRUSTED,URIBL_BLACK\n\tautolearn=disabled version=3.4.1",
        "MIME-Version": "1.0",
        "List-Id": "DPDK patches and discussions <>",
        "To": "",
        "List-Unsubscribe": "<>,\n\t<>",
        "From": "=?utf-8?q?Mattias_R=C3=B6nnblom?= <>",
        "Content-Type": "text/plain; charset=UTF-8",
        "X-Spam-Level": "",
        "List-Help": "<>",
        "Content-Transfer-Encoding": "8bit"
    },
    "content": "This is the Distributed Software (DSW) event device, which distributes\nthe task of scheduling events among all the eventdev ports and their\nlcore threads.\n\nDSW is primarily designed for atomic-only queues, but also supports\nsingle-link and parallel queues.\n\n(DSW would be more accurately described as 'parallel', but since that\nterm is used to describe an eventdev queue type, it's referred to as\n'distributed', to avoid suggesting it's somehow biased toward parallel\nqueues.)\n\nEvent Scheduling\n================\n\nInternally, DSW hashes an eventdev flow id to a 15-bit \"flow\nhash\". For each eventdev queue, there's a table mapping a flow hash to\nan eventdev port. That port is considered the owner of the\nflow. Owners are randomly picked at initialization time, among the\nports serving (i.e. linked to) that queue.\n\nThe scheduling of an event to a port is done (by the sender port) at\nthe time of the enqueue operation, and in most cases simply consists of\nhashing the flow id and performing a lookup in the destination queue's\ntable. Each port has an MP/SC event ring to which the events are\nenqueued. This means events go directly port-to-port, typically\nmeaning core-to-core.\n\nPort Load Measurement\n=====================\n\nDSW includes a concept of port load. The event device keeps track of\ntransitions between \"idle\" and \"busy\" (or vice versa) on a per-port\nbasis, compares this to the wall time passed, and computes to what\nextent the port was busy (for a certain interval). A port transitions\nto \"busy\" on a non-zero dequeue, and back to \"idle\" at the point\nit performs a dequeue operation returning zero events.\n\nFlow Migration\n==============\n\nPeriodically, hidden from the API user and as a part of normal\nenqueue/dequeue operations, a port updates its load estimate, and in\ncase the load has reached a certain threshold, considers moving one of\nits flows to a different, more lightly loaded, port. 
This process is\ncalled migration.\n\nMigration Strategy\n------------------\n\nThe DSW migration strategy is to move a small but still active flow. To\nquickly find the active flows (without resorting to scanning\nthrough the tables and/or keeping per-event counters), each port\nmaintains a list of the last 128 events it has dequeued. If there are\nsufficiently lightly-loaded target ports, it will attempt to migrate one of\nthose flows, starting with the smallest. The size is estimated by the\nnumber of events seen on that flow, in that small sample of events.\n\nA good migration strategy, based on reasonably good estimates of port\nand current flow event rates, is key for proper load balancing in a\nDSW-style event device.\n\nMigration Process\n-----------------\n\nIf the prerequisites are met, and a migration target flow and port are\nfound, the owning (source) port will initiate the migration\nprocess. For parallel queues it's a very straightforward operation -\nsimply a table update. For atomic queues, in order to maintain their\nsemantics, it's a fair bit more elaborate a procedure.\n\nA high-level view of the migration process is available[1] in the form of a\nsequence diagram.\n\nMuch simplified, it consists of the source port sending messages to all\nconfigured ports, asking them to \"pause\" the to-be-migrated flow. Such\nports will flush their output buffers and provide a confirmation back\nto the source port.\n\nEach port holds a list of which flows are paused. 
Upon the enqueue of\nan event belonging to a paused flow, it will be accepted into the\nmachinery, but kept in a paused-events buffer located on the sending\nport.\n\nAfter receiving confirmations from all ports, the source port will\nmake sure its application-level user has finished processing all\nevents related to the migrating flow, update the relevant queue's\ntable, and forward all unprocessed events (in its input event ring) to\nthe new target port.\n\nThe source port will then send out a request to \"unpause\" the flow to\nall ports. Upon receiving such a request, the port will flush any\nbuffered (paused) events related to the paused flow, and provide a\nconfirmation.\n\nAll the signaling is done on regular DPDK rings (separate from the\nevent-carrying rings), and is pulled as a part of normal\nenqueue/dequeue operations.\n\nMigrations can be performed fairly rapidly (in the range of a couple\nhundred us, or even faster), but the algorithm, load measurement and\nmigration interval parameters must be carefully chosen not to cause\nthe system to oscillate or otherwise misbehave.\n\nThe migration rate is primarily limited by the eventdev enqueue/dequeue\nfunction call rate, which in turn in the typical application is\nlimited by event burst sizes and event processing latency.\n\nMigration API Implications\n--------------------------\n\nThe migration process puts an additional requirement on the application\nbeyond the regular eventdev API, which is to not leave ports\n'unattended'. Unattended here means a port on which neither enqueue nor\ndequeue operations are performed within a reasonable time frame. What\nis 'reasonable' depends on the migration latency requirements, which\nin turn depend on the degree of variation in the workload. 
For\nenqueue-only ports, which might well come into situations where no\nevents are enqueued for long durations of time, DSW includes a\nless-than-elegant solution, allowing zero-sized enqueue operations,\nwhich serve no other purpose than to drive the migration machinery.\n\nWorkload Suitability\n====================\n\nDSW operates under the assumption that an active flow will remain so\nfor a duration which is significantly longer than the migration\nlatency.\n\nDSW should do well with a large number of small flows, and also large\nflows that increase their rates at a pace which is low enough for the\nmigration process to move away smaller flows to make room on that\nport. TCP slow-start kind of traffic, with end-to-end latencies on the\nms level, should be possible to handle, despite its exponential\nnature - but all of this is speculation.\n\nDSW won't be able to properly load balance workloads with few, highly\nbursty, high-intensity flows.\n\nCompared to the SW event device, DSW allows scaling to higher\ncore-count machines, with its substantially higher throughput and its\navoidance of a single bottleneck core, especially for long pipelines or systems\nwith very short pipeline stages.\n\nIn addition, it also scales down to configurations with very few or\neven just a single core, avoiding the issue SW has with running\napplication work and event scheduling on the same core.\n\nThe downside is that DSW doesn't have SW's near-immediate load\nbalancing flow-rerouting capability, but instead relies on flows\nchanging their inter-event time at a pace which isn't too high for the\nmigration process to handle.\n\nBackpressure\n============\n\nLike any event device, DSW provides a backpressure mechanism to\nprevent event producers from flooding it.\n\nDSW employs a credit system, with a central pool equal to the\nconfigured maximum number of in-flight events. 
The ports are allowed to\ntake loans from this central pool, and may also return credits, so\nthat consumer-heavy ports don't end up draining the pool.\n\nA port will, at the time of enqueue, make sure it has enough credits\n(one per event) to allow the events into DSW. If not, the port will\nattempt to retrieve more from the central pool. If this fails, the\nenqueue operation fails. For efficiency reasons, at least 64 credits\nare taken from the pool (even if fewer might be needed).\n\nA port will, at the time of dequeue, gain as many credits as the\nnumber of events it dequeued. A port will not return credits until\nthey reach 128, and will always keep 64 credits.\n\nAll this works in a similar, although not identical, manner to the SW event\ndevice.\n\nOutput Buffering\n================\n\nUpon a successful enqueue operation, DSW will not immediately send the\nevents to their destination ports' event input rings. Events will\nhowever - unless paused - be assigned a destination port and enqueued\nin a buffer on the sending port. Such buffered events are considered\naccepted into the event device, and are handled as such from a migration and\nin-flight credit system point of view.\n\nUpon reaching a certain threshold, buffered events will be flushed,\nand enqueued on the destination port's input ring.\n\nThe output buffers make the DSW ports use longer bursts against the\nreceiving port rings, much improving event ring efficiency.\n\nTo avoid having buffered events linger too long (or even endlessly)\nin these buffers, DSW has a scheme where it only allows a certain\nnumber of enqueue/dequeue operations ('ops') to be performed before\nthe buffers are flushed.\n\nA side effect of how 'ops' are counted is that in case a port goes\nidle, it will likely perform many dequeue operations to pull new work,\nand thus quickly bring the 'ops' count up to a level where its output buffers are\nflushed. 
That will cause lower ring efficiency, but this is of no\nmajor concern since the worker is idling anyway.\n\nThis allows for single-event enqueue operations to be efficient,\nalthough credit system and statistics update overhead will still make\nthem slower than burst enqueues.\n\nOutput Buffering API Implications\n---------------------------------\n\nThe output buffering scheme puts the same requirement on the\napplication as the migration process in that it disallows unattended ports.\n\nIn addition, DSW also implements a scheme (perhaps more accurately\ndescribed as a hack) where the application can force a buffer flush by\ndoing a zero-sized enqueue.\n\nAlternative Approach\n--------------------\n\nAn alternative to the DSW-internal output buffering is to have the\napplication use burst enqueues, preferably with very large bursts\n(the more cores, the larger the preferred bursts).\n\nDepending on how the application is structured, this might well lead\nto it having an internal buffer to which it does efficient,\nsingle-event enqueue operations, and which it then flushes on a regular\nbasis.\n\nHowever, since the DSW output buffering happens after the scheduling\nis performed, the buffers can actually be flushed earlier than if\nbuffering happens in the application, if a large fraction of the\nevents are scheduled to a particular port (since the output buffer\nlimit is on a per-destination-port basis).\n\nIn addition, since events in the output buffer are considered accepted\ninto DSW, migration can happen somewhat more quickly, since those\nevents can be flushed on migrations, as opposed to events in an\napplication-controlled buffer.\n\nStatistics\n==========\n\nDSW supports the eventdev 'xstats' interface. 
It provides a large,\nmostly per-port, set of counters, including things like port load, the\nnumber of migrations and migration latency, the number of events dequeued\nand enqueued (and on which queue), the average event processing latency,\nand a timestamp to allow the detection of unattended ports.\n\nDSW xstats also allows reading the current global total and port\ncredits, making it possible to give a rough estimate of how many\nevents are in flight.\n\nPerformance Indications\n=======================\n\nThe primary source of performance metrics is a test\napplication implementing a simulated pipeline. With zero work in each\npipeline stage, running on a single-socket x86_64 system, fourteen 2.4\nGHz worker cores can sustain 300-400 million events/s. With a pipeline\nwith 1000 clock cycles of work per stage, the average event device\noverhead is somewhere in the range of 50-150 clock cycles/event.\n\nThe benchmark is run with the system fully loaded (i.e. there are\nalways events available on the pipeline ingress), and thus the event\ndevice will benefit from batching effects, which are crucial for\nperformance. Also beneficial for DSW efficiency is the fact that the\n\"dummy\" application work cycles have a very small memory working set,\nleaving all the caches to DSW.\n\nThe simulated load has flows with a fixed event rate, causing very few\nmigrations and - more importantly - allowing DSW to provide near-ideal\nload balancing. 
So inefficiencies due to imperfect load balancing are\nalso not accounted for.\n\nThe flow-to-port/thread/core affinity of DSW should provide some\ncaching benefits for the application, for flow-related data\nstructures, compared to an event device where the flows move around\nthe ports in a more stochastic manner.\n\nShort-term TODO\n===============\n\no Figure out which DSW parameters need to be runtime configurable.\no Consider adding support for event priority.\no Add relevant test cases to eventdev unit tests.\no Convert this massive cover letter into proper DPDK documentation.\n\n[1]\n\nMattias Rönnblom (1):\n  eventdev: add distributed software (DSW) event device\n\n config/common_base                            |    5 +\n drivers/event/Makefile                        |    1 +\n drivers/event/dsw/Makefile                    |   28 +\n drivers/event/dsw/dsw_evdev.c                 |  361 +++++\n drivers/event/dsw/dsw_evdev.h                 |  296 ++++\n drivers/event/dsw/dsw_event.c                 | 1285 +++++++++++++++++\n drivers/event/dsw/dsw_sort.h                  |   47 +\n drivers/event/dsw/dsw_xstats.c                |  284 ++++\n .../event/dsw/   |    3 +\n mk/                                 |    1 +\n 10 files changed, 2311 insertions(+)\n create mode 100644 drivers/event/dsw/Makefile\n create mode 100644 drivers/event/dsw/dsw_evdev.c\n create mode 100644 drivers/event/dsw/dsw_evdev.h\n create mode 100644 drivers/event/dsw/dsw_event.c\n create mode 100644 drivers/event/dsw/dsw_sort.h\n create mode 100644 drivers/event/dsw/dsw_xstats.c\n create mode 100644 drivers/event/dsw/"
}
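A minimal client-side sketch of consuming this endpoint. The base URL here is a placeholder (substitute your Patchwork instance's address); to keep the example self-contained, a trimmed-down payload with the same shape as the response above is parsed instead of performing a real HTTP request:

```python
import json

# Hypothetical base URL -- substitute the address of your Patchwork instance.
BASE_URL = "https://patchwork.example.com"

def cover_url(base, cover_id):
    """Build the endpoint URL for a single cover letter."""
    return "{}/api/covers/{}/".format(base.rstrip("/"), cover_id)

# In a real client you would fetch the URL over HTTP, e.g. with the
# 'requests' package:
#
#     cover = requests.get(cover_url(BASE_URL, 42891)).json()
#
# Here, a trimmed-down payload mirroring the fields shown in the
# response above is parsed instead.
payload = """
{
    "id": 42891,
    "name": "[RFC,0/1] A Distributed Software Event Device",
    "submitter": {"id": 1077, "name": "Mattias R\\u00f6nnblom"},
    "series": [{"id": 532, "version": 1}]
}
"""
cover = json.loads(payload)
print(cover["name"])               # the cover letter's subject
print(cover["submitter"]["name"])  # who sent it
```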
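The "Event Scheduling" section of the cover letter above describes hashing a flow id to a 15-bit flow hash and looking up the owning port in a per-queue table, with owners picked randomly among the linked ports. A rough Python sketch of that idea (the class, the names, and the hash function are illustrative assumptions, not DSW's actual C implementation):

```python
import random

FLOW_HASH_BITS = 15
NUM_FLOW_HASHES = 1 << FLOW_HASH_BITS  # 32768 buckets, per the cover letter

def flow_hash(flow_id):
    # Illustrative multiplicative hash; DSW's actual function may differ.
    return (flow_id * 2654435761) & (NUM_FLOW_HASHES - 1)

class Queue:
    def __init__(self, linked_ports):
        # Owners are picked randomly, at initialization time, among the
        # ports linked to this queue.
        self.owner = [random.choice(linked_ports)
                      for _ in range(NUM_FLOW_HASHES)]

    def schedule(self, flow_id):
        # Enqueue-time scheduling: hash the flow id, look up the owner.
        # The same flow always maps to the same port until migrated.
        return self.owner[flow_hash(flow_id)]

q = Queue(linked_ports=[0, 1, 2, 3])
port = q.schedule(flow_id=42)
```

Migration would then amount to updating one entry of the `owner` table, which matches the cover letter's description of the parallel-queue case.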
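The "Backpressure" section describes a credit scheme: ports borrow from a central pool in chunks of at least 64 credits, gain one credit per dequeued event, and return surplus once local holdings reach 128 while always keeping 64. A sketch of that accounting under those stated numbers (class and method names are hypothetical; the real DSW code is C and lock-free):

```python
class CreditPool:
    """Central pool sized to the configured max number of in-flight events."""
    def __init__(self, max_in_flight):
        self.credits = max_in_flight

    def take(self, n):
        n = max(n, 64)            # borrow at least 64 at a time
        n = min(n, self.credits)  # never more than the pool holds
        self.credits -= n
        return n

class Port:
    def __init__(self, pool):
        self.pool = pool
        self.local = 0

    def enqueue(self, n_events):
        # One credit per event; top up from the pool if short.
        if self.local < n_events:
            self.local += self.pool.take(n_events - self.local)
        if self.local < n_events:
            return False          # backpressure: the enqueue fails
        self.local -= n_events
        return True

    def dequeue(self, n_events):
        self.local += n_events    # gain one credit per dequeued event
        if self.local >= 128:     # return surplus, always keep 64
            self.pool.credits += self.local - 64
            self.local = 64
```

With an empty pool, `enqueue` fails, which is how producers are throttled once the in-flight limit is reached.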