From patchwork Thu Sep 8 02:09:56 2022
X-Patchwork-Submitter: "Yan, Zhirun"
X-Patchwork-Id: 116060
X-Patchwork-Delegate: thomas@monjalon.net
From: Zhirun Yan
To: dev@dpdk.org, jerinj@marvell.com, kirankumark@marvell.com
Cc: cunming.liang@intel.com, haiyue.wang@intel.com, Zhirun Yan
Subject: [RFC, v1 3/6] graph: enable stream moving cross cores
Date: Thu, 8 Sep 2022 10:09:56 +0800
Message-Id: <20220908020959.1675953-4-zhirun.yan@intel.com>
In-Reply-To: <20220908020959.1675953-1-zhirun.yan@intel.com>
References: <20220908020959.1675953-1-zhirun.yan@intel.com>

This patch allows a worker thread to enqueue and move streams of objects
to the next nodes over different cores.

1. Add graph_sched_wq_node to hold the graph scheduling workqueue node
   stream.
2. Add workqueue helper functions to create/destroy/enqueue/dequeue.
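As an illustration only (this loop is not part of the patch; the thread
function and quit flag below are hypothetical names, and the walk-side
integration is assumed from the rest of this series), a cloned graph pinned
to a worker lcore could drain the cross-core work queue alongside its
regular walk:

#include <stdbool.h>
#include <rte_graph_worker.h>

static volatile bool quit; /* hypothetical application quit flag */

static int
graph_worker_thread(void *arg)
{
	struct rte_graph *graph = arg; /* cloned graph bound to this lcore */

	while (!quit) {
		/* Run the nodes affined to this lcore. */
		rte_graph_walk(graph);
		/* Drain streams handed over by other lcores via
		 * __rte_graph_sched_node_enqueue().
		 */
		__rte_graph_sched_wq_process(graph);
	}
	return 0;
}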
Signed-off-by: Haiyue Wang
Signed-off-by: Cunming Liang
Signed-off-by: Zhirun Yan
---
 lib/graph/graph_populate.c   |   1 +
 lib/graph/graph_private.h    |  38 +++++++
 lib/graph/graph_sched.c      | 194 +++++++++++++++++++++++++++++++++++
 lib/graph/meson.build        |   2 +
 lib/graph/rte_graph_worker.h |  45 ++++++++
 lib/graph/version.map        |   3 +
 6 files changed, 283 insertions(+)
 create mode 100644 lib/graph/graph_sched.c

diff --git a/lib/graph/graph_populate.c b/lib/graph/graph_populate.c
index 102fd6c29b..26f9670406 100644
--- a/lib/graph/graph_populate.c
+++ b/lib/graph/graph_populate.c
@@ -84,6 +84,7 @@ graph_nodes_populate(struct graph *_graph)
 		}
 		node->id = graph_node->node->id;
 		node->parent_id = pid;
+		node->lcore_id = graph_node->node->lcore_id;
 		nb_edges = graph_node->node->nb_edges;
 		node->nb_edges = nb_edges;
 		off += sizeof(struct rte_node);
diff --git a/lib/graph/graph_private.h b/lib/graph/graph_private.h
index d53ef289d4..e50a207764 100644
--- a/lib/graph/graph_private.h
+++ b/lib/graph/graph_private.h
@@ -59,6 +59,17 @@ struct node {
 	char next_nodes[][RTE_NODE_NAMESIZE]; /**< Names of next nodes. */
 };
 
+/**
+ * @internal
+ *
+ * Structure that holds the graph scheduling workqueue node stream.
+ */
+struct graph_sched_wq_node {
+	rte_graph_off_t node_off;
+	uint16_t nb_objs;
+	void *objs[RTE_GRAPH_BURST_SIZE];
+} __rte_cache_aligned;
+
 /**
  * @internal
  *
@@ -349,4 +360,31 @@ void graph_dump(FILE *f, struct graph *g);
  */
 void node_dump(FILE *f, struct node *n);
 
+/**
+ * @internal
+ *
+ * Create the graph schedule work queue. All cloned graphs attached to the
+ * parent graph MUST be destroyed together (fast-schedule design limitation).
+ *
+ * @param _graph
+ *   The graph object
+ * @param _parent_graph
+ *   The parent graph object which holds the run-queue head.
+ *
+ * @return
+ *   - 0: Success.
+ *   - <0: Graph schedule work queue related error.
+ */
+int graph_sched_wq_create(struct graph *_graph, struct graph *_parent_graph);
+
+/**
+ * @internal
+ *
+ * Destroy the graph schedule work queue.
+ *
+ * @param _graph
+ *   The graph object
+ */
+void graph_sched_wq_destroy(struct graph *_graph);
+
 #endif /* _RTE_GRAPH_PRIVATE_H_ */
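For a rough sense of the footprint (not spelled out in the patch, and
assuming DPDK's default RTE_GRAPH_BURST_SIZE of 256): each
graph_sched_wq_node carries the destination node offset, an object count
and 256 * 8 bytes of object pointers, i.e. a little over 2 KB per
cache-aligned entry, which is what the scheduling mempool created in
graph_sched_wq_create() below allocates per in-flight stream.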
diff --git a/lib/graph/graph_sched.c b/lib/graph/graph_sched.c
new file mode 100644
index 0000000000..465148796b
--- /dev/null
+++ b/lib/graph/graph_sched.c
@@ -0,0 +1,194 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Intel Corporation
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "graph_private.h"
+
+#define GRAPH_SCHED_WQ_SIZE_MULTIPLIER  8
+#define GRAPH_SCHED_WQ_SIZE(nb_nodes)   \
+	((typeof(nb_nodes))((nb_nodes) * GRAPH_SCHED_WQ_SIZE_MULTIPLIER))
+
+static __rte_always_inline bool
+graph_src_node_avail(struct graph *graph)
+{
+	struct graph_node *graph_node;
+
+	STAILQ_FOREACH(graph_node, &graph->node_list, next)
+		if ((graph_node->node->flags & RTE_NODE_SOURCE_F) &&
+		    (graph_node->node->lcore_id == RTE_MAX_LCORE ||
+		     graph->lcore_id == graph_node->node->lcore_id))
+			return true;
+
+	return false;
+}
+
+int
+graph_sched_wq_create(struct graph *_graph, struct graph *_parent_graph)
+{
+	struct rte_graph *parent_graph = _parent_graph->graph;
+	struct rte_graph *graph = _graph->graph;
+	unsigned int wq_size;
+
+	/* A cloned graph does not run input nodes unless affined to the Rx core */
+	if (!graph_src_node_avail(_graph))
+		graph->head = 0;
+
+	wq_size = GRAPH_SCHED_WQ_SIZE(graph->nb_nodes);
+	wq_size = rte_align32pow2(wq_size + 1);
+
+	graph->wq = rte_ring_create(graph->name, wq_size, graph->socket,
+				    RING_F_SC_DEQ);
+	if (graph->wq == NULL)
+		SET_ERR_JMP(EIO, fail, "Failed to allocate graph WQ");
+
+	graph->mp = rte_mempool_create(graph->name, wq_size,
+				       sizeof(struct graph_sched_wq_node),
+				       0, 0, NULL, NULL, NULL, NULL,
+				       graph->socket, MEMPOOL_F_SP_PUT);
+	if (graph->mp == NULL)
+		SET_ERR_JMP(EIO, fail_mp,
+			    "Failed to allocate graph WQ schedule entry");
+
+	graph->lcore_id = _graph->lcore_id;
+
+	if (parent_graph->rq == NULL) {
+		parent_graph->rq = &parent_graph->rq_head;
+		SLIST_INIT(parent_graph->rq);
+	}
+
+	graph->rq = parent_graph->rq;
+	SLIST_INSERT_HEAD(graph->rq, graph, rq_next);
+
+	return 0;
+
+fail_mp:
+	rte_ring_free(graph->wq);
+	graph->wq = NULL;
+fail:
+	return -rte_errno;
+}
+
+void
+graph_sched_wq_destroy(struct graph *_graph)
+{
+	struct rte_graph *graph = _graph->graph;
+
+	if (graph == NULL)
+		return;
+
+	rte_ring_free(graph->wq);
+	graph->wq = NULL;
+
+	rte_mempool_free(graph->mp);
+	graph->mp = NULL;
+}
+
+static __rte_always_inline bool
+__graph_sched_node_enqueue(struct rte_node *node, struct rte_graph *graph)
+{
+	struct graph_sched_wq_node *wq_node;
+	uint16_t off = 0;
+	uint16_t size;
+
+submit_again:
+	if (rte_mempool_get(graph->mp, (void **)&wq_node) < 0)
+		goto fallback;
+
+	size = RTE_MIN(node->idx, RTE_DIM(wq_node->objs));
+	wq_node->node_off = node->off;
+	wq_node->nb_objs = size;
+	rte_memcpy(wq_node->objs, &node->objs[off], size * sizeof(void *));
+
+	while (rte_ring_mp_enqueue_bulk_elem(graph->wq, (void *)&wq_node,
+					     sizeof(wq_node), 1, NULL) == 0)
+		rte_pause();
+
+	off += size;
+	node->total_sched_objs += size;
+	node->idx -= size;
+	if (node->idx > 0)
+		goto submit_again;
+
+	return true;
+
+fallback:
+	if (off != 0)
+		memmove(&node->objs[0], &node->objs[off],
+			node->idx * sizeof(void *));
+
+	node->total_sched_fail += node->idx;
+
+	return false;
+}
+
+bool __rte_noinline
+__rte_graph_sched_node_enqueue(struct rte_node *node,
+			       struct rte_graph_rq_head *rq)
+{
+	const unsigned int lcore_id = node->lcore_id;
+	struct rte_graph *graph;
+
+	SLIST_FOREACH(graph, rq, rq_next)
+		if (graph->lcore_id == lcore_id)
+			break;
+
+	return graph != NULL ? __graph_sched_node_enqueue(node, graph) : false;
+}
+
+void __rte_noinline
+__rte_graph_sched_wq_process(struct rte_graph *graph)
+{
+	struct graph_sched_wq_node *wq_node;
+	struct rte_mempool *mp = graph->mp;
+	struct rte_ring *wq = graph->wq;
+	uint16_t idx, free_space;
+	struct rte_node *node;
+	unsigned int i, n;
+	uint64_t start;
+	uint16_t rc;
+	void **objs;
+	struct graph_sched_wq_node *wq_nodes[32];
+
+	n = rte_ring_sc_dequeue_burst_elem(wq, wq_nodes, sizeof(wq_nodes[0]),
+					   RTE_DIM(wq_nodes), NULL);
+	if (n == 0)
+		return;
+
+	for (i = 0; i < n; i++) {
+		wq_node = wq_nodes[i];
+		node = RTE_PTR_ADD(graph, wq_node->node_off);
+		RTE_ASSERT(node->fence == RTE_GRAPH_FENCE);
+		idx = node->idx;
+		free_space = node->size - idx;
+
+		if (unlikely(free_space < wq_node->nb_objs))
+			__rte_node_stream_alloc_size(graph, node, node->size + wq_node->nb_objs);
+
+		memmove(&node->objs[idx], wq_node->objs, wq_node->nb_objs * sizeof(void *));
+		memset(wq_node->objs, 0, wq_node->nb_objs * sizeof(void *));
+		node->idx = idx + wq_node->nb_objs;
+		objs = node->objs;
+		rte_prefetch0(objs);
+
+		if (rte_graph_has_stats_feature()) {
+			start = rte_rdtsc();
+			rc = node->process(graph, node, objs, wq_node->nb_objs);
+			node->total_cycles += rte_rdtsc() - start;
+			node->total_calls++;
+			node->total_objs += rc;
+		} else {
+			node->process(graph, node, objs, wq_node->nb_objs);
+		}
+		wq_node->nb_objs = 0;
+		node->idx = 0;
+	}
+
+	rte_mempool_put_bulk(mp, (void **)wq_nodes, n);
+}
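To make the work-queue sizing concrete (an illustrative calculation, not
taken from the patch): for a cloned graph with 30 nodes,
GRAPH_SCHED_WQ_SIZE(30) = 30 * 8 = 240 and rte_align32pow2(240 + 1) = 256,
so both the work-queue ring and the backing mempool are created with 256
entries on the graph's socket.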
diff --git a/lib/graph/meson.build b/lib/graph/meson.build
index c7327549e8..440063b24e 100644
--- a/lib/graph/meson.build
+++ b/lib/graph/meson.build
@@ -13,8 +13,10 @@ sources = files(
         'graph_ops.c',
         'graph_debug.c',
         'graph_stats.c',
+        'graph_sched.c',
         'graph_populate.c',
 )
 headers = files('rte_graph.h', 'rte_graph_worker.h')
 
 deps += ['eal']
+deps += ['mempool', 'ring']
diff --git a/lib/graph/rte_graph_worker.h b/lib/graph/rte_graph_worker.h
index ca0185ec36..faf3f31ddc 100644
--- a/lib/graph/rte_graph_worker.h
+++ b/lib/graph/rte_graph_worker.h
@@ -28,6 +28,13 @@
 extern "C" {
 #endif
 
+/**
+ * @internal
+ *
+ * Singly-linked list head for graph schedule run-queue.
+ */
+SLIST_HEAD(rte_graph_rq_head, rte_graph);
+
 /**
  * @internal
  *
@@ -39,6 +46,14 @@ struct rte_graph {
 	uint32_t cir_mask;	     /**< Circular buffer wrap around mask. */
 	rte_node_t nb_nodes;	     /**< Number of nodes in the graph. */
 	rte_graph_off_t *cir_start;  /**< Pointer to circular buffer. */
+	/* Graph schedule area --BEGIN-- */
+	struct rte_graph_rq_head *rq __rte_cache_aligned; /* The run-queue */
+	struct rte_graph_rq_head rq_head; /* The head for run-queue list */
+	SLIST_ENTRY(rte_graph) rq_next;   /* The next for run-queue list */
+	unsigned int lcore_id;  /**< The graph running Lcore. */
+	struct rte_ring *wq;    /**< The work-queue for pending streams. */
+	struct rte_mempool *mp; /**< The mempool for scheduling streams. */
+	/* Graph schedule area --END-- */
 	rte_graph_off_t nodes_start; /**< Offset at which node memory starts. */
 	rte_graph_t id;	/**< Graph identifier. */
 	int socket;	/**< Socket ID where memory is allocated. */
@@ -63,6 +78,8 @@ struct rte_node {
 	char parent[RTE_NODE_NAMESIZE];	/**< Parent node name. */
 	char name[RTE_NODE_NAMESIZE];	/**< Name of the node. */
 
+	/* Fast schedule area */
+	unsigned int lcore_id __rte_cache_aligned; /**< Node running Lcore. */
 	/* Fast path area  */
 #define RTE_NODE_CTX_SZ 16
 	uint8_t ctx[RTE_NODE_CTX_SZ] __rte_cache_aligned; /**< Node Context. */
@@ -118,6 +135,34 @@ __rte_experimental
 void __rte_node_stream_alloc_size(struct rte_graph *graph,
 				  struct rte_node *node, uint16_t req_size);
 
+/**
+ * @internal
+ *
+ * Schedule the node to the right graph's work queue.
+ *
+ * @param node
+ *   Pointer to the scheduled node object.
+ * @param rq
+ *   Pointer to the scheduled run-queue for all graphs.
+ *
+ * @return
+ *   True on success, false otherwise.
+ */
+__rte_experimental
+bool __rte_graph_sched_node_enqueue(struct rte_node *node,
+				    struct rte_graph_rq_head *rq);
+
+/**
+ * @internal
+ *
+ * Process all nodes (streams) in the graph's work queue.
+ *
+ * @param graph
+ *   Pointer to the graph object.
+ */
+__rte_experimental
+void __rte_graph_sched_wq_process(struct rte_graph *graph);
+
 /**
  * Perform graph walk on the circular buffer and invoke the process function
  * of the nodes and collect the stats.
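For context only, a sketch of how the new per-node lcore_id and the graph
run-queue might be consulted on the fast path; the helper below is
hypothetical, and the actual walk integration is not part of this patch:

/* A stream must be handed over when the destination node is pinned to an
 * lcore other than the one walking this graph; RTE_MAX_LCORE means the
 * node has no core affinity.
 */
static __rte_always_inline bool
node_needs_dispatch(const struct rte_graph *graph, const struct rte_node *node)
{
	return node->lcore_id != RTE_MAX_LCORE &&
	       node->lcore_id != graph->lcore_id;
}

/* Illustrative use on the walk path:
 *
 *	if (node_needs_dispatch(graph, node))
 *		__rte_graph_sched_node_enqueue(node, graph->rq);
 *	else
 *		node->process(graph, node, node->objs, node->idx);
 */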
diff --git a/lib/graph/version.map b/lib/graph/version.map
index 6fc43e4411..99f82eb21a 100644
--- a/lib/graph/version.map
+++ b/lib/graph/version.map
@@ -1,6 +1,9 @@
 EXPERIMENTAL {
 	global:
 
+	__rte_graph_sched_node_enqueue;
+	__rte_graph_sched_wq_process;
+
 	__rte_node_register;
 	__rte_node_stream_alloc;
 	__rte_node_stream_alloc_size;