From patchwork Wed Apr 19 20:01:49 2023
X-Patchwork-Submitter: Pavan Nikhilesh Bhagavatula
X-Patchwork-Id: 126267
X-Patchwork-Delegate: jerinj@marvell.com
To: Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori, Satha Rao,
	Pavan Nikhilesh, Shijith Thotton
Subject: [PATCH 1/3] event/cnxk: use LMTST for enqueue new burst
Date: Thu, 20 Apr 2023 01:31:49 +0530
Message-ID: <20230419200151.2474-1-pbhagavatula@marvell.com>
List-Id: DPDK patches and discussions

From: Pavan Nikhilesh

Use LMTST when all events in the burst are enqueued with rte_event::op
as RTE_EVENT_OP_NEW, i.e., when events are enqueued with the
`rte_event_enqueue_new_burst` API.
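For reference, the grouping step this patch relies on (splitting the new-event burst into contiguous runs that share a `queue_id`, so each run can be submitted in one LMTST sequence) can be sketched standalone. `ev_stub` and `same_queue_run` are hypothetical stand-ins for illustration only, not driver symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_event: only the field the
 * grouping logic inspects. */
struct ev_stub {
	uint8_t queue_id;
};

/* Length of the leading run of events sharing ev[0]'s queue_id,
 * mirroring the run-splitting loop in cn10k_sso_hws_enq_new_burst.
 * Assumes n >= 1, as the driver's loop does. */
static inline uint16_t
same_queue_run(const struct ev_stub ev[], uint16_t n)
{
	uint16_t i;

	for (i = 1; i < n; i++)
		if (ev[i].queue_id != ev[0].queue_id)
			break;
	return i;
}
```

Each run returned this way maps to one `cn10k_sso_hws_new_event_lmtst()` call in the patch below.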
Signed-off-by: Pavan Nikhilesh
---
 drivers/common/cnxk/hw/sso.h        |   1 +
 drivers/common/cnxk/roc_sso.c       |  10 +-
 drivers/common/cnxk/roc_sso.h       |   3 +
 drivers/event/cnxk/cn10k_eventdev.c |   9 +-
 drivers/event/cnxk/cn10k_eventdev.h |   6 +-
 drivers/event/cnxk/cn10k_worker.c   | 304 +++++++++++++++++++++++++++-
 drivers/event/cnxk/cn9k_eventdev.c  |   2 +-
 drivers/event/cnxk/cnxk_eventdev.c  |  15 +-
 drivers/event/cnxk/cnxk_eventdev.h  |   4 +-
 9 files changed, 338 insertions(+), 16 deletions(-)

diff --git a/drivers/common/cnxk/hw/sso.h b/drivers/common/cnxk/hw/sso.h
index 25deaa4c14..09b8d4955f 100644
--- a/drivers/common/cnxk/hw/sso.h
+++ b/drivers/common/cnxk/hw/sso.h
@@ -157,6 +157,7 @@
 #define SSO_LF_GGRP_AQ_CNT	    (0x1c0ull)
 #define SSO_LF_GGRP_AQ_THR	    (0x1e0ull)
 #define SSO_LF_GGRP_MISC_CNT	    (0x200ull)
+#define SSO_LF_GGRP_OP_AW_LMTST	    (0x400ull)

 #define SSO_AF_IAQ_FREE_CNT_MASK      0x3FFFull
 #define SSO_AF_IAQ_RSVD_FREE_MASK     0x3FFFull
diff --git a/drivers/common/cnxk/roc_sso.c b/drivers/common/cnxk/roc_sso.c
index 4a6a5080f7..78e98bbdd6 100644
--- a/drivers/common/cnxk/roc_sso.c
+++ b/drivers/common/cnxk/roc_sso.c
@@ -6,6 +6,7 @@
 #include "roc_priv.h"

 #define SSO_XAQ_CACHE_CNT (0x7)
+#define SSO_XAQ_SLACK	  (16)

 /* Private functions. */
 int
@@ -493,9 +494,13 @@ sso_hwgrp_init_xaq_aura(struct dev *dev, struct roc_sso_xaq_data *xaq,

 	xaq->nb_xae = nb_xae;

-	/* Taken from HRM 14.3.3(4) */
+	/* SSO will reserve up to 0x4 XAQ buffers per group when the GetWork
+	 * engine is inactive, and it might prefetch an additional 0x3 buffers
+	 * due to pipelining.
+	 */
 	xaq->nb_xaq = (SSO_XAQ_CACHE_CNT * nb_hwgrp);
 	xaq->nb_xaq += PLT_MAX(1 + ((xaq->nb_xae - 1) / xae_waes), xaq->nb_xaq);
+	xaq->nb_xaq += SSO_XAQ_SLACK;

 	xaq->mem = plt_zmalloc(xaq_buf_size * xaq->nb_xaq, xaq_buf_size);
 	if (xaq->mem == NULL) {
@@ -537,7 +542,8 @@ sso_hwgrp_init_xaq_aura(struct dev *dev, struct roc_sso_xaq_data *xaq,
 	 * There should be a minimum headroom of 7 XAQs per HWGRP for SSO
 	 * to request XAQ to cache them even before enqueue is called.
 	 */
-	xaq->xaq_lmt = xaq->nb_xaq - (nb_hwgrp * SSO_XAQ_CACHE_CNT);
+	xaq->xaq_lmt =
+		xaq->nb_xaq - (nb_hwgrp * SSO_XAQ_CACHE_CNT) - SSO_XAQ_SLACK;

 	return 0;
 npa_fill_fail:
diff --git a/drivers/common/cnxk/roc_sso.h b/drivers/common/cnxk/roc_sso.h
index e67797b046..a2bb6fcb22 100644
--- a/drivers/common/cnxk/roc_sso.h
+++ b/drivers/common/cnxk/roc_sso.h
@@ -7,6 +7,9 @@

 #include "hw/ssow.h"

+#define ROC_SSO_AW_PER_LMT_LINE_LOG2 3
+#define ROC_SSO_XAE_PER_XAQ	     352
+
 struct roc_sso_hwgrp_qos {
 	uint16_t hwgrp;
 	uint8_t xaq_prcnt;
diff --git a/drivers/event/cnxk/cn10k_eventdev.c b/drivers/event/cnxk/cn10k_eventdev.c
index 071ea5a212..855c92da83 100644
--- a/drivers/event/cnxk/cn10k_eventdev.c
+++ b/drivers/event/cnxk/cn10k_eventdev.c
@@ -91,8 +91,10 @@ cn10k_sso_hws_setup(void *arg, void *hws, uintptr_t grp_base)
 	uint64_t val;

 	ws->grp_base = grp_base;
-	ws->fc_mem = (uint64_t *)dev->fc_iova;
+	ws->fc_mem = (int64_t *)dev->fc_iova;
 	ws->xaq_lmt = dev->xaq_lmt;
+	ws->fc_cache_space = dev->fc_cache_space;
+	ws->aw_lmt = ws->lmt_base;

 	/* Set get_work timeout for HWS */
 	val = NSEC2USEC(dev->deq_tmo_ns);
@@ -624,6 +626,7 @@ cn10k_sso_info_get(struct rte_eventdev *event_dev,

 	dev_info->driver_name = RTE_STR(EVENTDEV_NAME_CN10K_PMD);
 	cnxk_sso_info_get(dev, dev_info);
+	dev_info->max_event_port_enqueue_depth = UINT32_MAX;
 }

 static int
@@ -632,7 +635,7 @@ cn10k_sso_dev_configure(const struct rte_eventdev *event_dev)
 	struct cnxk_sso_evdev *dev = cnxk_sso_pmd_priv(event_dev);
 	int rc;

-	rc = cnxk_sso_dev_validate(event_dev);
+	rc = cnxk_sso_dev_validate(event_dev, 1, UINT32_MAX);
 	if (rc < 0) {
 		plt_err("Invalid event device configuration");
 		return -EINVAL;
@@ -871,7 +874,7 @@ cn10k_sso_set_priv_mem(const struct rte_eventdev *event_dev, void *lookup_mem)
 	for (i = 0; i < dev->nb_event_ports; i++) {
 		struct cn10k_sso_hws *ws = event_dev->data->ports[i];
 		ws->xaq_lmt = dev->xaq_lmt;
-		ws->fc_mem = (uint64_t *)dev->fc_iova;
+		ws->fc_mem = (int64_t *)dev->fc_iova;
 		ws->tstamp = dev->tstamp;
 		if (lookup_mem)
 			ws->lookup_mem = lookup_mem;
diff --git a/drivers/event/cnxk/cn10k_eventdev.h b/drivers/event/cnxk/cn10k_eventdev.h
index aaa01d1ec1..29567728cd 100644
--- a/drivers/event/cnxk/cn10k_eventdev.h
+++ b/drivers/event/cnxk/cn10k_eventdev.h
@@ -19,9 +19,11 @@ struct cn10k_sso_hws {
 	struct cnxk_timesync_info **tstamp;
 	uint64_t meta_aura;
 	/* Add Work Fastpath data */
-	uint64_t xaq_lmt __rte_cache_aligned;
-	uint64_t *fc_mem;
+	int64_t *fc_mem __rte_cache_aligned;
+	int64_t *fc_cache_space;
+	uintptr_t aw_lmt;
 	uintptr_t grp_base;
+	int32_t xaq_lmt;
 	/* Tx Fastpath data */
 	uintptr_t lmt_base __rte_cache_aligned;
 	uint64_t lso_tun_fmt;
diff --git a/drivers/event/cnxk/cn10k_worker.c b/drivers/event/cnxk/cn10k_worker.c
index 562d2fca13..ba3ea4bb35 100644
--- a/drivers/event/cnxk/cn10k_worker.c
+++ b/drivers/event/cnxk/cn10k_worker.c
@@ -77,6 +77,36 @@ cn10k_sso_hws_forward_event(struct cn10k_sso_hws *ws,
 		cn10k_sso_hws_fwd_group(ws, ev, grp);
 }

+static inline int32_t
+sso_read_xaq_space(struct cn10k_sso_hws *ws)
+{
+	return (ws->xaq_lmt - __atomic_load_n(ws->fc_mem, __ATOMIC_RELAXED)) *
+	       ROC_SSO_XAE_PER_XAQ;
+}
+
+static inline void
+sso_lmt_aw_wait_fc(struct cn10k_sso_hws *ws, int64_t req)
+{
+	int64_t cached, refill;
+
+retry:
+	while (__atomic_load_n(ws->fc_cache_space, __ATOMIC_RELAXED) < 0)
+		;
+
+	cached = __atomic_sub_fetch(ws->fc_cache_space, req, __ATOMIC_ACQUIRE);
+	/* Check if there is enough space, else update and retry. */
+	if (cached < 0) {
+		/* Check if we have space else retry. */
+		do {
+			refill = sso_read_xaq_space(ws);
+		} while (refill <= 0);
+		__atomic_compare_exchange(ws->fc_cache_space, &cached, &refill,
+					  0, __ATOMIC_RELEASE,
+					  __ATOMIC_RELAXED);
+		goto retry;
+	}
+}
+
 uint16_t __rte_hot
 cn10k_sso_hws_enq(void *port, const struct rte_event *ev)
 {
@@ -103,6 +133,253 @@ cn10k_sso_hws_enq(void *port, const struct rte_event *ev)
 	return 1;
 }

+#define VECTOR_SIZE_BITS	     0xFFFFFFFFFFF80000ULL
+#define VECTOR_GET_LINE_OFFSET(line) (19 + (3 * line))
+
+static uint64_t
+vector_size_partial_mask(uint16_t off, uint16_t cnt)
+{
+	return (VECTOR_SIZE_BITS & ~(~0x0ULL << off)) |
+	       ((uint64_t)(cnt - 1) << off);
+}
+
+static __rte_always_inline uint16_t
+cn10k_sso_hws_new_event_lmtst(struct cn10k_sso_hws *ws, uint8_t queue_id,
+			      const struct rte_event ev[], uint16_t n)
+{
+	uint16_t lines, partial_line, burst, left;
+	uint64_t wdata[2], pa[2] = {0};
+	uintptr_t lmt_addr;
+	uint16_t sz0, sz1;
+	uint16_t lmt_id;
+
+	sz0 = sz1 = 0;
+	lmt_addr = ws->lmt_base;
+	ROC_LMT_BASE_ID_GET(lmt_addr, lmt_id);
+
+	left = n;
+again:
+	burst = RTE_MIN(
+		BIT(ROC_SSO_AW_PER_LMT_LINE_LOG2 + ROC_LMT_LINES_PER_CORE_LOG2),
+		left);
+
+	/* Set wdata */
+	lines = burst >> ROC_SSO_AW_PER_LMT_LINE_LOG2;
+	partial_line = burst & (BIT(ROC_SSO_AW_PER_LMT_LINE_LOG2) - 1);
+	wdata[0] = wdata[1] = 0;
+	if (lines > BIT(ROC_LMT_LINES_PER_STR_LOG2)) {
+		wdata[0] = lmt_id;
+		wdata[0] |= 15ULL << 12;
+		wdata[0] |= VECTOR_SIZE_BITS;
+		pa[0] = (ws->grp_base + (queue_id << 12) +
+			 SSO_LF_GGRP_OP_AW_LMTST) |
+			(0x7 << 4);
+		sz0 = 16 << ROC_SSO_AW_PER_LMT_LINE_LOG2;
+
+		wdata[1] = lmt_id + 16;
+		pa[1] = (ws->grp_base + (queue_id << 12) +
+			 SSO_LF_GGRP_OP_AW_LMTST) |
+			(0x7 << 4);
+
+		lines -= 17;
+		wdata[1] |= partial_line ? (uint64_t)(lines + 1) << 12 :
+					   (uint64_t)(lines << 12);
+		wdata[1] |= partial_line ?
+				    vector_size_partial_mask(
+					    VECTOR_GET_LINE_OFFSET(lines),
+					    partial_line) :
+				    VECTOR_SIZE_BITS;
+		sz1 = burst - sz0;
+		partial_line = 0;
+	} else if (lines) {
+		/* We need to handle two cases here:
+		 * 1. Partial line spill over to wdata[1] i.e. lines == 16
+		 * 2. Partial line with spill lines < 16.
+		 */
+		wdata[0] = lmt_id;
+		pa[0] = (ws->grp_base + (queue_id << 12) +
+			 SSO_LF_GGRP_OP_AW_LMTST) |
+			(0x7 << 4);
+		sz0 = lines << ROC_SSO_AW_PER_LMT_LINE_LOG2;
+		if (lines == 16) {
+			wdata[0] |= 15ULL << 12;
+			wdata[0] |= VECTOR_SIZE_BITS;
+			if (partial_line) {
+				wdata[1] = lmt_id + 16;
+				pa[1] = (ws->grp_base + (queue_id << 12) +
+					 SSO_LF_GGRP_OP_AW_LMTST) |
+					((partial_line - 1) << 4);
+			}
+		} else {
+			lines -= 1;
+			wdata[0] |= partial_line ? (uint64_t)(lines + 1) << 12 :
+						   (uint64_t)(lines << 12);
+			wdata[0] |=
+				partial_line ?
+					vector_size_partial_mask(
+						VECTOR_GET_LINE_OFFSET(lines),
+						partial_line) :
+					VECTOR_SIZE_BITS;
+			sz0 += partial_line;
+		}
+		sz1 = burst - sz0;
+		partial_line = 0;
+	}
+
+	/* Only partial lines */
+	if (partial_line) {
+		wdata[0] = lmt_id;
+		pa[0] = (ws->grp_base + (queue_id << 12) +
+			 SSO_LF_GGRP_OP_AW_LMTST) |
+			((partial_line - 1) << 4);
+		sz0 = partial_line;
+		sz1 = burst - sz0;
+	}
+
+#if defined(RTE_ARCH_ARM64)
+	uint64x2_t aw_mask = {0xC0FFFFFFFFULL, ~0x0ULL};
+	uint64x2_t tt_mask = {0x300000000ULL, 0};
+	uint16_t parts;
+
+	while (burst) {
+		parts = burst > 7 ? 8 : plt_align32prevpow2(burst);
+		burst -= parts;
+		/* Let's try to fill at least one line per burst. */
+		switch (parts) {
+		case 8: {
+			uint64x2_t aw0, aw1, aw2, aw3, aw4, aw5, aw6, aw7;
+
+			aw0 = vandq_u64(vld1q_u64((const uint64_t *)&ev[0]),
+					aw_mask);
+			aw1 = vandq_u64(vld1q_u64((const uint64_t *)&ev[1]),
+					aw_mask);
+			aw2 = vandq_u64(vld1q_u64((const uint64_t *)&ev[2]),
+					aw_mask);
+			aw3 = vandq_u64(vld1q_u64((const uint64_t *)&ev[3]),
+					aw_mask);
+			aw4 = vandq_u64(vld1q_u64((const uint64_t *)&ev[4]),
+					aw_mask);
+			aw5 = vandq_u64(vld1q_u64((const uint64_t *)&ev[5]),
+					aw_mask);
+			aw6 = vandq_u64(vld1q_u64((const uint64_t *)&ev[6]),
+					aw_mask);
+			aw7 = vandq_u64(vld1q_u64((const uint64_t *)&ev[7]),
+					aw_mask);
+
+			aw0 = vorrq_u64(vandq_u64(vshrq_n_u64(aw0, 6), tt_mask),
+					aw0);
+			aw1 = vorrq_u64(vandq_u64(vshrq_n_u64(aw1, 6), tt_mask),
+					aw1);
+			aw2 = vorrq_u64(vandq_u64(vshrq_n_u64(aw2, 6), tt_mask),
+					aw2);
+			aw3 = vorrq_u64(vandq_u64(vshrq_n_u64(aw3, 6), tt_mask),
+					aw3);
+			aw4 = vorrq_u64(vandq_u64(vshrq_n_u64(aw4, 6), tt_mask),
+					aw4);
+			aw5 = vorrq_u64(vandq_u64(vshrq_n_u64(aw5, 6), tt_mask),
+					aw5);
+			aw6 = vorrq_u64(vandq_u64(vshrq_n_u64(aw6, 6), tt_mask),
+					aw6);
+			aw7 = vorrq_u64(vandq_u64(vshrq_n_u64(aw7, 6), tt_mask),
+					aw7);
+
+			vst1q_u64((void *)lmt_addr, aw0);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 16), aw1);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 32), aw2);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 48), aw3);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 64), aw4);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 80), aw5);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 96), aw6);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 112), aw7);
+			lmt_addr = (uintptr_t)PLT_PTR_ADD(lmt_addr, 128);
+		} break;
+		case 4: {
+			uint64x2_t aw0, aw1, aw2, aw3;
+
+			aw0 = vandq_u64(vld1q_u64((const uint64_t *)&ev[0]),
+					aw_mask);
+			aw1 = vandq_u64(vld1q_u64((const uint64_t *)&ev[1]),
+					aw_mask);
+			aw2 = vandq_u64(vld1q_u64((const uint64_t *)&ev[2]),
+					aw_mask);
+			aw3 = vandq_u64(vld1q_u64((const uint64_t *)&ev[3]),
+					aw_mask);
+
+			aw0 = vorrq_u64(vandq_u64(vshrq_n_u64(aw0, 6), tt_mask),
+					aw0);
+			aw1 = vorrq_u64(vandq_u64(vshrq_n_u64(aw1, 6), tt_mask),
+					aw1);
+			aw2 = vorrq_u64(vandq_u64(vshrq_n_u64(aw2, 6), tt_mask),
+					aw2);
+			aw3 = vorrq_u64(vandq_u64(vshrq_n_u64(aw3, 6), tt_mask),
+					aw3);
+
+			vst1q_u64((void *)lmt_addr, aw0);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 16), aw1);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 32), aw2);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 48), aw3);
+			lmt_addr = (uintptr_t)PLT_PTR_ADD(lmt_addr, 64);
+		} break;
+		case 2: {
+			uint64x2_t aw0, aw1;
+
+			aw0 = vandq_u64(vld1q_u64((const uint64_t *)&ev[0]),
+					aw_mask);
+			aw1 = vandq_u64(vld1q_u64((const uint64_t *)&ev[1]),
+					aw_mask);
+
+			aw0 = vorrq_u64(vandq_u64(vshrq_n_u64(aw0, 6), tt_mask),
+					aw0);
+			aw1 = vorrq_u64(vandq_u64(vshrq_n_u64(aw1, 6), tt_mask),
+					aw1);
+
+			vst1q_u64((void *)lmt_addr, aw0);
+			vst1q_u64((void *)PLT_PTR_ADD(lmt_addr, 16), aw1);
+			lmt_addr = (uintptr_t)PLT_PTR_ADD(lmt_addr, 32);
+		} break;
+		case 1: {
+			__uint128_t aw0;
+
+			aw0 = ev[0].u64;
+			aw0 <<= 64;
+			aw0 |= ev[0].event & (BIT_ULL(32) - 1);
+			aw0 |= (uint64_t)ev[0].sched_type << 32;
+
+			*((__uint128_t *)lmt_addr) = aw0;
+			lmt_addr = (uintptr_t)PLT_PTR_ADD(lmt_addr, 16);
+		} break;
+		}
+		ev += parts;
+	}
+#else
+	uint16_t i;
+
+	for (i = 0; i < burst; i++) {
+		__uint128_t aw0;
+
+		aw0 = ev[0].u64;
+		aw0 <<= 64;
+		aw0 |= ev[0].event & (BIT_ULL(32) - 1);
+		aw0 |= (uint64_t)ev[0].sched_type << 32;
+		*((__uint128_t *)lmt_addr) = aw0;
+		lmt_addr = (uintptr_t)PLT_PTR_ADD(lmt_addr, 16);
+	}
+#endif
+
+	/* wdata[0] will be always valid */
+	sso_lmt_aw_wait_fc(ws, sz0);
+	roc_lmt_submit_steorl(wdata[0], pa[0]);
+	if (wdata[1]) {
+		sso_lmt_aw_wait_fc(ws, sz1);
+		roc_lmt_submit_steorl(wdata[1], pa[1]);
+	}
+
+	left -= (sz0 + sz1);
+	if (left)
+		goto again;
+
+	return n;
+}
+
 uint16_t __rte_hot
 cn10k_sso_hws_enq_burst(void *port, const struct rte_event ev[],
 			uint16_t nb_events)
@@ -115,13 +392,32 @@ uint16_t __rte_hot
 cn10k_sso_hws_enq_new_burst(void *port, const struct rte_event ev[],
 			    uint16_t nb_events)
 {
+	uint16_t idx = 0, done = 0, rc = 0;
 	struct cn10k_sso_hws *ws = port;
-	uint16_t i, rc = 1;
+	uint8_t queue_id;
+	int32_t space;
+
+	/* Do a common back-pressure check and return */
+	space = sso_read_xaq_space(ws) - ROC_SSO_XAE_PER_XAQ;
+	if (space <= 0)
+		return 0;
+	nb_events = space < nb_events ? space : nb_events;
+
+	do {
+		queue_id = ev[idx].queue_id;
+		for (idx = idx + 1; idx < nb_events; idx++)
+			if (queue_id != ev[idx].queue_id)
+				break;
+
+		rc = cn10k_sso_hws_new_event_lmtst(ws, queue_id, &ev[done],
+						   idx - done);
+		if (rc != (idx - done))
+			return rc + done;
+		done += rc;

-	for (i = 0; i < nb_events && rc; i++)
-		rc = cn10k_sso_hws_new_event(ws, &ev[i]);
+	} while (done < nb_events);

-	return nb_events;
+	return done;
 }

 uint16_t __rte_hot
diff --git a/drivers/event/cnxk/cn9k_eventdev.c b/drivers/event/cnxk/cn9k_eventdev.c
index 7e8339bd3a..e59e537311 100644
--- a/drivers/event/cnxk/cn9k_eventdev.c
+++ b/drivers/event/cnxk/cn9k_eventdev.c
@@ -753,7 +753,7 @@ cn9k_sso_dev_configure(const struct rte_eventdev *event_dev)
 	struct cnxk_sso_evdev *dev = cnxk_sso_pmd_priv(event_dev);
 	int rc;

-	rc = cnxk_sso_dev_validate(event_dev);
+	rc = cnxk_sso_dev_validate(event_dev, 1, 1);
 	if (rc < 0) {
 		plt_err("Invalid event device configuration");
 		return -EINVAL;
diff --git a/drivers/event/cnxk/cnxk_eventdev.c b/drivers/event/cnxk/cnxk_eventdev.c
index cb9ba5d353..99f9cdcd0d 100644
--- a/drivers/event/cnxk/cnxk_eventdev.c
+++ b/drivers/event/cnxk/cnxk_eventdev.c
@@ -145,7 +145,8 @@ cnxk_sso_restore_links(const struct rte_eventdev *event_dev,
 }

 int
-cnxk_sso_dev_validate(const struct rte_eventdev *event_dev)
+cnxk_sso_dev_validate(const struct rte_eventdev *event_dev, uint32_t deq_depth,
+		      uint32_t enq_depth)
 {
 	struct rte_event_dev_config *conf = &event_dev->data->dev_conf;
 	struct cnxk_sso_evdev *dev = cnxk_sso_pmd_priv(event_dev);
@@ -173,12 +174,12 @@ cnxk_sso_dev_validate(const struct rte_eventdev *event_dev)
 		return -EINVAL;
 	}

-	if (conf->nb_event_port_dequeue_depth > 1) {
+	if (conf->nb_event_port_dequeue_depth > deq_depth) {
 		plt_err("Unsupported event port deq depth requested");
 		return -EINVAL;
 	}

-	if (conf->nb_event_port_enqueue_depth > 1) {
+	if (conf->nb_event_port_enqueue_depth > enq_depth) {
 		plt_err("Unsupported event port enq depth requested");
 		return -EINVAL;
 	}
@@ -630,6 +631,14 @@ cnxk_sso_init(struct rte_eventdev *event_dev)
 	}

 	dev = cnxk_sso_pmd_priv(event_dev);
+	dev->fc_cache_space = rte_zmalloc("fc_cache", PLT_CACHE_LINE_SIZE,
+					 PLT_CACHE_LINE_SIZE);
+	if (dev->fc_cache_space == NULL) {
+		plt_memzone_free(mz);
+		plt_err("Failed to reserve memory for XAQ fc cache");
+		return -ENOMEM;
+	}
+
 	pci_dev = container_of(event_dev->dev, struct rte_pci_device, device);
 	dev->sso.pci_dev = pci_dev;
diff --git a/drivers/event/cnxk/cnxk_eventdev.h b/drivers/event/cnxk/cnxk_eventdev.h
index c7cbd722ab..a2f30bfe5f 100644
--- a/drivers/event/cnxk/cnxk_eventdev.h
+++ b/drivers/event/cnxk/cnxk_eventdev.h
@@ -90,6 +90,7 @@ struct cnxk_sso_evdev {
 	uint32_t max_dequeue_timeout_ns;
 	int32_t max_num_events;
 	uint64_t xaq_lmt;
+	int64_t *fc_cache_space;
 	rte_iova_t fc_iova;
 	uint64_t rx_offloads;
 	uint64_t tx_offloads;
@@ -206,7 +207,8 @@ int cnxk_sso_fini(struct rte_eventdev *event_dev);
 int cnxk_sso_remove(struct rte_pci_device *pci_dev);
 void cnxk_sso_info_get(struct cnxk_sso_evdev *dev,
 		       struct rte_event_dev_info *dev_info);
-int cnxk_sso_dev_validate(const struct rte_eventdev *event_dev);
+int cnxk_sso_dev_validate(const struct rte_eventdev *event_dev,
+			  uint32_t deq_depth, uint32_t enq_depth);
 int cnxk_setup_event_ports(const struct rte_eventdev *event_dev,
 			   cnxk_sso_init_hws_mem_t init_hws_mem,
 			   cnxk_sso_hws_setup_t hws_setup);
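The `fc_cache_space` scheme added by this patch amortizes reads of the shared XAQ flow-control counter through a cached credit count: each enqueue decrements the cache, and only an underflow forces a re-read of the real counter. A minimal single-threaded model of that pattern (the names `wait_fc`, `read_space`, and both counters are illustrative stand-ins, not the driver's symbols; `read_space` plays the role of `sso_read_xaq_space`):

```c
#include <stdint.h>

/* Cached credits, decremented per request; refilled on underflow. */
static int64_t fc_cache_space;
/* Pretend shared XAQ space counter, in events. */
static int64_t shared_space = 1024;

/* Stand-in for the (slower) read of the shared flow-control counter. */
static int64_t
read_space(void)
{
	return __atomic_load_n(&shared_space, __ATOMIC_RELAXED);
}

/* Mirrors the retry structure of sso_lmt_aw_wait_fc: consume from the
 * cache; on underflow, poll the shared counter until space appears,
 * publish it via compare-exchange, and retry the consume. */
static void
wait_fc(int64_t req)
{
	int64_t cached, refill;

retry:
	while (__atomic_load_n(&fc_cache_space, __ATOMIC_RELAXED) < 0)
		;

	cached = __atomic_sub_fetch(&fc_cache_space, req, __ATOMIC_ACQUIRE);
	if (cached < 0) {
		do {
			refill = read_space();
		} while (refill <= 0);
		__atomic_compare_exchange(&fc_cache_space, &cached, &refill,
					  0, __ATOMIC_RELEASE,
					  __ATOMIC_RELAXED);
		goto retry;
	}
}
```

In the fast path only the cached counter is touched; the compare-exchange ensures that, with multiple workers, exactly one refill value wins before everyone retries.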