[dpdk-dev,09/13] bbdev: measure offload cost

Message ID 20180426133008.12388-9-kamilx.chalupnik@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Pablo de Lara Guarch
Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Kamil Chalupnik April 26, 2018, 1:30 p.m. UTC
  From: KamilX Chalupnik <kamilx.chalupnik@intel.com>

A new test was created to measure offload cost.
Changes were introduced in the API, the turbo software
driver and the test application.
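
The software enqueue cost is computed as the wall-clock cycles spent in
the enqueue call minus the cycles the driver reports having spent inside
the turbo library. A simplified sketch of the pattern used by the test
(hypothetical helper, error handling omitted):

#include <rte_bbdev.h>
#include <rte_cycles.h>

/* Sketch: software cost of one enqueue burst. The driver accumulates
 * the pure turbo library time in stats.offload_time, so the driver
 * overhead is the wall-clock time of the call minus that value. */
static uint64_t
measure_enq_sw_cycles(uint16_t dev_id, uint16_t queue_id,
		struct rte_bbdev_dec_op **ops, uint16_t burst_sz)
{
	struct rte_bbdev_stats stats;
	uint16_t enq = 0;
	uint64_t start = rte_rdtsc_precise();

	while (enq < burst_sz)
		enq += rte_bbdev_enqueue_dec_ops(dev_id, queue_id,
				&ops[enq], burst_sz - enq);

	/* the test app reads the per-queue stats directly instead */
	rte_bbdev_stats_get(dev_id, &stats);
	return (rte_rdtsc_precise() - start) - stats.offload_time;
}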

Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
---
 app/test-bbdev/test_bbdev_perf.c                 | 333 ++++++++++++++++++-----
 drivers/baseband/turbo_sw/bbdev_turbo_software.c |  55 +++-
 lib/librte_bbdev/rte_bbdev.h                     |   4 +
 3 files changed, 313 insertions(+), 79 deletions(-)
  

Comments

De Lara Guarch, Pablo May 7, 2018, 1:29 p.m. UTC | #1
> -----Original Message-----
> From: Chalupnik, KamilX
> Sent: Thursday, April 26, 2018 2:30 PM
> To: dev@dpdk.org
> Cc: Mokhtar, Amr <amr.mokhtar@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Chalupnik, KamilX
> <kamilx.chalupnik@intel.com>
> Subject: [PATCH 09/13] bbdev: measure offload cost
> 

...

> --- a/lib/librte_bbdev/rte_bbdev.h
> +++ b/lib/librte_bbdev/rte_bbdev.h
> @@ -239,6 +239,10 @@ struct rte_bbdev_stats {
>  	uint64_t enqueue_err_count;
>  	/** Total error count on operations dequeued */
>  	uint64_t dequeue_err_count;
> +#ifdef RTE_TEST_BBDEV
> +	/** It stores offload time. */

Just "offload time" is fine.

> +	uint64_t offload_time;
> +#endif

Again, I don't think it is a good idea to have this compilation check.
RTE_TEST_BBDEV is used to enable the compilation of the test app,
so it shouldn't be used for anything else.
Also, in DPDK, we avoid using such conditionals to enable/disable
pieces of code.

If you want to avoid the computation of this time, add a configuration
option to the bbdev configuration structure (rte_bbdev_queue_conf?),
so the decision is made at runtime.
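
For illustration, such a runtime switch could look like the sketch below;
the 'collect_stats' field is hypothetical and does not exist in the
current API (the other fields are from today's rte_bbdev_queue_conf):

struct rte_bbdev_queue_conf {
	int socket;                     /* NUMA socket used for memory allocation */
	uint32_t queue_size;            /* size of the queue */
	uint8_t priority;               /* queue priority */
	bool deferred_start;            /* do not start queue with device */
	enum rte_bbdev_op_type op_type; /* operation type */
	bool collect_stats;             /* hypothetical: measure offload time */
};

/* In the PMD, the flag would be cached at queue setup and checked on the
 * data path, costing one well-predicted branch per measured section: */
uint64_t start = 0;

if (q->collect_stats)  /* hypothetical field cached from the queue conf */
	start = rte_rdtsc_precise();
bblib_turbo_encoder(&turbo_req, &turbo_resp); /* return check omitted */
if (q->collect_stats)
	q_stats->offload_time += rte_rdtsc_precise() - start;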

>  };
> 
>  /**
> --
> 2.5.5
  
De Lara Guarch, Pablo May 8, 2018, 9:08 a.m. UTC | #2
Hi Kamil,

> -----Original Message-----
> From: Chalupnik, KamilX
> Sent: Tuesday, May 8, 2018 8:56 AM
> To: De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> 
> 
> 
> > -----Original Message-----
> > From: De Lara Guarch, Pablo
> > Sent: Monday, May 7, 2018 3:29 PM
> > To: Chalupnik, KamilX <kamilx.chalupnik@intel.com>; dev@dpdk.org
> > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> > Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> >
> >
> >
> > > -----Original Message-----
> > > From: Chalupnik, KamilX
> > > Sent: Thursday, April 26, 2018 2:30 PM
> > > To: dev@dpdk.org
> > > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.guarch@intel.com>; Chalupnik, KamilX
> > > <kamilx.chalupnik@intel.com>
> > > Subject: [PATCH 09/13] bbdev: measure offload cost
> > >
> >
> > ...
> >
> > > --- a/lib/librte_bbdev/rte_bbdev.h
> > > +++ b/lib/librte_bbdev/rte_bbdev.h
> > > @@ -239,6 +239,10 @@ struct rte_bbdev_stats {
> > >  	uint64_t enqueue_err_count;
> > >  	/** Total error count on operations dequeued */
> > >  	uint64_t dequeue_err_count;
> > > +#ifdef RTE_TEST_BBDEV
> > > +	/** It stores offload time. */
> >
> > Just "offload time" is fine.
> >
> > > +	uint64_t offload_time;
> > > +#endif
> >
> > Again, I don't think it is a good idea to have this compilation check.
> > RTE_TEST_BBDEV is used to enable the compilation of the test app, so
> > it shouldn't be used for anything else.
> > Also, in DPDK, we avoid using such conditionals to enable/disable
> > pieces of code.
> 
> > If you want to avoid the computation of this time, add a configuration
> > option to the bbdev configuration structure (rte_bbdev_queue_conf?),
> > so the decision is made at runtime.
> 
> OK, but this 'offload_time' variable is used only for testing purposes,
> so we decided that it should exist only when the test application is
> being built.
> If we move the decision to runtime, it may hurt driver performance,
> because we would then have to check on every call whether 'offload_time'
> has to be calculated.

I understand the performance penalty. Still, if you go for a build-time check,
you should use a different option name: the test app option should not affect
the code in a library/PMD.
Also, this option should probably be disabled by default, to avoid the extra
calculations when they are not needed.
Lastly, I think it would be better to always have the "offload_time" field in
the structure, so remove the check there.
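
For instance (RTE_BBDEV_OFFLOAD_COST is a name suggested here for
illustration, not an existing flag; it would default to 'n' in
config/common_base):

/* rte_bbdev.h: the field stays unconditional, so the struct layout does
 * not depend on the option. */
struct rte_bbdev_stats {
	uint64_t enqueued_count;
	uint64_t dequeued_count;
	uint64_t enqueue_err_count;
	uint64_t dequeue_err_count;
	/** Offload time (only updated when the option is enabled) */
	uint64_t offload_time;
};

/* bbdev_turbo_software.c: only the accumulation is compiled in or out. */
#ifdef RTE_BBDEV_OFFLOAD_COST
	start_time = rte_rdtsc_precise();
#endif
	/* ... bblib_turbo_encoder(&turbo_req, &turbo_resp); ... */
#ifdef RTE_BBDEV_OFFLOAD_COST
	q_stats->offload_time += rte_rdtsc_precise() - start_time;
#endif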

Thanks,
Pablo

> 
> > >  };
> > >
> > >  /**
> > > --
> > > 2.5.5
  
De Lara Guarch, Pablo May 8, 2018, 10:16 a.m. UTC | #3
> -----Original Message-----
> From: Chalupnik, KamilX
> Sent: Tuesday, May 8, 2018 10:48 AM
> To: De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org
> Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> 
> Hi Pablo,
> 
> > -----Original Message-----
> > From: De Lara Guarch, Pablo
> > Sent: Tuesday, May 8, 2018 11:08 AM
> > To: Chalupnik, KamilX <kamilx.chalupnik@intel.com>; dev@dpdk.org
> > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> > Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> >
> > Hi Kamil,
> >
> > > -----Original Message-----
> > > From: Chalupnik, KamilX
> > > Sent: Tuesday, May 8, 2018 8:56 AM
> > > To: De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>;
> > > dev@dpdk.org
> > > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> > > Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: De Lara Guarch, Pablo
> > > > Sent: Monday, May 7, 2018 3:29 PM
> > > > To: Chalupnik, KamilX <kamilx.chalupnik@intel.com>; dev@dpdk.org
> > > > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>
> > > > Subject: RE: [PATCH 09/13] bbdev: measure offload cost
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Chalupnik, KamilX
> > > > > Sent: Thursday, April 26, 2018 2:30 PM
> > > > > To: dev@dpdk.org
> > > > > Cc: Mokhtar, Amr <amr.mokhtar@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.guarch@intel.com>; Chalupnik, KamilX
> > > > > <kamilx.chalupnik@intel.com>
> > > > > Subject: [PATCH 09/13] bbdev: measure offload cost
> > > > >
> > > >
> > > > ...
> > > >
> > > > > --- a/lib/librte_bbdev/rte_bbdev.h
> > > > > +++ b/lib/librte_bbdev/rte_bbdev.h
> > > > > @@ -239,6 +239,10 @@ struct rte_bbdev_stats {
> > > > >  	uint64_t enqueue_err_count;
> > > > >  	/** Total error count on operations dequeued */
> > > > >  	uint64_t dequeue_err_count;
> > > > > +#ifdef RTE_TEST_BBDEV
> > > > > +	/** It stores offload time. */
> > > >
> > > > Just "offload time" is fine.
> > > >
> > > > > +	uint64_t offload_time;
> > > > > +#endif
> > > >
> > > > Again, I don't think it is a good idea to have this compilation check.
> > > > RTE_TEST_BBDEV is used to enable the compilation of the test app,
> > > > so it shouldn't be used for anything else.
> > > > Also, in DPDK, we avoid using such conditionals to enable/disable
> > > > pieces of code.
> > >
> > > > If you want to avoid the computation of this time, add a
> > > > configuration option to the bbdev configuration structure
> > > > (rte_bbdev_queue_conf?), so the decision is made at runtime.
> > >
> > > OK, but this 'offload_time' variable is used only for testing
> > > purposes, so we decided that it should exist only when the test
> > > application is being built.
> > > If we move the decision to runtime, it may hurt driver performance,
> > > because we would then have to check on every call whether
> > > 'offload_time' has to be calculated.
> >
> > I understand the performance penalty. Still, if you go for a build-time
> > check, you should use a different option name: the test app option
> > should not affect the code in a library/PMD.
> > Also, this option should probably be disabled by default, to avoid the
> > extra calculations when they are not needed.
> > Lastly, I think it would be better to always have the "offload_time"
> > field in the structure, so remove the check there.
> >
> 
> So should I create a new guard for 'offload_time', use it in the driver, and
> add it to config/common_base (or set it in the Makefile when RTE_TEST_BBDEV
> is set)? Are you OK with such an exception to the DPDK code guidelines in
> this case?

I would leave "offload_time" always available in the structure, so its size
doesn't change depending on this option.

It is better to avoid adding such options, but if we do it for performance
reasons (it affects data path code), then at least it should be well designed:
there should be a separate option (either under TURBO_SW or BBDEV), and I
think it should be disabled by default.

The test app is enabled by default, so users would most likely have this
option enabled without needing it, impacting performance. Therefore, I'd say
it is better to avoid tying this to the test app. You can document that the
option should be enabled if offload times are wanted (and maybe check in the
test app whether the compilation flag is set, and show a message telling the
user how to enable it).
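
A sketch of that last check, assuming the flag name suggested earlier in
the thread:

/* test_bbdev_perf.c: tell the user how to get offload numbers when the
 * (hypothetical) flag is not set at build time. */
#ifndef RTE_BBDEV_OFFLOAD_COST
	printf("Offload time measurement is disabled.\n"
			"Set CONFIG_RTE_BBDEV_OFFLOAD_COST=y to enable it.\n");
#endif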

> 
> Best regards,
> Kamil
  

Patch

diff --git a/app/test-bbdev/test_bbdev_perf.c b/app/test-bbdev/test_bbdev_perf.c
index 0a2fdd3..d9b34e2 100644
--- a/app/test-bbdev/test_bbdev_perf.c
+++ b/app/test-bbdev/test_bbdev_perf.c
@@ -84,6 +84,28 @@  struct thread_params {
 	struct test_op_params *op_params;
 };
 
+/* Stores time statistics */
+struct test_time_stats {
+	/* Stores software enqueue total working time */
+	uint64_t enq_sw_tot_time;
+	/* Stores minimum value of software enqueue working time */
+	uint64_t enq_sw_min_time;
+	/* Stores maximum value of software enqueue working time */
+	uint64_t enq_sw_max_time;
+	/* Stores turbo enqueue total working time */
+	uint64_t enq_tur_tot_time;
+	/* Stores minimum value of turbo enqueue working time */
+	uint64_t enq_tur_min_time;
+	/* Stores maximum value of turbo enqueue working time */
+	uint64_t enq_tur_max_time;
+	/* Stores dequeue total working time */
+	uint64_t deq_tot_time;
+	/* Stores minimum value of dequeue working time */
+	uint64_t deq_min_time;
+	/* Stores maximum value of dequeue working time */
+	uint64_t deq_max_time;
+};
+
 typedef int (test_case_function)(struct active_device *ad,
 		struct test_op_params *op_params);
 
@@ -1144,7 +1166,6 @@  dequeue_event_callback(uint16_t dev_id,
 	double in_len;
 
 	struct thread_params *tp = cb_arg;
-
 	RTE_SET_USED(ret_param);
 	queue_id = tp->queue_id;
 
@@ -1689,20 +1710,21 @@  throughput_test(struct active_device *ad,
 }
 
 static int
-operation_latency_test_dec(struct rte_mempool *mempool,
+latency_test_dec(struct rte_mempool *mempool,
 		struct test_buffers *bufs, struct rte_bbdev_dec_op *ref_op,
 		int vector_mask, uint16_t dev_id, uint16_t queue_id,
 		const uint16_t num_to_process, uint16_t burst_sz,
-		uint64_t *total_time)
+		uint64_t *total_time, uint64_t *min_time, uint64_t *max_time)
 {
 	int ret = TEST_SUCCESS;
 	uint16_t i, j, dequeued;
 	struct rte_bbdev_dec_op *ops_enq[MAX_BURST], *ops_deq[MAX_BURST];
-	uint64_t start_time = 0;
+	uint64_t start_time = 0, last_time = 0;
 
 	for (i = 0, dequeued = 0; dequeued < num_to_process; ++i) {
 		uint16_t enq = 0, deq = 0;
 		bool first_time = true;
+		last_time = 0;
 
 		if (unlikely(num_to_process - dequeued < burst_sz))
 			burst_sz = num_to_process - dequeued;
@@ -1732,11 +1754,15 @@  operation_latency_test_dec(struct rte_mempool *mempool,
 			deq += rte_bbdev_dequeue_dec_ops(dev_id, queue_id,
 					&ops_deq[deq], burst_sz - deq);
 			if (likely(first_time && (deq > 0))) {
-				*total_time += rte_rdtsc_precise() - start_time;
+				last_time = rte_rdtsc_precise() - start_time;
 				first_time = false;
 			}
 		} while (unlikely(burst_sz != deq));
 
+		*max_time = RTE_MAX(*max_time, last_time);
+		*min_time = RTE_MIN(*min_time, last_time);
+		*total_time += last_time;
+
 		if (test_vector.op_type != RTE_BBDEV_OP_NONE) {
 			ret = validate_dec_op(ops_deq, burst_sz, ref_op,
 					vector_mask);
@@ -1751,20 +1777,21 @@  operation_latency_test_dec(struct rte_mempool *mempool,
 }
 
 static int
-operation_latency_test_enc(struct rte_mempool *mempool,
+latency_test_enc(struct rte_mempool *mempool,
 		struct test_buffers *bufs, struct rte_bbdev_enc_op *ref_op,
 		uint16_t dev_id, uint16_t queue_id,
 		const uint16_t num_to_process, uint16_t burst_sz,
-		uint64_t *total_time)
+		uint64_t *total_time, uint64_t *min_time, uint64_t *max_time)
 {
 	int ret = TEST_SUCCESS;
 	uint16_t i, j, dequeued;
 	struct rte_bbdev_enc_op *ops_enq[MAX_BURST], *ops_deq[MAX_BURST];
-	uint64_t start_time = 0;
+	uint64_t start_time = 0, last_time = 0;
 
 	for (i = 0, dequeued = 0; dequeued < num_to_process; ++i) {
 		uint16_t enq = 0, deq = 0;
 		bool first_time = true;
+		last_time = 0;
 
 		if (unlikely(num_to_process - dequeued < burst_sz))
 			burst_sz = num_to_process - dequeued;
@@ -1793,11 +1820,15 @@  operation_latency_test_enc(struct rte_mempool *mempool,
 			deq += rte_bbdev_dequeue_enc_ops(dev_id, queue_id,
 					&ops_deq[deq], burst_sz - deq);
 			if (likely(first_time && (deq > 0))) {
-				*total_time += rte_rdtsc_precise() - start_time;
+				last_time += rte_rdtsc_precise() - start_time;
 				first_time = false;
 			}
 		} while (unlikely(burst_sz != deq));
 
+		*max_time = RTE_MAX(*max_time, last_time);
+		*min_time = RTE_MIN(*min_time, last_time);
+		*total_time += last_time;
+
 		if (test_vector.op_type != RTE_BBDEV_OP_NONE) {
 			ret = validate_enc_op(ops_deq, burst_sz, ref_op);
 			TEST_ASSERT_SUCCESS(ret, "Validation failed!");
@@ -1811,7 +1842,7 @@  operation_latency_test_enc(struct rte_mempool *mempool,
 }
 
 static int
-operation_latency_test(struct active_device *ad,
+latency_test(struct active_device *ad,
 		struct test_op_params *op_params)
 {
 	int iter;
@@ -1821,9 +1852,12 @@  operation_latency_test(struct active_device *ad,
 	const uint16_t queue_id = ad->queue_ids[0];
 	struct test_buffers *bufs = NULL;
 	struct rte_bbdev_info info;
-	uint64_t total_time = 0;
+	uint64_t total_time, min_time, max_time;
 	const char *op_type_str;
 
+	total_time = max_time = 0;
+	min_time = UINT64_MAX;
+
 	TEST_ASSERT_SUCCESS((burst_sz > MAX_BURST),
 			"BURST_SIZE should be <= %u", MAX_BURST);
 
@@ -1838,36 +1872,65 @@  operation_latency_test(struct active_device *ad,
 			info.dev_name, burst_sz, num_to_process, op_type_str);
 
 	if (op_type == RTE_BBDEV_OP_TURBO_DEC)
-		iter = operation_latency_test_dec(op_params->mp, bufs,
+		iter = latency_test_dec(op_params->mp, bufs,
 				op_params->ref_dec_op, op_params->vector_mask,
 				ad->dev_id, queue_id, num_to_process,
-				burst_sz, &total_time);
+				burst_sz, &total_time, &min_time, &max_time);
 	else
-		iter = operation_latency_test_enc(op_params->mp, bufs,
+		iter = latency_test_enc(op_params->mp, bufs,
 				op_params->ref_enc_op, ad->dev_id, queue_id,
-				num_to_process, burst_sz, &total_time);
+				num_to_process, burst_sz, &total_time,
+				&min_time, &max_time);
 
 	if (iter <= 0)
 		return TEST_FAILED;
 
-	printf("\toperation avg. latency: %lg cycles, %lg us\n",
+	printf("\toperation latency:\n"
+			"\t\tavg latency: %lg cycles, %lg us\n"
+			"\t\tmin latency: %lg cycles, %lg us\n"
+			"\t\tmax latency: %lg cycles, %lg us\n",
 			(double)total_time / (double)iter,
 			(double)(total_time * 1000000) / (double)iter /
+			(double)rte_get_tsc_hz(), (double)min_time,
+			(double)(min_time * 1000000) / (double)rte_get_tsc_hz(),
+			(double)max_time, (double)(max_time * 1000000) /
 			(double)rte_get_tsc_hz());
 
 	return TEST_SUCCESS;
 }
 
 static int
+get_bbdev_queue_stats(uint16_t dev_id, uint16_t queue_id,
+		struct rte_bbdev_stats *stats)
+{
+	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
+	struct rte_bbdev_stats *q_stats;
+
+	if (queue_id >= dev->data->num_queues)
+		return -1;
+
+	q_stats = &dev->data->queues[queue_id].queue_stats;
+
+	stats->enqueued_count = q_stats->enqueued_count;
+	stats->dequeued_count = q_stats->dequeued_count;
+	stats->enqueue_err_count = q_stats->enqueue_err_count;
+	stats->dequeue_err_count = q_stats->dequeue_err_count;
+	stats->offload_time = q_stats->offload_time;
+
+	return 0;
+}
+
+static int
 offload_latency_test_dec(struct rte_mempool *mempool, struct test_buffers *bufs,
 		struct rte_bbdev_dec_op *ref_op, uint16_t dev_id,
 		uint16_t queue_id, const uint16_t num_to_process,
-		uint16_t burst_sz, uint64_t *enq_total_time,
-		uint64_t *deq_total_time)
+		uint16_t burst_sz, struct test_time_stats *time_st)
 {
-	int i, dequeued;
+	int i, dequeued, ret;
 	struct rte_bbdev_dec_op *ops_enq[MAX_BURST], *ops_deq[MAX_BURST];
 	uint64_t enq_start_time, deq_start_time;
+	uint64_t enq_sw_last_time, deq_last_time;
+	struct rte_bbdev_stats stats;
 
 	for (i = 0, dequeued = 0; dequeued < num_to_process; ++i) {
 		uint16_t enq = 0, deq = 0;
@@ -1883,24 +1946,54 @@  offload_latency_test_dec(struct rte_mempool *mempool, struct test_buffers *bufs,
 					bufs->soft_outputs,
 					ref_op);
 
-		/* Start time measurment for enqueue function offload latency */
-		enq_start_time = rte_rdtsc();
+		/* Start time meas for enqueue function offload latency */
+		enq_start_time = rte_rdtsc_precise();
 		do {
 			enq += rte_bbdev_enqueue_dec_ops(dev_id, queue_id,
 					&ops_enq[enq], burst_sz - enq);
 		} while (unlikely(burst_sz != enq));
-		*enq_total_time += rte_rdtsc() - enq_start_time;
+
+		ret = get_bbdev_queue_stats(dev_id, queue_id, &stats);
+		TEST_ASSERT_SUCCESS(ret,
+				"Failed to get stats for queue (%u) of device (%u)",
+				queue_id, dev_id);
+
+		enq_sw_last_time = rte_rdtsc_precise() - enq_start_time -
+				stats.offload_time;
+		time_st->enq_sw_max_time = RTE_MAX(time_st->enq_sw_max_time,
+				enq_sw_last_time);
+		time_st->enq_sw_min_time = RTE_MIN(time_st->enq_sw_min_time,
+				enq_sw_last_time);
+		time_st->enq_sw_tot_time += enq_sw_last_time;
+
+		time_st->enq_tur_max_time = RTE_MAX(time_st->enq_tur_max_time,
+				stats.offload_time);
+		time_st->enq_tur_min_time = RTE_MIN(time_st->enq_tur_min_time,
+				stats.offload_time);
+		time_st->enq_tur_tot_time += stats.offload_time;
 
 		/* ensure enqueue has been completed */
 		rte_delay_ms(10);
 
-		/* Start time measurment for dequeue function offload latency */
-		deq_start_time = rte_rdtsc();
+		/* Start time meas for dequeue function offload latency */
+		deq_start_time = rte_rdtsc_precise();
+		/* Dequeue one operation */
 		do {
 			deq += rte_bbdev_dequeue_dec_ops(dev_id, queue_id,
+					&ops_deq[deq], 1);
+		} while (unlikely(deq != 1));
+
+		deq_last_time = rte_rdtsc_precise() - deq_start_time;
+		time_st->deq_max_time = RTE_MAX(time_st->deq_max_time,
+				deq_last_time);
+		time_st->deq_min_time = RTE_MIN(time_st->deq_min_time,
+				deq_last_time);
+		time_st->deq_tot_time += deq_last_time;
+
+		/* Dequeue remaining operations if needed*/
+		while (burst_sz != deq)
+			deq += rte_bbdev_dequeue_dec_ops(dev_id, queue_id,
 					&ops_deq[deq], burst_sz - deq);
-		} while (unlikely(burst_sz != deq));
-		*deq_total_time += rte_rdtsc() - deq_start_time;
 
 		rte_bbdev_dec_op_free_bulk(ops_enq, deq);
 		dequeued += deq;
@@ -1913,12 +2006,13 @@  static int
 offload_latency_test_enc(struct rte_mempool *mempool, struct test_buffers *bufs,
 		struct rte_bbdev_enc_op *ref_op, uint16_t dev_id,
 		uint16_t queue_id, const uint16_t num_to_process,
-		uint16_t burst_sz, uint64_t *enq_total_time,
-		uint64_t *deq_total_time)
+		uint16_t burst_sz, struct test_time_stats *time_st)
 {
-	int i, dequeued;
+	int i, dequeued, ret;
 	struct rte_bbdev_enc_op *ops_enq[MAX_BURST], *ops_deq[MAX_BURST];
 	uint64_t enq_start_time, deq_start_time;
+	uint64_t enq_sw_last_time, deq_last_time;
+	struct rte_bbdev_stats stats;
 
 	for (i = 0, dequeued = 0; dequeued < num_to_process; ++i) {
 		uint16_t enq = 0, deq = 0;
@@ -1933,24 +2027,53 @@  offload_latency_test_enc(struct rte_mempool *mempool, struct test_buffers *bufs,
 					bufs->hard_outputs,
 					ref_op);
 
-		/* Start time measurment for enqueue function offload latency */
-		enq_start_time = rte_rdtsc();
+		/* Start time meas for enqueue function offload latency */
+		enq_start_time = rte_rdtsc_precise();
 		do {
 			enq += rte_bbdev_enqueue_enc_ops(dev_id, queue_id,
 					&ops_enq[enq], burst_sz - enq);
 		} while (unlikely(burst_sz != enq));
-		*enq_total_time += rte_rdtsc() - enq_start_time;
+
+		ret = get_bbdev_queue_stats(dev_id, queue_id, &stats);
+		TEST_ASSERT_SUCCESS(ret,
+				"Failed to get stats for queue (%u) of device (%u)",
+				queue_id, dev_id);
+
+		enq_sw_last_time = rte_rdtsc_precise() - enq_start_time -
+				stats.offload_time;
+		time_st->enq_sw_max_time = RTE_MAX(time_st->enq_sw_max_time,
+				enq_sw_last_time);
+		time_st->enq_sw_min_time = RTE_MIN(time_st->enq_sw_min_time,
+				enq_sw_last_time);
+		time_st->enq_sw_tot_time += enq_sw_last_time;
+
+		time_st->enq_tur_max_time = RTE_MAX(time_st->enq_tur_max_time,
+				stats.offload_time);
+		time_st->enq_tur_min_time = RTE_MIN(time_st->enq_tur_min_time,
+				stats.offload_time);
+		time_st->enq_tur_tot_time += stats.offload_time;
 
 		/* ensure enqueue has been completed */
 		rte_delay_ms(10);
 
-		/* Start time measurment for dequeue function offload latency */
-		deq_start_time = rte_rdtsc();
+		/* Start time meas for dequeue function offload latency */
+		deq_start_time = rte_rdtsc_precise();
+		/* Dequeue one operation */
 		do {
 			deq += rte_bbdev_dequeue_enc_ops(dev_id, queue_id,
+					&ops_deq[deq], 1);
+		} while (unlikely(deq != 1));
+
+		deq_last_time = rte_rdtsc_precise() - deq_start_time;
+		time_st->deq_max_time = RTE_MAX(time_st->deq_max_time,
+				deq_last_time);
+		time_st->deq_min_time = RTE_MIN(time_st->deq_min_time,
+				deq_last_time);
+		time_st->deq_tot_time += deq_last_time;
+
+		while (burst_sz != deq)
+			deq += rte_bbdev_dequeue_enc_ops(dev_id, queue_id,
 					&ops_deq[deq], burst_sz - deq);
-		} while (unlikely(burst_sz != deq));
-		*deq_total_time += rte_rdtsc() - deq_start_time;
 
 		rte_bbdev_enc_op_free_bulk(ops_enq, deq);
 		dequeued += deq;
@@ -1960,11 +2083,10 @@  offload_latency_test_enc(struct rte_mempool *mempool, struct test_buffers *bufs,
 }
 
 static int
-offload_latency_test(struct active_device *ad,
+offload_cost_test(struct active_device *ad,
 		struct test_op_params *op_params)
 {
 	int iter;
-	uint64_t enq_total_time = 0, deq_total_time = 0;
 	uint16_t burst_sz = op_params->burst_sz;
 	const uint16_t num_to_process = op_params->num_to_process;
 	const enum rte_bbdev_op_type op_type = test_vector.op_type;
@@ -1972,6 +2094,12 @@  offload_latency_test(struct active_device *ad,
 	struct test_buffers *bufs = NULL;
 	struct rte_bbdev_info info;
 	const char *op_type_str;
+	struct test_time_stats time_st;
+
+	memset(&time_st, 0, sizeof(struct test_time_stats));
+	time_st.enq_sw_min_time = UINT64_MAX;
+	time_st.enq_tur_min_time = UINT64_MAX;
+	time_st.deq_min_time = UINT64_MAX;
 
 	TEST_ASSERT_SUCCESS((burst_sz > MAX_BURST),
 			"BURST_SIZE should be <= %u", MAX_BURST);
@@ -1989,26 +2117,51 @@  offload_latency_test(struct active_device *ad,
 	if (op_type == RTE_BBDEV_OP_TURBO_DEC)
 		iter = offload_latency_test_dec(op_params->mp, bufs,
 				op_params->ref_dec_op, ad->dev_id, queue_id,
-				num_to_process, burst_sz, &enq_total_time,
-				&deq_total_time);
+				num_to_process, burst_sz, &time_st);
 	else
 		iter = offload_latency_test_enc(op_params->mp, bufs,
 				op_params->ref_enc_op, ad->dev_id, queue_id,
-				num_to_process, burst_sz, &enq_total_time,
-				&deq_total_time);
+				num_to_process, burst_sz, &time_st);
 
 	if (iter <= 0)
 		return TEST_FAILED;
 
-	printf("\tenq offload avg. latency: %lg cycles, %lg us\n",
-			(double)enq_total_time / (double)iter,
-			(double)(enq_total_time * 1000000) / (double)iter /
-			(double)rte_get_tsc_hz());
-
-	printf("\tdeq offload avg. latency: %lg cycles, %lg us\n",
-			(double)deq_total_time / (double)iter,
-			(double)(deq_total_time * 1000000) / (double)iter /
-			(double)rte_get_tsc_hz());
+	printf("\tenq offload cost latency:\n"
+			"\t\tsoftware avg %lg cycles, %lg us\n"
+			"\t\tsoftware min %lg cycles, %lg us\n"
+			"\t\tsoftware max %lg cycles, %lg us\n"
+			"\t\tturbo avg %lg cycles, %lg us\n"
+			"\t\tturbo min %lg cycles, %lg us\n"
+			"\t\tturbo max %lg cycles, %lg us\n",
+			(double)time_st.enq_sw_tot_time / (double)iter,
+			(double)(time_st.enq_sw_tot_time * 1000000) /
+			(double)iter / (double)rte_get_tsc_hz(),
+			(double)time_st.enq_sw_min_time,
+			(double)(time_st.enq_sw_min_time * 1000000) /
+			rte_get_tsc_hz(), (double)time_st.enq_sw_max_time,
+			(double)(time_st.enq_sw_max_time * 1000000) /
+			rte_get_tsc_hz(), (double)time_st.enq_tur_tot_time /
+			(double)iter,
+			(double)(time_st.enq_tur_tot_time * 1000000) /
+			(double)iter / (double)rte_get_tsc_hz(),
+			(double)time_st.enq_tur_min_time,
+			(double)(time_st.enq_tur_min_time * 1000000) /
+			rte_get_tsc_hz(), (double)time_st.enq_tur_max_time,
+			(double)(time_st.enq_tur_max_time * 1000000) /
+			rte_get_tsc_hz());
+
+	printf("\tdeq offload cost latency - one op:\n"
+			"\t\tavg %lg cycles, %lg us\n"
+			"\t\tmin %lg cycles, %lg us\n"
+			"\t\tmax %lg cycles, %lg us\n",
+			(double)time_st.deq_tot_time / (double)iter,
+			(double)(time_st.deq_tot_time * 1000000) /
+			(double)iter / (double)rte_get_tsc_hz(),
+			(double)time_st.deq_min_time,
+			(double)(time_st.deq_min_time * 1000000) /
+			rte_get_tsc_hz(), (double)time_st.deq_max_time,
+			(double)(time_st.deq_max_time * 1000000) /
+			rte_get_tsc_hz());
 
 	return TEST_SUCCESS;
 }
@@ -2016,21 +2169,28 @@  offload_latency_test(struct active_device *ad,
 static int
 offload_latency_empty_q_test_dec(uint16_t dev_id, uint16_t queue_id,
 		const uint16_t num_to_process, uint16_t burst_sz,
-		uint64_t *deq_total_time)
+		uint64_t *deq_tot_time, uint64_t *deq_min_time,
+		uint64_t *deq_max_time)
 {
 	int i, deq_total;
 	struct rte_bbdev_dec_op *ops[MAX_BURST];
-	uint64_t deq_start_time;
+	uint64_t deq_start_time, deq_last_time;
 
 	/* Test deq offload latency from an empty queue */
-	deq_start_time = rte_rdtsc_precise();
+
 	for (i = 0, deq_total = 0; deq_total < num_to_process;
 			++i, deq_total += burst_sz) {
+		deq_start_time = rte_rdtsc_precise();
+
 		if (unlikely(num_to_process - deq_total < burst_sz))
 			burst_sz = num_to_process - deq_total;
 		rte_bbdev_dequeue_dec_ops(dev_id, queue_id, ops, burst_sz);
+
+		deq_last_time = rte_rdtsc_precise() - deq_start_time;
+		*deq_max_time = RTE_MAX(*deq_max_time, deq_last_time);
+		*deq_min_time = RTE_MIN(*deq_min_time, deq_last_time);
+		*deq_tot_time += deq_last_time;
 	}
-	*deq_total_time = rte_rdtsc_precise() - deq_start_time;
 
 	return i;
 }
@@ -2038,21 +2198,27 @@  offload_latency_empty_q_test_dec(uint16_t dev_id, uint16_t queue_id,
 static int
 offload_latency_empty_q_test_enc(uint16_t dev_id, uint16_t queue_id,
 		const uint16_t num_to_process, uint16_t burst_sz,
-		uint64_t *deq_total_time)
+		uint64_t *deq_tot_time, uint64_t *deq_min_time,
+		uint64_t *deq_max_time)
 {
 	int i, deq_total;
 	struct rte_bbdev_enc_op *ops[MAX_BURST];
-	uint64_t deq_start_time;
+	uint64_t deq_start_time, deq_last_time;
 
 	/* Test deq offload latency from an empty queue */
-	deq_start_time = rte_rdtsc_precise();
 	for (i = 0, deq_total = 0; deq_total < num_to_process;
 			++i, deq_total += burst_sz) {
+		deq_start_time = rte_rdtsc_precise();
+
 		if (unlikely(num_to_process - deq_total < burst_sz))
 			burst_sz = num_to_process - deq_total;
 		rte_bbdev_dequeue_enc_ops(dev_id, queue_id, ops, burst_sz);
+
+		deq_last_time = rte_rdtsc_precise() - deq_start_time;
+		*deq_max_time = RTE_MAX(*deq_max_time, deq_last_time);
+		*deq_min_time = RTE_MIN(*deq_min_time, deq_last_time);
+		*deq_tot_time += deq_last_time;
 	}
-	*deq_total_time = rte_rdtsc_precise() - deq_start_time;
 
 	return i;
 }
@@ -2062,7 +2228,7 @@  offload_latency_empty_q_test(struct active_device *ad,
 		struct test_op_params *op_params)
 {
 	int iter;
-	uint64_t deq_total_time = 0;
+	uint64_t deq_tot_time, deq_min_time, deq_max_time;
 	uint16_t burst_sz = op_params->burst_sz;
 	const uint16_t num_to_process = op_params->num_to_process;
 	const enum rte_bbdev_op_type op_type = test_vector.op_type;
@@ -2070,6 +2236,9 @@  offload_latency_empty_q_test(struct active_device *ad,
 	struct rte_bbdev_info info;
 	const char *op_type_str;
 
+	deq_tot_time = deq_max_time = 0;
+	deq_min_time = UINT64_MAX;
+
 	TEST_ASSERT_SUCCESS((burst_sz > MAX_BURST),
 			"BURST_SIZE should be <= %u", MAX_BURST);
 
@@ -2084,18 +2253,26 @@  offload_latency_empty_q_test(struct active_device *ad,
 
 	if (op_type == RTE_BBDEV_OP_TURBO_DEC)
 		iter = offload_latency_empty_q_test_dec(ad->dev_id, queue_id,
-				num_to_process, burst_sz, &deq_total_time);
+				num_to_process, burst_sz, &deq_tot_time,
+				&deq_min_time, &deq_max_time);
 	else
 		iter = offload_latency_empty_q_test_enc(ad->dev_id, queue_id,
-				num_to_process, burst_sz, &deq_total_time);
+				num_to_process, burst_sz, &deq_tot_time,
+				&deq_min_time, &deq_max_time);
 
 	if (iter <= 0)
 		return TEST_FAILED;
 
-	printf("\tempty deq offload avg. latency: %lg cycles, %lg us\n",
-			(double)deq_total_time / (double)iter,
-			(double)(deq_total_time * 1000000) / (double)iter /
-			(double)rte_get_tsc_hz());
+	printf("\tempty deq offload\n"
+			"\t\tavg. latency: %lg cycles, %lg us\n"
+			"\t\tmin. latency: %lg cycles, %lg us\n"
+			"\t\tmax. latency: %lg cycles, %lg us\n",
+			(double)deq_tot_time / (double)iter,
+			(double)(deq_tot_time * 1000000) / (double)iter /
+			(double)rte_get_tsc_hz(), (double)deq_min_time,
+			(double)(deq_min_time * 1000000) / rte_get_tsc_hz(),
+			(double)deq_max_time, (double)(deq_max_time * 1000000) /
+			rte_get_tsc_hz());
 
 	return TEST_SUCCESS;
 }
@@ -2107,9 +2284,9 @@  throughput_tc(void)
 }
 
 static int
-offload_latency_tc(void)
+offload_cost_tc(void)
 {
-	return run_test_case(offload_latency_test);
+	return run_test_case(offload_cost_test);
 }
 
 static int
@@ -2119,9 +2296,9 @@  offload_latency_empty_q_tc(void)
 }
 
 static int
-operation_latency_tc(void)
+latency_tc(void)
 {
-	return run_test_case(operation_latency_test);
+	return run_test_case(latency_test);
 }
 
 static int
@@ -2145,7 +2322,7 @@  static struct unit_test_suite bbdev_validation_testsuite = {
 	.setup = testsuite_setup,
 	.teardown = testsuite_teardown,
 	.unit_test_cases = {
-		TEST_CASE_ST(ut_setup, ut_teardown, operation_latency_tc),
+		TEST_CASE_ST(ut_setup, ut_teardown, latency_tc),
 		TEST_CASES_END() /**< NULL terminate unit test array */
 	}
 };
@@ -2155,9 +2332,18 @@  static struct unit_test_suite bbdev_latency_testsuite = {
 	.setup = testsuite_setup,
 	.teardown = testsuite_teardown,
 	.unit_test_cases = {
-		TEST_CASE_ST(ut_setup, ut_teardown, offload_latency_tc),
+		TEST_CASE_ST(ut_setup, ut_teardown, latency_tc),
+		TEST_CASES_END() /**< NULL terminate unit test array */
+	}
+};
+
+static struct unit_test_suite bbdev_offload_cost_testsuite = {
+	.suite_name = "BBdev Offload Cost Tests",
+	.setup = testsuite_setup,
+	.teardown = testsuite_teardown,
+	.unit_test_cases = {
+		TEST_CASE_ST(ut_setup, ut_teardown, offload_cost_tc),
 		TEST_CASE_ST(ut_setup, ut_teardown, offload_latency_empty_q_tc),
-		TEST_CASE_ST(ut_setup, ut_teardown, operation_latency_tc),
 		TEST_CASES_END() /**< NULL terminate unit test array */
 	}
 };
@@ -2175,4 +2361,5 @@  static struct unit_test_suite bbdev_interrupt_testsuite = {
 REGISTER_TEST_COMMAND(throughput, bbdev_throughput_testsuite);
 REGISTER_TEST_COMMAND(validation, bbdev_validation_testsuite);
 REGISTER_TEST_COMMAND(latency, bbdev_latency_testsuite);
+REGISTER_TEST_COMMAND(offload, bbdev_offload_cost_testsuite);
 REGISTER_TEST_COMMAND(interrupt, bbdev_interrupt_testsuite);
diff --git a/drivers/baseband/turbo_sw/bbdev_turbo_software.c b/drivers/baseband/turbo_sw/bbdev_turbo_software.c
index 1664fca..af484fc 100644
--- a/drivers/baseband/turbo_sw/bbdev_turbo_software.c
+++ b/drivers/baseband/turbo_sw/bbdev_turbo_software.c
@@ -9,6 +9,7 @@ 
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include <rte_kvargs.h>
+#include <rte_cycles.h>
 
 #include <rte_bbdev.h>
 #include <rte_bbdev_pmd.h>
@@ -455,7 +456,8 @@  static inline void
 process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 		uint8_t r, uint8_t c, uint16_t k, uint16_t ncb,
 		uint32_t e, struct rte_mbuf *m_in, struct rte_mbuf *m_out,
-		uint16_t in_offset, uint16_t out_offset, uint16_t total_left)
+		uint16_t in_offset, uint16_t out_offset, uint16_t total_left,
+		struct rte_bbdev_stats *q_stats)
 {
 	int ret;
 	int16_t k_idx;
@@ -469,6 +471,11 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 	struct bblib_turbo_encoder_response turbo_resp;
 	struct bblib_rate_match_dl_request rm_req;
 	struct bblib_rate_match_dl_response rm_resp;
+#ifdef RTE_TEST_BBDEV
+	uint64_t start_time;
+#else
+	RTE_SET_USED(q_stats);
+#endif
 
 	k_idx = compute_idx(k);
 	in = rte_pktmbuf_mtod_offset(m_in, uint8_t *, in_offset);
@@ -499,7 +506,13 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 		}
 
 		crc_resp.data = in;
+#ifdef RTE_TEST_BBDEV
+		start_time = rte_rdtsc_precise();
+#endif
 		bblib_lte_crc24a_gen(&crc_req, &crc_resp);
+#ifdef RTE_TEST_BBDEV
+		q_stats->offload_time += rte_rdtsc_precise() - start_time;
+#endif
 	} else if (enc->op_flags & RTE_BBDEV_TURBO_CRC_24B_ATTACH) {
 		/* CRC24B */
 		ret = is_enc_input_valid(k - 24, k_idx, total_left);
@@ -525,7 +538,13 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 		}
 
 		crc_resp.data = in;
+#ifdef RTE_TEST_BBDEV
+		start_time = rte_rdtsc_precise();
+#endif
 		bblib_lte_crc24b_gen(&crc_req, &crc_resp);
+#ifdef RTE_TEST_BBDEV
+		q_stats->offload_time += rte_rdtsc_precise() - start_time;
+#endif
 	} else {
 		ret = is_enc_input_valid(k, k_idx, total_left);
 		if (ret != 0) {
@@ -572,12 +591,21 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 	turbo_resp.output_win_0 = out0;
 	turbo_resp.output_win_1 = out1;
 	turbo_resp.output_win_2 = out2;
+
+#ifdef RTE_TEST_BBDEV
+	start_time = rte_rdtsc_precise();
+#endif
+
 	if (bblib_turbo_encoder(&turbo_req, &turbo_resp) != 0) {
 		op->status |= 1 << RTE_BBDEV_DRV_ERROR;
 		rte_bbdev_log(ERR, "Turbo Encoder failed");
 		return;
 	}
 
+#ifdef RTE_TEST_BBDEV
+	q_stats->offload_time += rte_rdtsc_precise() - start_time;
+#endif
+
 	/* Restore 3 first bytes of next CB if they were overwritten by CRC*/
 	if (first_3_bytes != 0)
 		*((uint64_t *)&in[(k - 32) >> 3]) = first_3_bytes;
@@ -639,6 +667,10 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 		else
 			rm_req.bypass_rvidx = 0;
 
+#ifdef RTE_TEST_BBDEV
+		start_time = rte_rdtsc_precise();
+#endif
+
 		if (bblib_rate_match_dl(&rm_req, &rm_resp) != 0) {
 			op->status |= 1 << RTE_BBDEV_DRV_ERROR;
 			rte_bbdev_log(ERR, "Rate matching failed");
@@ -651,6 +683,10 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 		mask_id = (e & 7) >> 1;
 		rm_out[out_len - 1] &= mask_out[mask_id];
 
+#ifdef RTE_TEST_BBDEV
+		q_stats->offload_time += rte_rdtsc_precise() - start_time;
+#endif
+
 		enc->output.length += rm_resp.OutputLen;
 	} else {
 		/* Rate matching is bypassed */
@@ -678,7 +714,8 @@  process_enc_cb(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
 }
 
 static inline void
-enqueue_enc_one_op(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op)
+enqueue_enc_one_op(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op,
+		struct rte_bbdev_stats *queue_stats)
 {
 	uint8_t c, r, crc24_bits = 0;
 	uint16_t k, ncb;
@@ -733,7 +770,8 @@  enqueue_enc_one_op(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op)
 		}
 
 		process_enc_cb(q, op, r, c, k, ncb, e, m_in,
-				m_out, in_offset, out_offset, total_left);
+				m_out, in_offset, out_offset, total_left,
+				queue_stats);
 		/* Update total_left */
 		total_left -= (k - crc24_bits) >> 3;
 		/* Update offsets for next CBs (if exist) */
@@ -755,12 +793,15 @@  enqueue_enc_one_op(struct turbo_sw_queue *q, struct rte_bbdev_enc_op *op)
 
 static inline uint16_t
 enqueue_enc_all_ops(struct turbo_sw_queue *q, struct rte_bbdev_enc_op **ops,
-		uint16_t nb_ops)
+		uint16_t nb_ops, struct rte_bbdev_stats *queue_stats)
 {
 	uint16_t i;
+#ifdef RTE_TEST_BBDEV
+	queue_stats->offload_time = 0;
+#endif
 
 	for (i = 0; i < nb_ops; ++i)
-		enqueue_enc_one_op(q, ops[i]);
+		enqueue_enc_one_op(q, ops[i], queue_stats);
 
 	return rte_ring_enqueue_burst(q->processed_pkts, (void **)ops, nb_ops,
 			NULL);
@@ -939,6 +980,8 @@  process_dec_cb(struct turbo_sw_queue *q, struct rte_bbdev_dec_op *op,
 	turbo_req.k = k;
 	turbo_req.k_idx = k_idx;
 	turbo_req.max_iter_num = dec->iter_max;
+	turbo_req.early_term_disable = !check_bit(dec->op_flags,
+			RTE_BBDEV_TURBO_EARLY_TERMINATION);
 	turbo_resp.ag_buf = q->ag;
 	turbo_resp.cb_buf = q->code_block;
 	turbo_resp.output = out;
@@ -1051,7 +1094,7 @@  enqueue_enc_ops(struct rte_bbdev_queue_data *q_data,
 	struct turbo_sw_queue *q = queue;
 	uint16_t nb_enqueued = 0;
 
-	nb_enqueued = enqueue_enc_all_ops(q, ops, nb_ops);
+	nb_enqueued = enqueue_enc_all_ops(q, ops, nb_ops, &q_data->queue_stats);
 
 	q_data->queue_stats.enqueue_err_count += nb_ops - nb_enqueued;
 	q_data->queue_stats.enqueued_count += nb_enqueued;
diff --git a/lib/librte_bbdev/rte_bbdev.h b/lib/librte_bbdev/rte_bbdev.h
index 5e7e495..bd0ec24 100644
--- a/lib/librte_bbdev/rte_bbdev.h
+++ b/lib/librte_bbdev/rte_bbdev.h
@@ -239,6 +239,10 @@  struct rte_bbdev_stats {
 	uint64_t enqueue_err_count;
 	/** Total error count on operations dequeued */
 	uint64_t dequeue_err_count;
+#ifdef RTE_TEST_BBDEV
+	/** It stores offload time. */
+	uint64_t offload_time;
+#endif
 };
 
 /**