[dpdk-dev,v2,6/7] rte_sched: eliminate floating point in calculating byte clock

Message ID 3EB4FA525960D640B5BDFFD6A3D8912632318070@IRSMSX108.ger.corp.intel.com (mailing list archive)
State Not Applicable, archived
Headers

Commit Message

Cristian Dumitrescu Feb. 16, 2015, 10:44 p.m. UTC
Hi Stephen,

Sorry, NACK.

1. Overflow issue
As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.

2. Integer division has a higher cost than floating point division
My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
with an integer division:
	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
	Instruction FDIV double precision: latency = 38-40 cycles
	Instruction IDIV: latency = 56 - 80 cycles

3. Alternative
I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?

Thanks,
Cristian


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
Sent: Thursday, February 5, 2015 6:14 AM
To: dev@dpdk.org
Cc: Stephen Hemminger
Subject: [dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

From: Stephen Hemminger <shemming@brocade.com>

The old code was doing a floating point divide for each rte_dequeue()
which is very expensive. Change to using fixed point scaled math instead.
This improved performance from 5Gbit/sec to 10 Gbit/sec

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/librte_sched/rte_sched.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
  

Comments

Stephen Hemminger Feb. 17, 2015, 4:05 p.m. UTC | #1
On Mon, 16 Feb 2015 22:44:31 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu@intel.com> wrote:

> Hi Stephen,
> 
> Sorry, NACK.
> 
> 1. Overflow issue
> As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
> 	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
> Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.

The tsc shifted is still 64 bits.
and rate is 32 bits bytes/sec.

I chose scale such that
if clock = 3 Ghz
then min rate = 715 bytes/sec =  5722 bits/sec

> 2. Integer division has a higher cost than floating point division
> My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
> 	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> with an integer division:
> 	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
> I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
> 	Instruction FDIV double precision: latency = 38-40 cycles
> 	Instruction IDIV: latency = 56 - 80 cycles

I observed that performance when from 5Gbit/sec to 10Gbit/sec.
Mostly because the floating point engages more instruction units and does not
pipeline. Cycle count is not everything.  This was on Ivy Bridge processor.


> 3. Alternative
> I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?
>

I looked into multiplative integer method, and will do it in future. But it has
more scaling issues since it would require that the values both be 32 bits.
  

Patch

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 55fbc14..3023457 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -102,6 +102,9 @@ 
 
 #define RTE_SCHED_BMP_POS_INVALID             UINT32_MAX
 
+/* For cycles_per_byte calculation */
+#define RTE_SCHED_TIME_SHIFT		      20
+
 struct rte_sched_subport {
 	/* Token bucket (TB) */
 	uint64_t tb_time; /* time of last update */
@@ -239,7 +242,7 @@  struct rte_sched_port {
 	uint64_t time_cpu_cycles;     /* Current CPU time measured in CPU cyles */
 	uint64_t time_cpu_bytes;      /* Current CPU time measured in bytes */
 	uint64_t time;                /* Current NIC TX time measured in bytes */
-	double cycles_per_byte;       /* CPU cycles per byte */
+	uint32_t cycles_per_byte;       /* CPU cycles per byte (scaled) */
 
 	/* Scheduling loop detection */
 	uint32_t pipe_loop;
@@ -657,7 +660,9 @@  rte_sched_port_config(struct rte_sched_port_params *params)
 	port->time_cpu_cycles = rte_get_tsc_cycles();
 	port->time_cpu_bytes = 0;
 	port->time = 0;
-	port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) params->rate);
+
+	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT)
+		/ params->rate;
 
 	/* Scheduling loop detection */
 	port->pipe_loop = RTE_SCHED_PIPE_INVALID;
@@ -2156,11 +2161,12 @@  rte_sched_port_time_resync(struct rte_sched_port *port)
 {
 	uint64_t cycles = rte_get_tsc_cycles();
 	uint64_t cycles_diff = cycles - port->time_cpu_cycles;
-	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
+	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT)
+		/ port->cycles_per_byte;
 
 	/* Advance port time */
 	port->time_cpu_cycles = cycles;
-	port->time_cpu_bytes += (uint64_t) bytes_diff;
+	port->time_cpu_bytes += bytes_diff;
 	if (port->time < port->time_cpu_bytes) {
 		port->time = port->time_cpu_bytes;
 	}