[dpdk-dev,v2,6/7] rte_sched: eliminate floating point in calculating byte clock
Commit Message
Hi Stephen,
Sorry, NACK.
1. Overflow issue
As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.
2. Integer division has a higher cost than floating point division
My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
with an integer division:
uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
Instruction FDIV double precision: latency = 38-40 cycles
Instruction IDIV: latency = 56 - 80 cycles
3. Alternative
I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?
Thanks,
Cristian
-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Stephen Hemminger
Sent: Thursday, February 5, 2015 6:14 AM
To: dev@dpdk.org
Cc: Stephen Hemminger
Subject: [dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock
From: Stephen Hemminger <shemming@brocade.com>
The old code was doing a floating point divide for each rte_dequeue()
which is very expensive. Change to using fixed point scaled math instead.
This improved performance from 5Gbit/sec to 10 Gbit/sec
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
lib/librte_sched/rte_sched.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
Comments
On Mon, 16 Feb 2015 22:44:31 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu@intel.com> wrote:
> Hi Stephen,
>
> Sorry, NACK.
>
> 1. Overflow issue
> As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
> port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
> Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.
The tsc shifted is still 64 bits.
and rate is 32 bits bytes/sec.
I chose scale such that
if clock = 3 Ghz
then min rate = 715 bytes/sec = 5722 bits/sec
> 2. Integer division has a higher cost than floating point division
> My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
> double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> with an integer division:
> uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
> I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
> Instruction FDIV double precision: latency = 38-40 cycles
> Instruction IDIV: latency = 56 - 80 cycles
I observed that performance when from 5Gbit/sec to 10Gbit/sec.
Mostly because the floating point engages more instruction units and does not
pipeline. Cycle count is not everything. This was on Ivy Bridge processor.
> 3. Alternative
> I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?
>
I looked into multiplative integer method, and will do it in future. But it has
more scaling issues since it would require that the values both be 32 bits.
@@ -102,6 +102,9 @@
#define RTE_SCHED_BMP_POS_INVALID UINT32_MAX
+/* For cycles_per_byte calculation */
+#define RTE_SCHED_TIME_SHIFT 20
+
struct rte_sched_subport {
/* Token bucket (TB) */
uint64_t tb_time; /* time of last update */
@@ -239,7 +242,7 @@ struct rte_sched_port {
uint64_t time_cpu_cycles; /* Current CPU time measured in CPU cyles */
uint64_t time_cpu_bytes; /* Current CPU time measured in bytes */
uint64_t time; /* Current NIC TX time measured in bytes */
- double cycles_per_byte; /* CPU cycles per byte */
+ uint32_t cycles_per_byte; /* CPU cycles per byte (scaled) */
/* Scheduling loop detection */
uint32_t pipe_loop;
@@ -657,7 +660,9 @@ rte_sched_port_config(struct rte_sched_port_params *params)
port->time_cpu_cycles = rte_get_tsc_cycles();
port->time_cpu_bytes = 0;
port->time = 0;
- port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) params->rate);
+
+ port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT)
+ / params->rate;
/* Scheduling loop detection */
port->pipe_loop = RTE_SCHED_PIPE_INVALID;
@@ -2156,11 +2161,12 @@ rte_sched_port_time_resync(struct rte_sched_port *port)
{
uint64_t cycles = rte_get_tsc_cycles();
uint64_t cycles_diff = cycles - port->time_cpu_cycles;
- double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
+ uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT)
+ / port->cycles_per_byte;
/* Advance port time */
port->time_cpu_cycles = cycles;
- port->time_cpu_bytes += (uint64_t) bytes_diff;
+ port->time_cpu_bytes += bytes_diff;
if (port->time < port->time_cpu_bytes) {
port->time = port->time_cpu_bytes;
}