[dpdk-dev,v4,16/17] ring: add sched_yield to avoid spin forever

Message ID 1422842559-13617-17-git-send-email-cunming.liang@intel.com (mailing list archive)
State Superseded, archived
Headers

Commit Message

Cunming Liang Feb. 2, 2015, 2:02 a.m. UTC
  Add a sched_yield() syscall if the thread spins for too long, waiting for another thread to finish its operations on the ring.
That gives the pre-empted thread a chance to proceed and finish its ring enqueue/dequeue operation.
The purpose is to reduce contention on the ring.

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
---
 lib/librte_ring/rte_ring.h | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)
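
For orientation before the review comments, here is a minimal, self-contained sketch of the pause-then-yield pattern the patch introduces. Names are hypothetical, and _mm_pause() stands in for the x86 primitive that rte_pause() wraps:

#include <sched.h>
#include <emmintrin.h>      /* _mm_pause(), the x86 "pause" hint */

#define PAUSE_REP 0x100     /* mirrors RTE_RING_PAUSE_REP from the patch */

/* Spin until *tail reaches expected, yielding the CPU after every
 * PAUSE_REP pauses so a pre-empted peer thread gets a chance to run. */
static void
wait_for_tail(volatile unsigned *tail, unsigned expected)
{
	unsigned rep;

	do {
		for (rep = PAUSE_REP; rep != 0 && *tail != expected; rep--)
			_mm_pause();
		if (rep == 0)   /* pause budget exhausted, tail still behind */
			sched_yield();
	} while (rep == 0);
}

When no thread is pre-empted, the loop normally exits within a handful of pauses and sched_yield() is never reached.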
  

Comments

Olivier Matz Feb. 6, 2015, 3:19 p.m. UTC | #1
Hi,

On 02/02/2015 03:02 AM, Cunming Liang wrote:
> Add a sched_yield() syscall if the thread spins for too long, waiting for another thread to finish its operations on the ring.
> That gives the pre-empted thread a chance to proceed and finish its ring enqueue/dequeue operation.
> The purpose is to reduce contention on the ring.
> 
> Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> ---
>  lib/librte_ring/rte_ring.h | 35 +++++++++++++++++++++++++++++------
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 39bacdd..c402c73 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -126,6 +126,7 @@ struct rte_ring_debug_stats {
>  
>  #define RTE_RING_NAMESIZE 32 /**< The maximum length of a ring name. */
>  #define RTE_RING_MZ_PREFIX "RG_"
> +#define RTE_RING_PAUSE_REP 0x100  /**< Yield after this many pauses. */
>  
>  /**
>   * An RTE ring structure.
> @@ -410,7 +411,7 @@ __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
>  	uint32_t cons_tail, free_entries;
>  	const unsigned max = n;
>  	int success;
> -	unsigned i;
> +	unsigned i, rep;
>  	uint32_t mask = r->prod.mask;
>  	int ret;
>  
> @@ -468,8 +469,19 @@ __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
>  	 * If there are other enqueues in progress that preceded us,
>  	 * we need to wait for them to complete
>  	 */
> -	while (unlikely(r->prod.tail != prod_head))
> -		rte_pause();
> +	do {
> +		/* avoid spinning too long waiting for another thread to finish */
> +		for (rep = RTE_RING_PAUSE_REP;
> +		     rep != 0 && r->prod.tail != prod_head; rep--)
> +			rte_pause();
> +
> +		/*
> +		 * It gives a pre-empted thread a chance to proceed and
> +		 * finish its ring enqueue operation.
> +		 */
> +		if (rep == 0)
> +			sched_yield();
> +	} while (rep == 0);
>  
>  	r->prod.tail = prod_next;
>  	return ret;
> @@ -589,7 +601,7 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
>  	uint32_t cons_next, entries;
>  	const unsigned max = n;
>  	int success;
> -	unsigned i;
> +	unsigned i, rep;
>  	uint32_t mask = r->prod.mask;
>  
>  	/* move cons.head atomically */
> @@ -634,8 +646,19 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
>  	 * If there are other dequeues in progress that preceded us,
>  	 * we need to wait for them to complete
>  	 */
> -	while (unlikely(r->cons.tail != cons_head))
> -		rte_pause();
> +	do {
> +		/* avoid spinning too long waiting for another thread to finish */
> +		for (rep = RTE_RING_PAUSE_REP;
> +		     rep != 0 && r->cons.tail != cons_head; rep--)
> +			rte_pause();
> +
> +		/*
> +		 * It gives a pre-empted thread a chance to proceed and
> +		 * finish its ring dequeue operation.
> +		 */
> +		if (rep == 0)
> +			sched_yield();
> +	} while (rep == 0);
>  
>  	__RING_STAT_ADD(r, deq_success, n);
>  	r->cons.tail = cons_next;
> 

The ring library was designed with the assumption that the code is not
preemptable. The code is lock-less but not wait-less. Actually, if a
thread is preempted at a bad moment, the other threads can spin for a
very long time, until the pre-empted thread gets rescheduled.

I wonder whether adding a sched_yield() might penalize the current
deployments that use only one pthread per core? Even if there is only
one pthread in the scheduler queue for this CPU, calling into the
scheduler code may cost thousands of cycles.
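
A rough, standalone way to put a number on that cost for a given machine (illustrative only, not part of the thread):

#include <sched.h>
#include <stdio.h>
#include <time.h>

/* Average the cost of sched_yield() over many calls. With a single
 * runnable thread on the core, this approximates the penalty paid
 * each time the pause budget runs out. */
int main(void)
{
	struct timespec t0, t1;
	const long iters = 1000000;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sched_yield();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (t1.tv_nsec - t0.tv_nsec);
	printf("sched_yield(): ~%.0f ns/call\n", ns / iters);
	return 0;
}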

Also, where does this value "RTE_RING_PAUSE_REP 0x100" come from?
Why is 0x100 better than 42 or than 10000?

I think it would be good to check whether there is a performance
impact with this change, especially where there is a lot of contention
on the ring. If it has an impact, what about adding a compile-time or
runtime option?
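
One possible shape for such a check, as a sketch only: it assumes EAL is initialised, `ring` was created MP/MC via rte_ring_create() with a size well above lcore count x BURST, and it uses the bulk-call return convention of this era (0 on success). Each worker lcore is started with rte_eal_remote_launch(lcore_bench, NULL, lcore):

#include <stdio.h>
#include <rte_ring.h>
#include <rte_cycles.h>
#include <rte_lcore.h>

#define ITERS 1000000
#define BURST 32

static struct rte_ring *ring;

/* Each lcore runs enqueue/dequeue burst pairs on the shared ring,
 * contending on both the producer and the consumer head/tail updates.
 * Compare cycles per pair with and without the sched_yield() change,
 * and with more lcores than physical cores to force pre-emption. */
static int
lcore_bench(void *arg)
{
	void *objs[BURST] = { arg };
	uint64_t start = rte_rdtsc();
	long i;

	for (i = 0; i < ITERS; i++) {
		while (rte_ring_mp_enqueue_bulk(ring, objs, BURST) != 0)
			rte_pause();
		while (rte_ring_mc_dequeue_bulk(ring, objs, BURST) != 0)
			rte_pause();
	}
	printf("lcore %u: %.1f cycles per burst pair\n", rte_lcore_id(),
	       (double)(rte_rdtsc() - start) / ITERS);
	return 0;
}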


Regards,
Olivier
  
Ananyev, Konstantin Feb. 9, 2015, 3:43 p.m. UTC | #2
Hi Olivier,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Friday, February 06, 2015 3:20 PM
> To: Liang, Cunming; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4 16/17] ring: add sched_yield to avoid spin forever
> 
> Hi,
> 
> On 02/02/2015 03:02 AM, Cunming Liang wrote:
> > Add a sched_yield() syscall if the thread spins for too long, waiting for another thread to finish its operations on the ring.
> > That gives the pre-empted thread a chance to proceed and finish its ring enqueue/dequeue operation.
> > The purpose is to reduce contention on the ring.
> >
> > [...]
> 
> The ring library was designed with the assumption that the code is not
> preemptable. The code is lock-less but not wait-less. Actually, if a
> thread is preempted at a bad moment, the other threads can spin for a
> very long time, until the pre-empted thread gets rescheduled.
> 
> I wonder whether adding a sched_yield() might penalize the current
> deployments that use only one pthread per core? Even if there is only
> one pthread in the scheduler queue for this CPU, calling into the
> scheduler code may cost thousands of cycles.
> 
> Also, where does this value "RTE_RING_PAUSE_REP 0x100" come from?
> Why is 0x100 better than 42 or than 10000?

The idea was to have something a few times bigger than the actual number of
active cores in the system, to minimise the chance of a sched_yield() being
called in the case where we have one thread per physical core.
My thought was that having that many repeats would make such a chance negligible.
Though, right now, I don't have any data to back it up.
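
For a rough sense of scale (illustrative numbers, not measurements from this thread): rte_pause() costs on the order of tens of cycles, so 0x100 = 256 repeats means very roughly 256 x 30 = ~7700 cycles, a few microseconds at 2-3 GHz, before the first yield. That is far longer than one pass through the enqueue/dequeue critical section, so a peer thread that is actually running should virtually always finish within the budget.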

> I think it would be good to check whether there is a performance
> impact with this change, especially where there is a lot of contention
> on the ring. If it has an impact, what about adding a compile-time or
> runtime option?

Good idea; probably we should make RTE_RING_PAUSE_REP a configuration option
and, say, avoid emitting sched_yield() at all if RTE_RING_PAUSE_REP == 0.

Konstantin

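A sketch of what that could look like, reusing the names from the patch (hypothetical, not a posted revision; only the enqueue-side wait loop is shown, the dequeue side would mirror it):

#ifndef RTE_RING_PAUSE_REP
#define RTE_RING_PAUSE_REP 0 /**< 0 disables sched_yield() entirely. */
#endif

	/* ... inside __rte_ring_mp_do_enqueue() ... */
#if RTE_RING_PAUSE_REP == 0
	/* non-preemptable deployments keep today's tight loop */
	while (unlikely(r->prod.tail != prod_head))
		rte_pause();
#else
	do {
		for (rep = RTE_RING_PAUSE_REP;
		     rep != 0 && r->prod.tail != prod_head; rep--)
			rte_pause();
		if (rep == 0)   /* budget spent: let a pre-empted peer run */
			sched_yield();
	} while (rep == 0);
#endif

With RTE_RING_PAUSE_REP == 0 the compiler never sees the yield path, so the single-pthread-per-core case is unchanged.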
  
Olivier Matz Feb. 10, 2015, 4:53 p.m. UTC | #3
Hi Konstantin,

On 02/09/2015 04:43 PM, Ananyev, Konstantin wrote:
>> The ring library was designed with the assumption that the code is not
>> preemptable. The code is lock-less but not wait-less. Actually, if a
>> thread is preempted at a bad moment, the other threads can spin for a
>> very long time, until the pre-empted thread gets rescheduled.
>>
>> I wonder whether adding a sched_yield() might penalize the current
>> deployments that use only one pthread per core? Even if there is only
>> one pthread in the scheduler queue for this CPU, calling into the
>> scheduler code may cost thousands of cycles.
>>
>> Also, where does this value "RTE_RING_PAUSE_REP 0x100" come from?
>> Why is 0x100 better than 42 or than 10000?
>
> The idea was to have something a few times bigger than the actual number of
> active cores in the system, to minimise the chance of a sched_yield() being
> called in the case where we have one thread per physical core.
> My thought was that having that many repeats would make such a chance negligible.
> Though, right now, I don't have any data to back it up.
>
>> I think it would be good to check whether there is a performance
>> impact with this change, especially where there is a lot of contention
>> on the ring. If it has an impact, what about adding a compile-time or
>> runtime option?
>
> Good idea; probably we should make RTE_RING_PAUSE_REP a configuration option
> and, say, avoid emitting sched_yield() at all if RTE_RING_PAUSE_REP == 0.

Yes, it looks like a good compromise.

Olivier
  

Patch

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 39bacdd..c402c73 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -126,6 +126,7 @@  struct rte_ring_debug_stats {
 
 #define RTE_RING_NAMESIZE 32 /**< The maximum length of a ring name. */
 #define RTE_RING_MZ_PREFIX "RG_"
+#define RTE_RING_PAUSE_REP 0x100  /**< Yield after this many pauses. */
 
 /**
  * An RTE ring structure.
@@ -410,7 +411,7 @@  __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
 	uint32_t cons_tail, free_entries;
 	const unsigned max = n;
 	int success;
-	unsigned i;
+	unsigned i, rep;
 	uint32_t mask = r->prod.mask;
 	int ret;
 
@@ -468,8 +469,19 @@  __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
 	 * If there are other enqueues in progress that preceded us,
 	 * we need to wait for them to complete
 	 */
-	while (unlikely(r->prod.tail != prod_head))
-		rte_pause();
+	do {
+		/* avoid spinning too long waiting for another thread to finish */
+		for (rep = RTE_RING_PAUSE_REP;
+		     rep != 0 && r->prod.tail != prod_head; rep--)
+			rte_pause();
+
+		/*
+		 * It gives a pre-empted thread a chance to proceed and
+		 * finish its ring enqueue operation.
+		 */
+		if (rep == 0)
+			sched_yield();
+	} while (rep == 0);
 
 	r->prod.tail = prod_next;
 	return ret;
@@ -589,7 +601,7 @@  __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
 	uint32_t cons_next, entries;
 	const unsigned max = n;
 	int success;
-	unsigned i;
+	unsigned i, rep;
 	uint32_t mask = r->prod.mask;
 
 	/* move cons.head atomically */
@@ -634,8 +646,19 @@  __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
 	 * If there are other dequeues in progress that preceded us,
 	 * we need to wait for them to complete
 	 */
-	while (unlikely(r->cons.tail != cons_head))
-		rte_pause();
+	do {
+		/* avoid spinning too long waiting for another thread to finish */
+		for (rep = RTE_RING_PAUSE_REP;
+		     rep != 0 && r->cons.tail != cons_head; rep--)
+			rte_pause();
+
+		/*
+		 * It gives a pre-empted thread a chance to proceed and
+		 * finish its ring dequeue operation.
+		 */
+		if (rep == 0)
+			sched_yield();
+	} while (rep == 0);
 
 	__RING_STAT_ADD(r, deq_success, n);
 	r->cons.tail = cons_next;