[v2] eal/unix: allow creating thread with real-time priority

Message ID 20231025151352.995318-1-thomas@monjalon.net (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: [v2] eal/unix: allow creating thread with real-time priority

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation fail ninja build failure
ci/Intel-compilation fail Compilation issues
ci/github-robot: build fail github build: failed
ci/iol-testing fail build patch failure

Commit Message

Thomas Monjalon Oct. 25, 2023, 3:13 p.m. UTC
  When the API for creating threads was added,
real-time priority was forbidden on Unix.

There is a known issue with rte_ring behaviour
when real-time priority is used,
but that is not a reason to forbid it completely.

A real-time thread can block kernel threads on the same core,
making the system unstable.
That is why a pause is added in the test thread.
This pause is a new API function, rte_thread_yield(),
available on both Unix and Windows.

Fixes: ca04c78b6262 ("eal: get/set thread priority per thread identifier")
Fixes: ce6e911d20f6 ("eal: add thread lifetime API")
Fixes: a7ba40b2b1bf ("drivers: convert to internal control threads")
Cc: stable@dpdk.org

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_threads.c                       | 11 +---------
 .../prog_guide/env_abstraction_layer.rst      |  4 +++-
 lib/eal/include/rte_thread.h                  | 13 ++++++++++--
 lib/eal/unix/rte_thread.c                     | 21 +++++++++++--------
 lib/eal/version.map                           |  3 +++
 lib/eal/windows/rte_thread.c                  |  6 ++++++
 6 files changed, 36 insertions(+), 22 deletions(-)
  

Comments

Stephen Hemminger Oct. 25, 2023, 3:37 p.m. UTC | #1
On Wed, 25 Oct 2023 17:13:14 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

>  	case RTE_THREAD_PRIORITY_REALTIME_CRITICAL:
> +		/*
> +		 * WARNING: Real-time busy loop takes priority on kernel threads,
> +		 *          making the system unstable.
> +		 *          There is also a known issue when using rte_ring.
> +		 */

I was thinking something like:

	static bool warned;
	if (!warned) {
		RTE_LOG(NOTICE, EAL, "Real time priority is unstable when thread is polling without sleep\n");
		warned = true;
	}
  
Thomas Monjalon Oct. 25, 2023, 4:46 p.m. UTC | #2
25/10/2023 17:37, Stephen Hemminger:
> On Wed, 25 Oct 2023 17:13:14 +0200
> Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> >  	case RTE_THREAD_PRIORITY_REALTIME_CRITICAL:
> > +		/*
> > +		 * WARNING: Real-time busy loop takes priority on kernel threads,
> > +		 *          making the system unstable.
> > +		 *          There is also a known issue when using rte_ring.
> > +		 */
> 
> I was thinking something like:
> 
> 	static bool warned;
> 	if (!warned) {
> 		RTE_LOG(NOTICE, EAL, "Real time priority is unstable when thread is polling without sleep\n");
> 		warned = true;
> 	}

I'm not sure about bothering users.
They may fear something is wrong even if the developer took care of it.
I think doc warnings aimed at developers are more appropriate.
I've added notes in the API.
  
Morten Brørup Oct. 25, 2023, 5:54 p.m. UTC | #3
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 25 October 2023 18.46
> 
> 25/10/2023 17:37, Stephen Hemminger:
> > On Wed, 25 Oct 2023 17:13:14 +0200
> > Thomas Monjalon <thomas@monjalon.net> wrote:
> >
> > >  	case RTE_THREAD_PRIORITY_REALTIME_CRITICAL:
> > > +		/*
> > > +		 * WARNING: Real-time busy loop takes priority on kernel
> threads,
> > > +		 *          making the system unstable.
> > > +		 *          There is also a known issue when using
> rte_ring.
> > > +		 */
> >
> > I was thinking something like:
> >
> > 	static bool warned;
> > 	if (!warned) {
> > 		RTE_LOG(NOTICE, EAL, "Real time priority is unstable when
> thread is polling without sleep\n");
> > 		warned = true;
> > 	}
> 
> I'm not sure about bothering users.
> They can fear something is wrong even if the developer took care of it.
> I think doc warnings for developers are more appropriate.
> I've added notes in the API.

I agree with Thomas on this.

If you want the log message, please degrade it to INFO or DEBUG level. It is only relevant when chasing problems, not for normal production - and thus NOTICE is too high.


Someone might build a kernel with options to keep non-dataplane threads off some dedicated CPU cores, so they can be used for guaranteed low-latency dataplane threads. We do. We don't use real-time priority, though.

For reference, we did some experiments (using this custom built kernel) with a dedicated thread doing nothing but a loop calling rte_rdtsc_precise() and registering the delta. Although the overwhelming majority are ca. 80 CPU cycles, there are some big outliers at ca. 9,000 CPU cycles. (Order of magnitude: ca. 45 of these big outliers per minute.) Apparently some kernel threads steal some cycles from this thread, regardless of our customizations. We haven't bothered analyzing and optimizing it further.

I think our experiment supports the need to allow kernel threads to run, e.g. by calling sleep() or similar, when an EAL thread has real-time priority.
  
Stephen Hemminger Oct. 25, 2023, 9:33 p.m. UTC | #4
On Wed, 25 Oct 2023 19:54:06 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> I agree with Thomas on this.
> 
> If you want the log message, please degrade it to INFO or DEBUG level. It is only relevant when chasing problems, not for normal production - and thus NOTICE is too high.

I don't want the message to be hidden.
If we get any bug reports, we want to be able to say "read the log, don't do that".

> Someone might build a kernel with options to keep non-dataplane threads off some dedicated CPU cores, so they can be used for guaranteed low-latency dataplane threads. We do. We don't use real-time priority, though.

This is really hard to do. Isolated CPUs are not isolated from interrupts and other sources which end up scheduling work as kernel threads. Plus there is the behavior where the kernel decides to turn a soft irq into a kernel thread, and then starves itself. Under starvation, disk corruption is likely if interrupts never get processed :-(

> For reference, we did some experiments (using this custom built kernel) with a dedicated thread doing nothing but a loop calling rte_rdtsc_precise() and registering the delta. Although the overwhelming majority is ca. CPU 80 cycles, there are some big outliers at ca. 9,000 CPU cycles. (Order of magnitude: ca. 45 of these big outliers per minute.) Apparently some kernel threads steal some cycles from this thread, regardless of our customizations. We haven't bothered analyzing and optimizing it further.

Was this on isolated CPU?
Did you check that that CPU was excluded from the smp_affinity mask on all devices?
Did you enable the kernel feature to avoid clock ticks if CPU is dedicated?
Same thing for RCU, need to adjust parameters?

Also, on many systems there can be SMI BIOS hidden execution that will cause big outliers.

Lastly, never try to use CPU 0. The kernel uses CPU 0 as a catch-all in lots of places.

> I think our experiment supports the need to allow kernel threads to run, e.g. by calling sleep() or similar, when an EAL thread has real-time priority.
  
Stephen Hemminger Oct. 26, 2023, midnight UTC | #5
On Wed, 25 Oct 2023 19:54:06 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> Someone might build a kernel with options to keep non-dataplane threads off some dedicated CPU cores, so they can be used for guaranteed low-latency dataplane threads. We do. We don't use real-time priority, though.
> 
> For reference, we did some experiments (using this custom built kernel) with a dedicated thread doing nothing but a loop calling rte_rdtsc_precise() and registering the delta. Although the overwhelming majority is ca. CPU 80 cycles, there are some big outliers at ca. 9,000 CPU cycles. (Order of magnitude: ca. 45 of these big outliers per minute.) Apparently some kernel threads steal some cycles from this thread, regardless of our customizations. We haven't bothered analyzing and optimizing it further.
> 
> I think our experiment supports the need to allow kernel threads to run, e.g. by calling sleep() or similar, when an EAL thread has real-time priority.

First, we need to dispel the myth that real-time is faster on Linux.
It isn't; you can ask the RT kernel developers if you need more convincing.
The purpose of RT is to have a user process respond in a deterministic fashion
to a kernel event (i.e. usually a HW interrupt or IPC). In most cases, this is
not how DPDK applications are written. It is possible to use hardware interrupts
in DPDK, but getting it right is hard, and the latency from HW to kernel to DPDK
in userspace is long. RT will make it more consistent, but it won't
remove the overhead; i.e. less long tail with RT, but average latency will
still be too long for most network applications.

When users say "but my polling thread is getting latency", the answer is to make sure
the application is running on dedicated cores and the system is configured to
not use those cores for HW events. Doing RT won't fix that.

Then some users say "but I want to run multiple polling threads on a single core".
That plainly won't work, no matter what the priority.
  
Morten Brørup Oct. 26, 2023, 7:33 a.m. UTC | #6
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 25 October 2023 23.33
> 
> On Wed, 25 Oct 2023 19:54:06 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > I agree with Thomas on this.
> >
> > If you want the log message, please degrade it to INFO or DEBUG level. It is
> only relevant when chasing problems, not for normal production - and thus
> NOTICE is too high.
> 
> I don't want the message to be hidden.
> If we get any bug reports want to be able to say "read the log, don't do
> that".

Since Stephen is arguing so strongly for it, I have changed my mind, and now support Stephen's suggestion.

It's a tradeoff: Noise for carefully designed systems, vs. important bug hunting information for systems under development (or casually developed systems).
As Stephen points out, it is a good starting point to check for bug reports possibly related to this. And I suppose the experienced users who really understand it will not be seriously confused by such a NOTICE message in the log.

> 
> > Someone might build a kernel with options to keep non-dataplane threads off
> some dedicated CPU cores, so they can be used for guaranteed low-latency
> dataplane threads. We do. We don't use real-time priority, though.
> 
> This is really, hard to do.

As my kids would say: This is really, really, really, really, really hard to do!

We have not been able to find an authoritative source of documentation describing how to do it. :-(

And our experiment shows that we didn't 100 % succeed doing it. But we got close enough for our purposes. Outliers of max 9,000 CPU cycles on a 3+ GHz CPU corresponds to max 3 microseconds of added worst-case latency.

It would be great for latency-sensitive applications if the DPDK documentation went more into detail on this topic. However, if the DPDK runs on top of a Linux distro, it essentially depends on the distro, and should be documented there. And if running on top of a custom built Linux Kernel, it essentially depends on the kernel, and should be documented there. In other words: Such information should be contributed there, and not in the DPDK documentation. ;-)

> Isolated CPU's are not isolated from interrupts
> and other sources which end up scheduling work as kernel threads. Plus there
> is the behavior where kernel decides to turn a soft irq into a kernel thread,
> then starve itself.

We have configured the kernel to put all of this on CPU 0. (Details further below.)

> Under starvation, disk corruption is likely if interrupts never get
> processed :-(
> 
> > For reference, we did some experiments (using this custom built kernel) with
> a dedicated thread doing nothing but a loop calling rte_rdtsc_precise() and
> registering the delta. Although the overwhelming majority is ca. CPU 80
> cycles, there are some big outliers at ca. 9,000 CPU cycles. (Order of
> magnitude: ca. 45 of these big outliers per minute.) Apparently some kernel
> threads steal some cycles from this thread, regardless of our customizations.
> We haven't bothered analyzing and optimizing it further.
> 
> Was this on isolated CPU?

Yes. We isolate all CPUs but CPU 0.

> Did you check that that CPU was excluded from the smp_affinty mask on all
> devices?

Not sure how to do that?

NB: We are currently only using single-socket hardware - this makes some things easier. Perhaps this is one of those things?

> Did you enable the kernel feature to avoid clock ticks if CPU is dedicated?

Yes:
# Timers subsystem
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y

CONFIG_CMDLINE="isolcpus=1-32 irqaffinity=0 rcu_nocb_poll"

> Same thing for RCU, need to adjust parameters?

Yes:
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

> 
> Also, on many systems there can be SMI BIOS hidden execution that will cause
> big outliers.

Yes, this is a big surprise to many people, when it happens. Our hardware doesn't suffer from that.

> 
> Lastly never try and use CPU 0. The kernel uses CPU 0 as catch all in lots of
> places.

Yes, this is very important! We treat CPU 0 as if any random process or interrupt handler can take it away at any time.

> 
> > I think our experiment supports the need to allow kernel threads to run,
> e.g. by calling sleep() or similar, when an EAL thread has real-time priority.
  
Stephen Hemminger Oct. 26, 2023, 4:32 p.m. UTC | #7
On Thu, 26 Oct 2023 09:33:42 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 25 October 2023 23.33
> > 
> > On Wed, 25 Oct 2023 19:54:06 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >   
> > > I agree with Thomas on this.
> > >
> > > If you want the log message, please degrade it to INFO or DEBUG level. It is  
> > only relevant when chasing problems, not for normal production - and thus
> > NOTICE is too high.
> > 
> > I don't want the message to be hidden.
> > If we get any bug reports want to be able to say "read the log, don't do
> > that".  
> 
> Since Stephen is arguing so strongly for it, I have changed my mind, and now support Stephen's suggestion.
> 
> It's a tradeoff: Noise for carefully designed systems, vs. important bug hunting information for systems under development (or casually developed systems).
> As Stephen points out, it is a good starting point to check for bug reports possibly related to this. And, I suppose the experienced users who really understands it will not be seriously confused by such a NOTICE message in the log.
> 
> >   
> > > Someone might build a kernel with options to keep non-dataplane threads off  
> > some dedicated CPU cores, so they can be used for guaranteed low-latency
> > dataplane threads. We do. We don't use real-time priority, though.
> > 
> > This is really, hard to do.  
> 
> As my kids would say: This is really, really, really, really, really hard to do!
> 
> We have not been able to find an authoritative source of documentation describing how to do it. :-(
> 
> And our experiment shows that we didn't 100 % succeed doing it. But we got close enough for our purposes. Outliers of max 9,000 CPU cycles on a 3+ GHz CPU corresponds to max 3 microseconds of added worst-case latency.
> 
> It would be great for latency-sensitive applications if the DPDK documentation went more into detail on this topic. However, if the DPDK runs on top of a Linux distro, it essentially depends on the distro, and should be documented there. And if running on top of a custom built Linux Kernel, it essentially depends on the kernel, and should be documented there. In other words: Such information should be contributed there, and not in the DPDK documentation. ;-)
> 
> > Isolated CPU's are not isolated from interrupts
> > and other sources which end up scheduling work as kernel threads. Plus there
> > is the behavior where kernel decides to turn a soft irq into a kernel thread,
> > then starve itself.  
> 
> We have configured the kernel to put all of this on CPU 0. (Details further below.)
> 
> > Under starvation, disk corruption is likely if interrupts never get
> > processed :-(
> >   
> > > For reference, we did some experiments (using this custom built kernel) with  
> > a dedicated thread doing nothing but a loop calling rte_rdtsc_precise() and
> > registering the delta. Although the overwhelming majority is ca. CPU 80
> > cycles, there are some big outliers at ca. 9,000 CPU cycles. (Order of
> > magnitude: ca. 45 of these big outliers per minute.) Apparently some kernel
> > threads steal some cycles from this thread, regardless of our customizations.
> > We haven't bothered analyzing and optimizing it further.
> > 
> > Was this on isolated CPU?  
> 
> Yes. We isolate all CPUs but CPU 0.
> 
> > Did you check that that CPU was excluded from the smp_affinty mask on all
> > devices?  
> 
> Not sure how to do that?
> 
> NB: We are currently only using single-socket hardware - this makes some things easier. Perhaps this is one of those things?
> 
> > Did you enable the kernel feature to avoid clock ticks if CPU is dedicated?  
> 
> Yes:
> # Timers subsystem
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> CONFIG_NO_HZ_FULL=y
> CONFIG_NO_HZ_FULL_ALL=y
> 
> CONFIG_CMDLINE="isolcpus=1-32 irqaffinity=0 rcu_nocb_poll"
> 
> > Same thing for RCU, need to adjust parameters?  
> 
> Yes:
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> CONFIG_SRCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_CONTEXT_TRACKING=y
> CONFIG_RCU_NOCB_CPU=y
> CONFIG_RCU_NOCB_CPU_ALL=y
> 
> > 
> > Also, on many systems there can be SMI BIOS hidden execution that will cause
> > big outliers.  
> 
> Yes, this is a big surprise to many people, when it happens. Our hardware doesn't suffer from that.
> 
> > 
> > Lastly never try and use CPU 0. The kernel uses CPU 0 as catch all in lots of
> > places.  
> 
> Yes, this is very important! We treat CPU 0 as if any random process or interrupt handler can take it away at any time.
> 
> >   
> > > I think our experiment supports the need to allow kernel threads to run,  
> > e.g. by calling sleep() or similar, when an EAL thread has real-time priority.  
> 

One benefit of a real-time thread is that the kernel will be more precise in
any calls to sleep. If you do a small sleep in a normal thread, the kernel will round
up the timer to avoid reprogramming the timer chip and to save power (fewer wakeups from idle).
With an RT thread it will do "you wanted 21us, OK, for you we will do 21us".

The project that was originally Vyatta has a script that tries to isolate interrupts etc.
I started it, but they have worked on it since then.

   https://github.com/danos/vyatta-cpu-shield

It adjusts kernel workers, softirq, cgroups, etc.
  
Morten Brørup Oct. 26, 2023, 5:07 p.m. UTC | #8
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, 26 October 2023 18.32
> 
> On Thu, 26 Oct 2023 09:33:42 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 25 October 2023 23.33
> > >
> > > On Wed, 25 Oct 2023 19:54:06 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:

[...]

> > > > Someone might build a kernel with options to keep non-dataplane
> threads off
> > > some dedicated CPU cores, so they can be used for guaranteed low-
> latency
> > > dataplane threads. We do. We don't use real-time priority, though.
> > >
> > > This is really, hard to do.
> >
> > As my kids would say: This is really, really, really, really, really
> hard to do!
> >
> > We have not been able to find an authoritative source of
> documentation describing how to do it. :-(

[...]

> One benefit of doing real-time thread is that kernel will be more
> precise in
> any calls to sleep. If you do small sleep in normal thread, the kernel
> will round
> up the timer to try and avoid reprogramming timer chip and to save
> power (less wakeups from idle).
> With RT thread it will do "you wanted 21us, ok for you will do 21us"

So we don't need PR_SET_TIMERSLACK with RT threads?

> 
> The project that was originally Vyatta, has a script that tries to
> isolate interrupts etc.
> I started it but they have worked on it since then.
> 
>    https://github.com/danos/vyatta-cpu-shield
> 
> It adjust kernel workers, softirq, cgroups etc

This script looks very interesting. Thank you, Stephen!
  

Patch

diff --git a/app/test/test_threads.c b/app/test/test_threads.c
index 4ac3f2671a..9a449ba9c5 100644
--- a/app/test/test_threads.c
+++ b/app/test/test_threads.c
@@ -22,7 +22,7 @@  thread_main(void *arg)
 	__atomic_store_n(&thread_id_ready, 1, __ATOMIC_RELEASE);
 
 	while (__atomic_load_n(&thread_id_ready, __ATOMIC_ACQUIRE) == 1)
-		;
+		rte_thread_yield(); /* required in case of real-time priority */
 
 	return 0;
 }
@@ -97,21 +97,12 @@  test_thread_priority(void)
 		"Priority set mismatches priority get");
 
 	priority = RTE_THREAD_PRIORITY_REALTIME_CRITICAL;
-#ifndef RTE_EXEC_ENV_WINDOWS
-	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == ENOTSUP,
-		"Priority set to critical should fail");
-	RTE_TEST_ASSERT(rte_thread_get_priority(thread_id, &priority) == 0,
-		"Failed to get thread priority");
-	RTE_TEST_ASSERT(priority == RTE_THREAD_PRIORITY_NORMAL,
-		"Failed set to critical should have retained normal");
-#else
 	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == 0,
 		"Priority set to critical should succeed");
 	RTE_TEST_ASSERT(rte_thread_get_priority(thread_id, &priority) == 0,
 		"Failed to get thread priority");
 	RTE_TEST_ASSERT(priority == RTE_THREAD_PRIORITY_REALTIME_CRITICAL,
 		"Priority set mismatches priority get");
-#endif
 
 	priority = RTE_THREAD_PRIORITY_NORMAL;
 	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == 0,
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6debf54efb..d1f7cae7cd 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -815,7 +815,9 @@  Known Issues
 
   4. It MAY be used by preemptible multi-producer and/or preemptible multi-consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE or SCHED_BATCH. User SHOULD be aware of the performance penalty before using it.
 
-  5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
+  5. It MUST not be used by multi-producer/consumer pthreads
+     whose scheduling policies are ``SCHED_FIFO``
+     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).
 
   Alternatively, applications can use the lock-free stack mempool handler. When
   considering this handler, note that:
diff --git a/lib/eal/include/rte_thread.h b/lib/eal/include/rte_thread.h
index 8da9d4d3fb..eeccc40532 100644
--- a/lib/eal/include/rte_thread.h
+++ b/lib/eal/include/rte_thread.h
@@ -56,10 +56,11 @@  typedef uint32_t (*rte_thread_func) (void *arg);
  * Thread priority values.
  */
 enum rte_thread_priority {
+	/** Normal thread priority, the default. */
 	RTE_THREAD_PRIORITY_NORMAL            = 0,
-	/**< normal thread priority, the default */
+	/** Highest thread priority, use with caution.
+	 *  WARNING: System may be unstable because of a real-time busy loop. */
 	RTE_THREAD_PRIORITY_REALTIME_CRITICAL = 1,
-	/**< highest thread priority allowed */
 };
 
 /**
@@ -183,6 +184,14 @@  int rte_thread_join(rte_thread_t thread_id, uint32_t *value_ptr);
  */
 int rte_thread_detach(rte_thread_t thread_id);
 
+/**
+ * Allow another thread to run on the same CPU core.
+ *
+ * Especially useful in real-time thread priority.
+ * @see RTE_THREAD_PRIORITY_REALTIME_CRITICAL
+ */
+void rte_thread_yield(void);
+
 /**
  * Get the id of the calling thread.
  *
diff --git a/lib/eal/unix/rte_thread.c b/lib/eal/unix/rte_thread.c
index 36a21ab2f9..399acf2fa0 100644
--- a/lib/eal/unix/rte_thread.c
+++ b/lib/eal/unix/rte_thread.c
@@ -5,6 +5,7 @@ 
 
 #include <errno.h>
 #include <pthread.h>
+#include <sched.h>
 #include <stdbool.h>
 #include <stdlib.h>
 #include <string.h>
@@ -49,6 +50,11 @@  thread_map_priority_to_os_value(enum rte_thread_priority eal_pri, int *os_pri,
 			sched_get_priority_max(SCHED_OTHER)) / 2;
 		break;
 	case RTE_THREAD_PRIORITY_REALTIME_CRITICAL:
+		/*
+		 * WARNING: Real-time busy loop takes priority on kernel threads,
+		 *          making the system unstable.
+		 *          There is also a known issue when using rte_ring.
+		 */
 		*pol = SCHED_RR;
 		*os_pri = sched_get_priority_max(SCHED_RR);
 		break;
@@ -153,11 +159,6 @@  rte_thread_create(rte_thread_t *thread_id,
 			goto cleanup;
 		}
 
-		if (thread_attr->priority ==
-				RTE_THREAD_PRIORITY_REALTIME_CRITICAL) {
-			ret = ENOTSUP;
-			goto cleanup;
-		}
 		ret = thread_map_priority_to_os_value(thread_attr->priority,
 				&param.sched_priority, &policy);
 		if (ret != 0)
@@ -227,6 +228,12 @@  rte_thread_detach(rte_thread_t thread_id)
 	return pthread_detach((pthread_t)thread_id.opaque_id);
 }
 
+void
+rte_thread_yield(void)
+{
+	sched_yield();
+}
+
 int
 rte_thread_equal(rte_thread_t t1, rte_thread_t t2)
 {
@@ -275,10 +282,6 @@  rte_thread_set_priority(rte_thread_t thread_id,
 	int policy;
 	int ret;
 
-	/* Realtime priority can cause crashes on non-Windows platforms. */
-	if (priority == RTE_THREAD_PRIORITY_REALTIME_CRITICAL)
-		return ENOTSUP;
-
 	ret = thread_map_priority_to_os_value(priority, &param.sched_priority,
 		&policy);
 	if (ret != 0)
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e00a844805..0b4a503c5f 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -413,6 +413,9 @@  EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 23.11
+	rte_thread_yield;
 };
 
 INTERNAL {
diff --git a/lib/eal/windows/rte_thread.c b/lib/eal/windows/rte_thread.c
index acf648456c..b0373b1a55 100644
--- a/lib/eal/windows/rte_thread.c
+++ b/lib/eal/windows/rte_thread.c
@@ -304,6 +304,12 @@  rte_thread_detach(rte_thread_t thread_id)
 	return 0;
 }
 
+void
+rte_thread_yield(void)
+{
+	Sleep(0);
+}
+
 int
 rte_thread_equal(rte_thread_t t1, rte_thread_t t2)
 {