eal/unix: allow creating thread with real-time priority

Message ID 20231024125416.798897-1-thomas@monjalon.net (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series eal/unix: allow creating thread with real-time priority |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing fail Unit Testing FAIL
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/intel-Functional success Functional PASS
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS

Commit Message

Thomas Monjalon Oct. 24, 2023, 12:54 p.m. UTC
  When adding an API for creating threads,
the real-time priority has been forbidden on Unix.

There is a known issue with ring behaviour,
but it should not be completely forbidden.

Fixes: ca04c78b6262 ("eal: get/set thread priority per thread identifier")
Fixes: ce6e911d20f6 ("eal: add thread lifetime API")
Fixes: a7ba40b2b1bf ("drivers: convert to internal control threads")
Cc: stable@dpdk.org

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
---
 app/test/test_threads.c                         | 9 ---------
 doc/guides/prog_guide/env_abstraction_layer.rst | 4 +++-
 lib/eal/include/rte_thread.h                    | 2 +-
 lib/eal/unix/rte_thread.c                       | 9 ---------
 4 files changed, 4 insertions(+), 20 deletions(-)
  

Comments

Morten Brørup Oct. 24, 2023, 1:55 p.m. UTC | #1
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, 24 October 2023 14.54
> 
> When adding an API for creating threads,
> the real-time priority has been forbidden on Unix.
> 
> There is a known issue with ring behaviour,
> but it should not be completely forbidden.
> 
> Fixes: ca04c78b6262 ("eal: get/set thread priority per thread identifier")
> Fixes: ce6e911d20f6 ("eal: add thread lifetime API")
> Fixes: a7ba40b2b1bf ("drivers: convert to internal control threads")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> ---

[...]

> @@ -815,7 +815,9 @@ Known Issues
> 
>    4. It MAY be used by preemptible multi-producer and/or preemptible multi-
> consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE
> or SCHED_BATCH. User SHOULD be aware of the performance penalty before using
> it.
> 
> -  5. It MUST not be used by multi-producer/consumer pthreads, whose
> scheduling policies are SCHED_FIFO or SCHED_RR.
> +  5. It MUST not be used by multi-producer/consumer pthreads
> +     whose scheduling policies are ``SCHED_FIFO``
> +     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).

Do the RTS or HTS ring modes make any difference here?

Anyway, I agree that real-time priority should not be forbidden on Unix.

Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Stephen Hemminger Oct. 24, 2023, 4:04 p.m. UTC | #2
On Tue, 24 Oct 2023 15:55:13 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > 
> >    4. It MAY be used by preemptible multi-producer and/or preemptible multi-
> > consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE
> > or SCHED_BATCH. User SHOULD be aware of the performance penalty before using
> > it.
> > 
> > -  5. It MUST not be used by multi-producer/consumer pthreads, whose
> > scheduling policies are SCHED_FIFO or SCHED_RR.
> > +  5. It MUST not be used by multi-producer/consumer pthreads
> > +     whose scheduling policies are ``SCHED_FIFO``
> > +     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).  
> 
> Do the RTS or HTS ring modes make any difference here?
> 
> Anyway, I agree that real-time priority should not be forbidden on Unix.
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

Please add a big warning message in the rte_thread.c and the documentation
to describe the problem. Need to have the "you have been warned" action.

Use of RT priority is incompatible with 100% poll mode as is typically done
in DPDK applications. A real time thread has higher priority than other necessary
kernel threads on the same CPU. Therefore if the RT thread never sleeps, critical
system actions such as delayed writes, network packet processing and timer updates
will not happen which makes the system unstable.

Multiple DPDK users have learned this the hard way.
  
Thomas Monjalon Oct. 25, 2023, 1:15 p.m. UTC | #3
24/10/2023 18:04, Stephen Hemminger:
> On Tue, 24 Oct 2023 15:55:13 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > 
> > >    4. It MAY be used by preemptible multi-producer and/or preemptible multi-
> > > consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE
> > > or SCHED_BATCH. User SHOULD be aware of the performance penalty before using
> > > it.
> > > 
> > > -  5. It MUST not be used by multi-producer/consumer pthreads, whose
> > > scheduling policies are SCHED_FIFO or SCHED_RR.
> > > +  5. It MUST not be used by multi-producer/consumer pthreads
> > > +     whose scheduling policies are ``SCHED_FIFO``
> > > +     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).  
> > 
> > Do the RTS or HTS ring modes make any difference here?
> > 
> > Anyway, I agree that real-time priority should not be forbidden on Unix.
> > 
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Please add a big warning message in the rte_thread.c and the documentation
> to describe the problem. Need to have the "you have been warned" action.

Yes I can add more warnings.

> Use of RT priority is incompatible with 100% poll mode as is typically done
> in DPDK applications. A real time thread has higher priority than other necessary
> kernel threads on the same CPU. Therefore if the RT thread never sleeps, critical
> system actions such as delayed writes, network packet processing and timer updates
> will not happen which makes the system unstable.

Yes, and it is shown by the test on loongarch:
DPDK:fast-tests / threads_autotest        TIMEOUT        80.01s
http://mails.dpdk.org/archives/test-report/2023-October/488760.html

I'll try to pass the test by adding a sleep in the test thread.

> Multiple DPDK users have learned this the hard way.
  
Bruce Richardson Oct. 25, 2023, 1:34 p.m. UTC | #4
On Wed, Oct 25, 2023 at 03:15:49PM +0200, Thomas Monjalon wrote:
> 24/10/2023 18:04, Stephen Hemminger:
> > On Tue, 24 Oct 2023 15:55:13 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> > 
> > > > 
> > > >    4. It MAY be used by preemptible multi-producer and/or preemptible multi-
> > > > consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE
> > > > or SCHED_BATCH. User SHOULD be aware of the performance penalty before using
> > > > it.
> > > > 
> > > > -  5. It MUST not be used by multi-producer/consumer pthreads, whose
> > > > scheduling policies are SCHED_FIFO or SCHED_RR.
> > > > +  5. It MUST not be used by multi-producer/consumer pthreads
> > > > +     whose scheduling policies are ``SCHED_FIFO``
> > > > +     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).  
> > > 
> > > Do the RTS or HTS ring modes make any difference here?
> > > 
> > > Anyway, I agree that real-time priority should not be forbidden on Unix.
> > > 
> > > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > 
> > Please add a big warning message in the rte_thread.c and the documentation
> > to describe the problem. Need to have the "you have been warned" action.
> 
> Yes I can add more warnings.
> 
> > Use of RT priority is incompatible with 100% poll mode as is typically done
> > in DPDK applications. A real time thread has higher priority than other necessary
> > kernel threads on the same CPU. Therefore if the RT thread never sleeps, critical
> > system actions such as delayed writes, network packet processing and timer updates
> > will not happen which makes the system unstable.
> 
> Yes, and it is shown by the test on loongarch:
> DPDK:fast-tests / threads_autotest        TIMEOUT        80.01s
> http://mails.dpdk.org/archives/test-report/2023-October/488760.html
> 
> I'll try to pass the test by adding a sleep in the test thread.
> 

"sched_yield()" rather than sleep perhaps? Might better convey the
intention of the call.
  
Thomas Monjalon Oct. 25, 2023, 1:44 p.m. UTC | #5
25/10/2023 15:34, Bruce Richardson:
> On Wed, Oct 25, 2023 at 03:15:49PM +0200, Thomas Monjalon wrote:
> > 24/10/2023 18:04, Stephen Hemminger:
> > > On Tue, 24 Oct 2023 15:55:13 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > > 
> > > > > 
> > > > >    4. It MAY be used by preemptible multi-producer and/or preemptible multi-
> > > > > consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE
> > > > > or SCHED_BATCH. User SHOULD be aware of the performance penalty before using
> > > > > it.
> > > > > 
> > > > > -  5. It MUST not be used by multi-producer/consumer pthreads, whose
> > > > > scheduling policies are SCHED_FIFO or SCHED_RR.
> > > > > +  5. It MUST not be used by multi-producer/consumer pthreads
> > > > > +     whose scheduling policies are ``SCHED_FIFO``
> > > > > +     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).  
> > > > 
> > > > Do the RTS or HTS ring modes make any difference here?
> > > > 
> > > > Anyway, I agree that real-time priority should not be forbidden on Unix.
> > > > 
> > > > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > > 
> > > Please add a big warning message in the rte_thread.c and the documentation
> > > to describe the problem. Need to have the "you have been warned" action.
> > 
> > Yes I can add more warnings.
> > 
> > > Use of RT priority is incompatible with 100% poll mode as is typically done
> > > in DPDK applications. A real time thread has higher priority than other necessary
> > > kernel threads on the same CPU. Therefore if the RT thread never sleeps, critical
> > > system actions such as delayed writes, network packet processing and timer updates
> > > will not happen which makes the system unstable.
> > 
> > Yes, and it is shown by the test on loongarch:
> > DPDK:fast-tests / threads_autotest        TIMEOUT        80.01s
> > http://mails.dpdk.org/archives/test-report/2023-October/488760.html
> > 
> > I'll try to pass the test by adding a sleep in the test thread.
> > 
> 
> "sched_yield()" rather than sleep perhaps? Might better convey the
> intention of the call.

Do we have sched_yield on Windows?
  
Stephen Hemminger Oct. 25, 2023, 3:08 p.m. UTC | #6
On Wed, 25 Oct 2023 15:44:25 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> > > 
> > > I'll try to pass the test by adding a sleep in the test thread.
> > >   
> > 
> > "sched_yield()" rather than sleep perhaps? Might better convey the
> > intention of the call.  
> 
> Do we have sched_yield on Windows?

Windows has an equivalent but sched_yield() won't work here.
Since the DPDK thread is still higher priority than the kernel thread,
the scheduler will reschedule the DPDK thread. You need to sleep
to let kthread run.
  
Bruce Richardson Oct. 25, 2023, 3:14 p.m. UTC | #7
On Wed, Oct 25, 2023 at 08:08:52AM -0700, Stephen Hemminger wrote:
> On Wed, 25 Oct 2023 15:44:25 +0200
> Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> > > > 
> > > > I'll try to pass the test by adding a sleep in the test thread.
> > > >   
> > > 
> > > "sched_yield()" rather than sleep perhaps? Might better convey the
> > > intention of the call.  
> > 
> > Do we have sched_yield on Windows?
> 
> Windows has an equivalent but sched_yield() won't work here.
> Since the DPDK thread is still higher priority than the kernel thread,
> the scheduler will reschedule the DPDK thread. You need to sleep
> to let kthread run.

Interesting. Thanks for clarifying the situation.
  
Thomas Monjalon Oct. 25, 2023, 3:18 p.m. UTC | #8
25/10/2023 17:14, Bruce Richardson:
> On Wed, Oct 25, 2023 at 08:08:52AM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Oct 2023 15:44:25 +0200
> > Thomas Monjalon <thomas@monjalon.net> wrote:
> > 
> > > > > 
> > > > > I'll try to pass the test by adding a sleep in the test thread.
> > > > >   
> > > > 
> > > > "sched_yield()" rather than sleep perhaps? Might better convey the
> > > > intention of the call.  
> > > 
> > > Do we have sched_yield on Windows?
> > 
> > Windows has an equivalent but sched_yield() won't work here.
> > Since the DPDK thread is still higher priority than the kernel thread,
> > the scheduler will reschedule the DPDK thread. You need to sleep
> > to let kthread run.
> 
> Interesting. Thanks for clarifying the situation.

Indeed interesting.
I've just sent a v2 before reading this.

So I should try a v3 with a sleep.
But then I need to find a better name than rte_thread_yield.
Ideas?
  
Thomas Monjalon Oct. 25, 2023, 3:32 p.m. UTC | #9
25/10/2023 17:18, Thomas Monjalon:
> 25/10/2023 17:14, Bruce Richardson:
> > On Wed, Oct 25, 2023 at 08:08:52AM -0700, Stephen Hemminger wrote:
> > > On Wed, 25 Oct 2023 15:44:25 +0200
> > > Thomas Monjalon <thomas@monjalon.net> wrote:
> > > 
> > > > > > 
> > > > > > I'll try to pass the test by adding a sleep in the test thread.
> > > > > >   
> > > > > 
> > > > > "sched_yield()" rather than sleep perhaps? Might better convey the
> > > > > intention of the call.  
> > > > 
> > > > Do we have sched_yield on Windows?
> > > 
> > > Windows has an equivalent but sched_yield() won't work here.
> > > Since the DPDK thread is still higher priority than the kernel thread,
> > > the scheduler will reschedule the DPDK thread. You need to sleep
> > > to let kthread run.
> > 
> > Interesting. Thanks for clarifying the situation.
> 
> Indeed interesting.
> I've just sent a v2 before reading this.
> 
> So I should try a v3 with a sleep.
> But then I need to find a better name than rte_thread_yield.
> Ideas?

I will go with rte_thread_yield_realtime().
Any sleep will suffice on Linux? What about a nanosleep?
I suppose Sleep(0) is OK on Windows?
  
Thomas Monjalon Oct. 26, 2023, 1:36 p.m. UTC | #10
25/10/2023 18:31, Thomas Monjalon:
> Real-time thread priority was been forbidden on Unix
> because of problems they can cause.
> Warnings and helpers are added to avoid deadlocks,
> so real-time can be allowed on all systems.

Unit test is failing:
DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s

It is seen in only 1 target (maybe the failure occurence is random):
  Debian 11 (Buster) (ARM) | PASS
  Fedora 37 (ARM)          | PASS
  CentOS Stream 9 (ARM)    | FAIL
  Fedora 38 (ARM)          | PASS
  Fedora 38 (ARM Clang)    | PASS
  Ubuntu 20.04 (ARM)       | PASS

I need to send a v4 with new implementation and better comments.
The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a difference.
  
Morten Brørup Oct. 26, 2023, 1:44 p.m. UTC | #11
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 26 October 2023 15.37
> 
> 25/10/2023 18:31, Thomas Monjalon:
> > Real-time thread priority was been forbidden on Unix
> > because of problems they can cause.
> > Warnings and helpers are added to avoid deadlocks,
> > so real-time can be allowed on all systems.
> 
> Unit test is failing:
> DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> 
> It is seen in only 1 target (maybe the failure occurence is random):
>   Debian 11 (Buster) (ARM) | PASS
>   Fedora 37 (ARM)          | PASS
>   CentOS Stream 9 (ARM)    | FAIL
>   Fedora 38 (ARM)          | PASS
>   Fedora 38 (ARM Clang)    | PASS
>   Ubuntu 20.04 (ARM)       | PASS
> 
> I need to send a v4 with new implementation and better comments.
> The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> difference.

It will not make a difference. The kernel will go through the sleeping steps, then wake up again and see the real-time thread is ready to run, and then immediately schedule it.

For testing purposes, consider sleeping 10 milliseconds or something significant like that.
  
Morten Brørup Oct. 26, 2023, 1:57 p.m. UTC | #12
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 26 October 2023 15.45
> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, 26 October 2023 15.37
> >
> > 25/10/2023 18:31, Thomas Monjalon:
> > > Real-time thread priority was been forbidden on Unix
> > > because of problems they can cause.
> > > Warnings and helpers are added to avoid deadlocks,
> > > so real-time can be allowed on all systems.
> >
> > Unit test is failing:
> > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> >
> > It is seen in only 1 target (maybe the failure occurence is random):
> >   Debian 11 (Buster) (ARM) | PASS
> >   Fedora 37 (ARM)          | PASS
> >   CentOS Stream 9 (ARM)    | FAIL
> >   Fedora 38 (ARM)          | PASS
> >   Fedora 38 (ARM Clang)    | PASS
> >   Ubuntu 20.04 (ARM)       | PASS
> >
> > I need to send a v4 with new implementation and better comments.
> > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > difference.
> 
> It will not make a difference. The kernel will go through the sleeping steps,
> then wake up again and see the real-time thread is ready to run, and then
> immediately schedule it.
> 
> For testing purposes, consider sleeping 10 milliseconds or something
> significant like that.

A bit more details...

In our recent tests, nanosleep() itself took around 50 us. So you need to sleep longer than that for your thread not to be runnable when the nanosleep() wakes up again, because 50 us has already passed in "nanosleep overhead".
10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on a 1000 Hz kernel. (I don't know if it makes any difference for the kernel scheduler if the timer crosses a jiffy border or not.)
  
Thomas Monjalon Oct. 26, 2023, 2:04 p.m. UTC | #13
26/10/2023 15:57, Morten Brørup:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 26 October 2023 15.45
> > 
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 26 October 2023 15.37
> > >
> > > 25/10/2023 18:31, Thomas Monjalon:
> > > > Real-time thread priority was been forbidden on Unix
> > > > because of problems they can cause.
> > > > Warnings and helpers are added to avoid deadlocks,
> > > > so real-time can be allowed on all systems.
> > >
> > > Unit test is failing:
> > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > >
> > > It is seen in only 1 target (maybe the failure occurence is random):
> > >   Debian 11 (Buster) (ARM) | PASS
> > >   Fedora 37 (ARM)          | PASS
> > >   CentOS Stream 9 (ARM)    | FAIL
> > >   Fedora 38 (ARM)          | PASS
> > >   Fedora 38 (ARM Clang)    | PASS
> > >   Ubuntu 20.04 (ARM)       | PASS
> > >
> > > I need to send a v4 with new implementation and better comments.
> > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > > difference.
> > 
> > It will not make a difference. The kernel will go through the sleeping steps,
> > then wake up again and see the real-time thread is ready to run, and then
> > immediately schedule it.
> > 
> > For testing purposes, consider sleeping 10 milliseconds or something
> > significant like that.
> 
> A bit more details...
> 
> In our recent tests, nanosleep() itself took around 50 us. So you need to sleep longer than that for your thread not to be runnable when the nanosleep() wakes up again, because 50 us has already passed in "nanosleep overhead".
> 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on a 1000 Hz kernel. (I don't know if it makes any difference for the kernel scheduler if the timer crosses a jiffy border or not.)

10 ms looks like an eternity.
I will try.
(Anyway I did a mistake when sending v4)
  
Morten Brørup Oct. 26, 2023, 2:08 p.m. UTC | #14
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 26 October 2023 16.05
> 
> 26/10/2023 15:57, Morten Brørup:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Thursday, 26 October 2023 15.45
> > >
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Thursday, 26 October 2023 15.37
> > > >
> > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > Real-time thread priority was been forbidden on Unix
> > > > > because of problems they can cause.
> > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > so real-time can be allowed on all systems.
> > > >
> > > > Unit test is failing:
> > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > >
> > > > It is seen in only 1 target (maybe the failure occurence is random):
> > > >   Debian 11 (Buster) (ARM) | PASS
> > > >   Fedora 37 (ARM)          | PASS
> > > >   CentOS Stream 9 (ARM)    | FAIL
> > > >   Fedora 38 (ARM)          | PASS
> > > >   Fedora 38 (ARM Clang)    | PASS
> > > >   Ubuntu 20.04 (ARM)       | PASS
> > > >
> > > > I need to send a v4 with new implementation and better comments.
> > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > > > difference.
> > >
> > > It will not make a difference. The kernel will go through the sleeping
> steps,
> > > then wake up again and see the real-time thread is ready to run, and then
> > > immediately schedule it.
> > >
> > > For testing purposes, consider sleeping 10 milliseconds or something
> > > significant like that.
> >
> > A bit more details...
> >
> > In our recent tests, nanosleep() itself took around 50 us. So you need to
> sleep longer than that for your thread not to be runnable when the nanosleep()
> wakes up again, because 50 us has already passed in "nanosleep overhead".
> > 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on
> a 1000 Hz kernel. (I don't know if it makes any difference for the kernel
> scheduler if the timer crosses a jiffy border or not.)
> 
> 10 ms looks like an eternity.

Agree. It is only for functional testing, not for production!

> I will try.
> (Anyway I did a mistake when sending v4)
> 
>
  
David Marchand Oct. 26, 2023, 2:10 p.m. UTC | #15
On Thu, Oct 26, 2023 at 3:57 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 26 October 2023 15.45
> >
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 26 October 2023 15.37
> > >
> > > 25/10/2023 18:31, Thomas Monjalon:
> > > > Real-time thread priority was been forbidden on Unix
> > > > because of problems they can cause.
> > > > Warnings and helpers are added to avoid deadlocks,
> > > > so real-time can be allowed on all systems.
> > >
> > > Unit test is failing:
> > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > >
> > > It is seen in only 1 target (maybe the failure occurence is random):
> > >   Debian 11 (Buster) (ARM) | PASS
> > >   Fedora 37 (ARM)          | PASS
> > >   CentOS Stream 9 (ARM)    | FAIL
> > >   Fedora 38 (ARM)          | PASS
> > >   Fedora 38 (ARM Clang)    | PASS
> > >   Ubuntu 20.04 (ARM)       | PASS
> > >
> > > I need to send a v4 with new implementation and better comments.
> > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > > difference.
> >
> > It will not make a difference. The kernel will go through the sleeping steps,
> > then wake up again and see the real-time thread is ready to run, and then
> > immediately schedule it.
> >
> > For testing purposes, consider sleeping 10 milliseconds or something
> > significant like that.
>
> A bit more details...
>
> In our recent tests, nanosleep() itself took around 50 us. So you need to sleep longer than that for your thread not to be runnable when the nanosleep() wakes up again, because 50 us has already passed in "nanosleep overhead".

You may want to read manual for prctl and look for PR_SET_TIMERSLACK.
https://github.com/openvswitch/ovs/commit/f62629a55894546ff043e8a116c3c57aff73c285

> 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on a 1000 Hz kernel. (I don't know if it makes any difference for the kernel scheduler if the timer crosses a jiffy border or not.)
  
Thomas Monjalon Oct. 26, 2023, 2:30 p.m. UTC | #16
26/10/2023 16:08, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, 26 October 2023 16.05
> > 
> > 26/10/2023 15:57, Morten Brørup:
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Thursday, 26 October 2023 15.45
> > > >
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Thursday, 26 October 2023 15.37
> > > > >
> > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > because of problems they can cause.
> > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > so real-time can be allowed on all systems.
> > > > >
> > > > > Unit test is failing:
> > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > > >
> > > > > It is seen in only 1 target (maybe the failure occurence is random):
> > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > >   Fedora 37 (ARM)          | PASS
> > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > >   Fedora 38 (ARM)          | PASS
> > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > >
> > > > > I need to send a v4 with new implementation and better comments.
> > > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > > > > difference.
> > > >
> > > > It will not make a difference. The kernel will go through the sleeping
> > steps,
> > > > then wake up again and see the real-time thread is ready to run, and then
> > > > immediately schedule it.
> > > >
> > > > For testing purposes, consider sleeping 10 milliseconds or something
> > > > significant like that.
> > >
> > > A bit more details...
> > >
> > > In our recent tests, nanosleep() itself took around 50 us. So you need to
> > sleep longer than that for your thread not to be runnable when the nanosleep()
> > wakes up again, because 50 us has already passed in "nanosleep overhead".
> > > 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on
> > a 1000 Hz kernel. (I don't know if it makes any difference for the kernel
> > scheduler if the timer crosses a jiffy border or not.)
> > 
> > 10 ms looks like an eternity.
> 
> Agree. It is only for functional testing, not for production!

Realtime thread won't make any sense if we have to insert a long sleep.
We need to find a good sleep value or give up with real-time threads.
(note I'm not sure how much it is useful)

> > I will try.
> > (Anyway I did a mistake when sending v4)

I've sent a trial with 1 ms.
  
Morten Brørup Oct. 26, 2023, 2:50 p.m. UTC | #17
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 26 October 2023 16.31
> 
> 26/10/2023 16:08, Morten Brørup:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 26 October 2023 16.05
> > >
> > > 26/10/2023 15:57, Morten Brørup:
> > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > Sent: Thursday, 26 October 2023 15.45
> > > > >
> > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > >
> > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > because of problems they can cause.
> > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > so real-time can be allowed on all systems.
> > > > > >
> > > > > > Unit test is failing:
> > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > > > >
> > > > > > It is seen in only 1 target (maybe the failure occurence is random):
> > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > >   Fedora 37 (ARM)          | PASS
> > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > >   Fedora 38 (ARM)          | PASS
> > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > >
> > > > > > I need to send a v4 with new implementation and better comments.
> > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes a
> > > > > > difference.
> > > > >
> > > > > It will not make a difference. The kernel will go through the sleeping
> > > steps,
> > > > > then wake up again and see the real-time thread is ready to run, and
> then
> > > > > immediately schedule it.
> > > > >
> > > > > For testing purposes, consider sleeping 10 milliseconds or something
> > > > > significant like that.
> > > >
> > > > A bit more details...
> > > >
> > > > In our recent tests, nanosleep() itself took around 50 us. So you need
> to
> > > sleep longer than that for your thread not to be runnable when the
> nanosleep()
> > > wakes up again, because 50 us has already passed in "nanosleep overhead".
> > > > 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies
> on
> > > a 1000 Hz kernel. (I don't know if it makes any difference for the kernel
> > > scheduler if the timer crosses a jiffy border or not.)
> > >
> > > 10 ms looks like an eternity.
> >
> > Agree. It is only for functional testing, not for production!
> 
> Realtime thread won't make any sense if we have to insert a long sleep.

It seems David came to our rescue here!

I have just tried running our test again with prctl(PR_SET_TIMERSLACK) of 1 ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca. 2.5 us.

The timeout parameter to epoll_wait() is in milliseconds, which is useless for low-latency.
Perhaps real-time threads can be used with epoll() combined with timerfd for nanosecond resolution timeout.

> We need to find a good sleep value or give up with real-time threads.
> (note I'm not sure how much it is useful)

Only the application developer knows how much delay is acceptable. Which is why I mentioned that the new yield functions should document the delay.

> 
> > > I will try.
> > > (Anyway I did a mistake when sending v4)
> 
> I've sent a trial with 1 ms.
>
  
Morten Brørup Oct. 26, 2023, 2:59 p.m. UTC | #18
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 26 October 2023 16.50
> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, 26 October 2023 16.31
> >
> > 26/10/2023 16:08, Morten Brørup:
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Thursday, 26 October 2023 16.05
> > > >
> > > > 26/10/2023 15:57, Morten Brørup:
> > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > Sent: Thursday, 26 October 2023 15.45
> > > > > >
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > > >
> > > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > > because of problems they can cause.
> > > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > > so real-time can be allowed on all systems.
> > > > > > >
> > > > > > > Unit test is failing:
> > > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > > > > >
> > > > > > > It is seen in only 1 target (maybe the failure occurence is
> random):
> > > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > > >   Fedora 37 (ARM)          | PASS
> > > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > > >   Fedora 38 (ARM)          | PASS
> > > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > > >
> > > > > > > I need to send a v4 with new implementation and better comments.
> > > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes
> a
> > > > > > > difference.
> > > > > >
> > > > > > It will not make a difference. The kernel will go through the
> sleeping
> > > > steps,
> > > > > > then wake up again and see the real-time thread is ready to run, and
> > then
> > > > > > immediately schedule it.
> > > > > >
> > > > > > For testing purposes, consider sleeping 10 milliseconds or something
> > > > > > significant like that.
> > > > >
> > > > > A bit more details...
> > > > >
> > > > > In our recent tests, nanosleep() itself took around 50 us. So you need
> > to
> > > > sleep longer than that for your thread not to be runnable when the
> > nanosleep()
> > > > wakes up again, because 50 us has already passed in "nanosleep
> overhead".
> > > > > 10 milliseconds provides plenty of margin, and corresponds to 10
> jiffies
> > on
> > > > a 1000 Hz kernel. (I don't know if it makes any difference for the
> kernel
> > > > scheduler if the timer crosses a jiffy border or not.)
> > > >
> > > > 10 ms looks like an eternity.
> > >
> > > Agree. It is only for functional testing, not for production!
> >
> > Realtime thread won't make any sense if we have to insert a long sleep.
> 
> It seems David came to our rescue here!
> 
> I have just tried running our test again with prctl(PR_SET_TIMERSLACK) of 1
> ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca. 2.5 us.
> 
> The timeout parameter to epoll_wait() is in milliseconds, which is useless for
> low-latency.
> Perhaps real-time threads can be used with epoll() combined with timerfd for
> nanosecond resolution timeout.

Or epoll_pwait2(), which has nanosecond resolution timeout.

Unfortunately, rte_epoll_wait() is not an experimental API anymore, so we cannot change its timeout parameter from milliseconds to micro- or nanoseconds. We would have to introduce a new API for this.

> 
> > We need to find a good sleep value or give up with real-time threads.
> > (note I'm not sure how much it is useful)
> 
> Only the application developer knows how much delay is acceptable. Which is
> why I mentioned that the new yield functions should document the delay.
> 
> >
> > > > I will try.
> > > > (Anyway I did a mistake when sending v4)
> >
> > I've sent a trial with 1 ms.
> >
  
Bruce Richardson Oct. 26, 2023, 3:54 p.m. UTC | #19
On Thu, Oct 26, 2023 at 04:59:51PM +0200, Morten Brørup wrote:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 26 October 2023 16.50
> > 
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 26 October 2023 16.31
> > >
> > > 26/10/2023 16:08, Morten Brørup:
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Thursday, 26 October 2023 16.05
> > > > >
> > > > > 26/10/2023 15:57, Morten Brørup:
> > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > Sent: Thursday, 26 October 2023 15.45
> > > > > > >
> > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > > > >
> > > > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > > > because of problems they can cause.
> > > > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > > > so real-time can be allowed on all systems.
> > > > > > > >
> > > > > > > > Unit test is failing:
> > > > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > > > > > >
> > > > > > > > It is seen in only 1 target (maybe the failure occurence is
> > random):
> > > > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > > > >   Fedora 37 (ARM)          | PASS
> > > > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > > > >   Fedora 38 (ARM)          | PASS
> > > > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > > > >
> > > > > > > > I need to send a v4 with new implementation and better comments.
> > > > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes
> > a
> > > > > > > > difference.
> > > > > > >
> > > > > > > It will not make a difference. The kernel will go through the
> > sleeping
> > > > > steps,
> > > > > > > then wake up again and see the real-time thread is ready to run, and
> > > then
> > > > > > > immediately schedule it.
> > > > > > >
> > > > > > > For testing purposes, consider sleeping 10 milliseconds or something
> > > > > > > significant like that.
> > > > > >
> > > > > > A bit more details...
> > > > > >
> > > > > > In our recent tests, nanosleep() itself took around 50 us. So you need
> > > to
> > > > > sleep longer than that for your thread not to be runnable when the
> > > nanosleep()
> > > > > wakes up again, because 50 us has already passed in "nanosleep
> > overhead".
> > > > > > 10 milliseconds provides plenty of margin, and corresponds to 10
> > jiffies
> > > on
> > > > > a 1000 Hz kernel. (I don't know if it makes any difference for the
> > kernel
> > > > > scheduler if the timer crosses a jiffy border or not.)
> > > > >
> > > > > 10 ms looks like an eternity.
> > > >
> > > > Agree. It is only for functional testing, not for production!
> > >
> > > Realtime thread won't make any sense if we have to insert a long sleep.
> > 
> > It seems David came to our rescue here!
> > 
> > I have just tried running our test again with prctl(PR_SET_TIMERSLACK) of 1
> > ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca. 2.5 us.
> > 
> > The timeout parameter to epoll_wait() is in milliseconds, which is useless for
> > low-latency.
> > Perhaps real-time threads can be used with epoll() combined with timerfd for
> > nanosecond resolution timeout.
> 
> Or epoll_pwait2(), which has nanosecond resolution timeout.
> 
> Unfortunately, rte_epoll_wait() is not an experimental API anymore, so we cannot change its timeout parameter from milliseconds to micro- or nanoseconds. We would have to introduce a new API for this.
> 

Just an idea - can we change the timeout parameter to float rather than int,
and then use function versioning for backward compatibility for any
binaries passing int?
That way the actual meaning of the parameter doesn't change, but it still
allows sub-millisecond values (all-be-it with some loss of accuracy due to
float).

/Bruce
  
Thomas Monjalon Oct. 26, 2023, 4:07 p.m. UTC | #20
26/10/2023 17:54, Bruce Richardson:
> On Thu, Oct 26, 2023 at 04:59:51PM +0200, Morten Brørup wrote:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Thursday, 26 October 2023 16.50
> > > 
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Thursday, 26 October 2023 16.31
> > > >
> > > > 26/10/2023 16:08, Morten Brørup:
> > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > Sent: Thursday, 26 October 2023 16.05
> > > > > >
> > > > > > 26/10/2023 15:57, Morten Brørup:
> > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > Sent: Thursday, 26 October 2023 15.45
> > > > > > > >
> > > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > > > > >
> > > > > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > > > > because of problems they can cause.
> > > > > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > > > > so real-time can be allowed on all systems.
> > > > > > > > >
> > > > > > > > > Unit test is failing:
> > > > > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01 s
> > > > > > > > >
> > > > > > > > > It is seen in only 1 target (maybe the failure occurence is
> > > random):
> > > > > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > > > > >   Fedora 37 (ARM)          | PASS
> > > > > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > > > > >   Fedora 38 (ARM)          | PASS
> > > > > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > > > > >
> > > > > > > > > I need to send a v4 with new implementation and better comments.
> > > > > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in case it makes
> > > a
> > > > > > > > > difference.
> > > > > > > >
> > > > > > > > It will not make a difference. The kernel will go through the
> > > sleeping
> > > > > > steps,
> > > > > > > > then wake up again and see the real-time thread is ready to run, and
> > > > then
> > > > > > > > immediately schedule it.
> > > > > > > >
> > > > > > > > For testing purposes, consider sleeping 10 milliseconds or something
> > > > > > > > significant like that.
> > > > > > >
> > > > > > > A bit more details...
> > > > > > >
> > > > > > > In our recent tests, nanosleep() itself took around 50 us. So you need
> > > > to
> > > > > > sleep longer than that for your thread not to be runnable when the
> > > > nanosleep()
> > > > > > wakes up again, because 50 us has already passed in "nanosleep
> > > overhead".
> > > > > > > 10 milliseconds provides plenty of margin, and corresponds to 10
> > > jiffies
> > > > on
> > > > > > a 1000 Hz kernel. (I don't know if it makes any difference for the
> > > kernel
> > > > > > scheduler if the timer crosses a jiffy border or not.)
> > > > > >
> > > > > > 10 ms looks like an eternity.
> > > > >
> > > > > Agree. It is only for functional testing, not for production!
> > > >
> > > > Realtime thread won't make any sense if we have to insert a long sleep.
> > > 
> > > It seems David came to our rescue here!
> > > 
> > > I have just tried running our test again with prctl(PR_SET_TIMERSLACK) of 1
> > > ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca. 2.5 us.
> > > 
> > > The timeout parameter to epoll_wait() is in milliseconds, which is useless for
> > > low-latency.
> > > Perhaps real-time threads can be used with epoll() combined with timerfd for
> > > nanosecond resolution timeout.
> > 
> > Or epoll_pwait2(), which has nanosecond resolution timeout.
> > 
> > Unfortunately, rte_epoll_wait() is not an experimental API anymore, so we cannot change its timeout parameter from milliseconds to micro- or nanoseconds. We would have to introduce a new API for this.
> > 
> 
> Just an idea - can we change the timeout parameter to float rather than int,
> and then use function versioning for backward compatibility for any
> binaries passing int?
> That way the actual meaning of the parameter doesn't change, but it still
> allows sub-millisecond values (all-be-it with some loss of accuracy due to
> float).

Sorry I'm not following why you want to use rte_epoll_wait()?

If the realtime thread has some blocking system calls,
no sleep is needed I think.
For average realtime thread, I suggest the API rte_thread_yield_realtime()
which could wait for 1 ms or less by default.
For smaller sleep, we can use PR_SET_TIMERSLACK and rte_delay_us_sleep().
If we provide an API for PR_SET_TIMERSLACK, we could adapt the duration
of rte_thread_yield_realtime() dynamically after calling prctl().
  
Stephen Hemminger Oct. 26, 2023, 4:28 p.m. UTC | #21
On Thu, 26 Oct 2023 16:08:02 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > > In our recent tests, nanosleep() itself took around 50 us. So you need to  
> > sleep longer than that for your thread not to be runnable when the nanosleep()
> > wakes up again, because 50 us has already passed in "nanosleep overhead".  
> > > 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on  
> > a 1000 Hz kernel. (I don't know if it makes any difference for the kernel
> > scheduler if the timer crosses a jiffy border or not.)
> > 
> > 10 ms looks like an eternity.  
> 
> Agree. It is only for functional testing, not for production!

To be safe the sleep has to be longer than the system clock tick.
Most systems are built today with HZ=250 but really should be using HZ=1000
on modern CPU's.
  
Morten Brørup Oct. 26, 2023, 4:50 p.m. UTC | #22
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 26 October 2023 18.07
> 
> 26/10/2023 17:54, Bruce Richardson:
> > On Thu, Oct 26, 2023 at 04:59:51PM +0200, Morten Brørup wrote:
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Thursday, 26 October 2023 16.50
> > > >
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Thursday, 26 October 2023 16.31
> > > > >
> > > > > 26/10/2023 16:08, Morten Brørup:
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > Sent: Thursday, 26 October 2023 16.05
> > > > > > >
> > > > > > > 26/10/2023 15:57, Morten Brørup:
> > > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > > Sent: Thursday, 26 October 2023 15.45
> > > > > > > > >
> > > > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > > > > > >
> > > > > > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > > > > > because of problems they can cause.
> > > > > > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > > > > > so real-time can be allowed on all systems.
> > > > > > > > > >
> > > > > > > > > > Unit test is failing:
> > > > > > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01
> s
> > > > > > > > > >
> > > > > > > > > > It is seen in only 1 target (maybe the failure
> occurence is
> > > > random):
> > > > > > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > > > > > >   Fedora 37 (ARM)          | PASS
> > > > > > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > > > > > >   Fedora 38 (ARM)          | PASS
> > > > > > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > > > > > >
> > > > > > > > > > I need to send a v4 with new implementation and better
> comments.
> > > > > > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in
> case it makes
> > > > a
> > > > > > > > > > difference.
> > > > > > > > >
> > > > > > > > > It will not make a difference. The kernel will go
> through the
> > > > sleeping
> > > > > > > steps,
> > > > > > > > > then wake up again and see the real-time thread is ready
> to run, and
> > > > > then
> > > > > > > > > immediately schedule it.
> > > > > > > > >
> > > > > > > > > For testing purposes, consider sleeping 10 milliseconds
> or something
> > > > > > > > > significant like that.
> > > > > > > >
> > > > > > > > A bit more details...
> > > > > > > >
> > > > > > > > In our recent tests, nanosleep() itself took around 50 us.
> So you need
> > > > > to
> > > > > > > sleep longer than that for your thread not to be runnable
> when the
> > > > > nanosleep()
> > > > > > > wakes up again, because 50 us has already passed in
> "nanosleep
> > > > overhead".
> > > > > > > > 10 milliseconds provides plenty of margin, and corresponds
> to 10
> > > > jiffies
> > > > > on
> > > > > > > a 1000 Hz kernel. (I don't know if it makes any difference
> for the
> > > > kernel
> > > > > > > scheduler if the timer crosses a jiffy border or not.)
> > > > > > >
> > > > > > > 10 ms looks like an eternity.
> > > > > >
> > > > > > Agree. It is only for functional testing, not for production!
> > > > >
> > > > > Realtime thread won't make any sense if we have to insert a long
> sleep.
> > > >
> > > > It seems David came to our rescue here!
> > > >
> > > > I have just tried running our test again with
> prctl(PR_SET_TIMERSLACK) of 1
> > > > ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca.
> 2.5 us.
> > > >
> > > > The timeout parameter to epoll_wait() is in milliseconds, which is
> useless for
> > > > low-latency.
> > > > Perhaps real-time threads can be used with epoll() combined with
> timerfd for
> > > > nanosecond resolution timeout.
> > >
> > > Or epoll_pwait2(), which has nanosecond resolution timeout.
> > >
> > > Unfortunately, rte_epoll_wait() is not an experimental API anymore,
> so we cannot change its timeout parameter from milliseconds to micro- or
> nanoseconds. We would have to introduce a new API for this.
> > >
> >
> > Just an idea - can we change the timeout parameter to float rather
> than int,
> > and then use function versioning for backward compatibility for any
> > binaries passing int?
> > That way the actual meaning of the parameter doesn't change, but it
> still
> > allows sub-millisecond values (all-be-it with some loss of accuracy
> due to
> > float).

Too exotic for my taste. I would rather introduce rte_epoll_wait_ns() with timeout in nanoseconds than pass a float.

> 
> Sorry I'm not following why you want to use rte_epoll_wait()?

I don't have experience with it yet, but it seems to be the official DPDK API for blocking I/O system call.

> 
> If the realtime thread has some blocking system calls,
> no sleep is needed I think.

Correct.

> For average realtime thread, I suggest the API
> rte_thread_yield_realtime()
> which could wait for 1 ms or less by default.

If we introduce a yield() function, it should yield according to the O/S scheduling policy, e.g. the rest of the time slice allocated to the thread by the O/S scheduler (although that might not be available for real-time prioritized threads in Linux). I don't think we can make it O/S agnostic.

I don't think it should wait a fixed amount of time - we already have rte_delay_us_sleep() for that.

In my experiments with power saving, I ended up with a varying sleep duration, depending on traffic load. The l3fwd-power example also uses a varying sleep duration depending on traffic load.

> For smaller sleep, we can use PR_SET_TIMERSLACK and
> rte_delay_us_sleep().

Agree.

> If we provide an API for PR_SET_TIMERSLACK, we could adapt the duration
> of rte_thread_yield_realtime() dynamically after calling prctl().
> 

I'm not sure exposing an API for PR_SET_TIMERSLACK is the right solution.

I would rather have the EAL set the timer slack to minimum (1 ns) at EAL initialization. An EAL command line parameter could be added to change the default from 1 ns.

Also, something similar needs to be done for Windows.
  
Thomas Monjalon Oct. 26, 2023, 7:51 p.m. UTC | #23
26/10/2023 18:50, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, 26 October 2023 18.07
> > 
> > 26/10/2023 17:54, Bruce Richardson:
> > > On Thu, Oct 26, 2023 at 04:59:51PM +0200, Morten Brørup wrote:
> > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > Sent: Thursday, 26 October 2023 16.50
> > > > >
> > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > Sent: Thursday, 26 October 2023 16.31
> > > > > >
> > > > > > 26/10/2023 16:08, Morten Brørup:
> > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > Sent: Thursday, 26 October 2023 16.05
> > > > > > > >
> > > > > > > > 26/10/2023 15:57, Morten Brørup:
> > > > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > > > Sent: Thursday, 26 October 2023 15.45
> > > > > > > > > >
> > > > > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > > > Sent: Thursday, 26 October 2023 15.37
> > > > > > > > > > >
> > > > > > > > > > > 25/10/2023 18:31, Thomas Monjalon:
> > > > > > > > > > > > Real-time thread priority was been forbidden on Unix
> > > > > > > > > > > > because of problems they can cause.
> > > > > > > > > > > > Warnings and helpers are added to avoid deadlocks,
> > > > > > > > > > > > so real-time can be allowed on all systems.
> > > > > > > > > > >
> > > > > > > > > > > Unit test is failing:
> > > > > > > > > > > DPDK:fast-tests / threads_autotest      TIMEOUT 600.01
> > s
> > > > > > > > > > >
> > > > > > > > > > > It is seen in only 1 target (maybe the failure
> > occurence is
> > > > > random):
> > > > > > > > > > >   Debian 11 (Buster) (ARM) | PASS
> > > > > > > > > > >   Fedora 37 (ARM)          | PASS
> > > > > > > > > > >   CentOS Stream 9 (ARM)    | FAIL
> > > > > > > > > > >   Fedora 38 (ARM)          | PASS
> > > > > > > > > > >   Fedora 38 (ARM Clang)    | PASS
> > > > > > > > > > >   Ubuntu 20.04 (ARM)       | PASS
> > > > > > > > > > >
> > > > > > > > > > > I need to send a v4 with new implementation and better
> > comments.
> > > > > > > > > > > The Unix sleep will be upgraded from 1 ns to 1 us in
> > case it makes
> > > > > a
> > > > > > > > > > > difference.
> > > > > > > > > >
> > > > > > > > > > It will not make a difference. The kernel will go
> > through the
> > > > > sleeping
> > > > > > > > steps,
> > > > > > > > > > then wake up again and see the real-time thread is ready
> > to run, and
> > > > > > then
> > > > > > > > > > immediately schedule it.
> > > > > > > > > >
> > > > > > > > > > For testing purposes, consider sleeping 10 milliseconds
> > or something
> > > > > > > > > > significant like that.
> > > > > > > > >
> > > > > > > > > A bit more details...
> > > > > > > > >
> > > > > > > > > In our recent tests, nanosleep() itself took around 50 us.
> > So you need
> > > > > > to
> > > > > > > > sleep longer than that for your thread not to be runnable
> > when the
> > > > > > nanosleep()
> > > > > > > > wakes up again, because 50 us has already passed in
> > "nanosleep
> > > > > overhead".
> > > > > > > > > 10 milliseconds provides plenty of margin, and corresponds
> > to 10
> > > > > jiffies
> > > > > > on
> > > > > > > > a 1000 Hz kernel. (I don't know if it makes any difference
> > for the
> > > > > kernel
> > > > > > > > scheduler if the timer crosses a jiffy border or not.)
> > > > > > > >
> > > > > > > > 10 ms looks like an eternity.
> > > > > > >
> > > > > > > Agree. It is only for functional testing, not for production!
> > > > > >
> > > > > > Realtime thread won't make any sense if we have to insert a long
> > sleep.
> > > > >
> > > > > It seems David came to our rescue here!
> > > > >
> > > > > I have just tried running our test again with
> > prctl(PR_SET_TIMERSLACK) of 1
> > > > > ns, and the nanosleep(1 ns) delay dropped from ca. 50 us to ca.
> > 2.5 us.
> > > > >
> > > > > The timeout parameter to epoll_wait() is in milliseconds, which is
> > useless for
> > > > > low-latency.
> > > > > Perhaps real-time threads can be used with epoll() combined with
> > timerfd for
> > > > > nanosecond resolution timeout.
> > > >
> > > > Or epoll_pwait2(), which has nanosecond resolution timeout.
> > > >
> > > > Unfortunately, rte_epoll_wait() is not an experimental API anymore,
> > so we cannot change its timeout parameter from milliseconds to micro- or
> > nanoseconds. We would have to introduce a new API for this.
> > > >
> > >
> > > Just an idea - can we change the timeout parameter to float rather
> > than int,
> > > and then use function versioning for backward compatibility for any
> > > binaries passing int?
> > > That way the actual meaning of the parameter doesn't change, but it
> > still
> > > allows sub-millisecond values (all-be-it with some loss of accuracy
> > due to
> > > float).
> 
> Too exotic for my taste. I would rather introduce rte_epoll_wait_ns() with timeout in nanoseconds than pass a float.
> 
> > 
> > Sorry I'm not following why you want to use rte_epoll_wait()?
> 
> I don't have experience with it yet, but it seems to be the official DPDK API for blocking I/O system call.
> 
> > 
> > If the realtime thread has some blocking system calls,
> > no sleep is needed I think.
> 
> Correct.
> 
> > For average realtime thread, I suggest the API
> > rte_thread_yield_realtime()
> > which could wait for 1 ms or less by default.
> 
> If we introduce a yield() function, it should yield according to the O/S scheduling policy, e.g. the rest of the time slice allocated to the thread by the O/S scheduler (although that might not be available for real-time prioritized threads in Linux). I don't think we can make it O/S agnostic.
> 
> I don't think it should wait a fixed amount of time - we already have rte_delay_us_sleep() for that.
> 
> In my experiments with power saving, I ended up with a varying sleep duration, depending on traffic load. The l3fwd-power example also uses a varying sleep duration depending on traffic load.

I feel you lost the goal of this: schedule other threads (especially kernel threads).
So we don't care about the sleep duration at all,
except we want it minimal while allowing scheduling.

> > For smaller sleep, we can use PR_SET_TIMERSLACK and
> > rte_delay_us_sleep().
> 
> Agree.
> 
> > If we provide an API for PR_SET_TIMERSLACK, we could adapt the duration
> > of rte_thread_yield_realtime() dynamically after calling prctl().
> > 
> 
> I'm not sure exposing an API for PR_SET_TIMERSLACK is the right solution.
> 
> I would rather have the EAL set the timer slack to minimum (1 ns) at EAL initialization. An EAL command line parameter could be added to change the default from 1 ns.
> 
> Also, something similar needs to be done for Windows.

Windows should be fine with Sleep(0).
  
Thomas Monjalon Oct. 26, 2023, 7:55 p.m. UTC | #24
26/10/2023 18:28, Stephen Hemminger:
> On Thu, 26 Oct 2023 16:08:02 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > > In our recent tests, nanosleep() itself took around 50 us. So you need to  
> > > sleep longer than that for your thread not to be runnable when the nanosleep()
> > > wakes up again, because 50 us has already passed in "nanosleep overhead".  
> > > > 10 milliseconds provides plenty of margin, and corresponds to 10 jiffies on  
> > > a 1000 Hz kernel. (I don't know if it makes any difference for the kernel
> > > scheduler if the timer crosses a jiffy border or not.)
> > > 
> > > 10 ms looks like an eternity.  
> > 
> > Agree. It is only for functional testing, not for production!
> 
> To be safe the sleep has to be longer than the system clock tick.
> Most systems are built today with HZ=250 but really should be using HZ=1000
> on modern CPU's.

If it has to be more than 1 ms,
we should mention it is a slow call
which may be skipped if the thread is already blocking on something else.
  
Thomas Monjalon Oct. 26, 2023, 8 p.m. UTC | #25
26/10/2023 16:19, Thomas Monjalon:
> Real-time thread priority was been forbidden on Unix
> because of problems they can cause.
> Warnings and helpers are added to avoid deadlocks,
> so real-time can be allowed on all systems.
> 
> Thomas Monjalon (2):
>   eal: add thread yield functions
>   eal/unix: allow creating thread with real-time priority
> 
> v1: no yield at all
> v2: more comments, sched_yield() and Sleep(0) on Windows
> v3: 2 yield functions with sleep in realtime version
> v4: runtime warning, longer sleep on Unix and lighter yield on Windows
> v5: fix build and increase Unix sleep to 1 ms
> 
> Thomas Monjalon (2):
>   eal: add thread yield functions
>   eal/unix: allow creating thread with real-time priority

Now there is a test failing on Windows:
http://mails.dpdk.org/archives/test-report/2023-October/491475.html
  
Stephen Hemminger Oct. 26, 2023, 9:10 p.m. UTC | #26
On Thu, 26 Oct 2023 21:55:24 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> > To be safe the sleep has to be longer than the system clock tick.
> > Most systems are built today with HZ=250 but really should be using HZ=1000
> > on modern CPU's.  
> 
> If it has to be more than 1 ms,
> we should mention it is a slow call
> which may be skipped if the thread is already blocking on something else.

A thread that is real time is not subject to timer slack.
Therefore even 100ns would probably work fine.


https://man7.org/linux/man-pages/man7/time.7.html

  High-resolution timers
       Before Linux 2.6.21, the accuracy of timer and sleep system calls
       (see below) was also limited by the size of the jiffy.

       Since Linux 2.6.21, Linux supports high-resolution timers (HRTs),
       optionally configurable via CONFIG_HIGH_RES_TIMERS.  On a system
       that supports HRTs, the accuracy of sleep and timer system calls
       is no longer constrained by the jiffy, but instead can be as
       accurate as the hardware allows (microsecond accuracy is typical
       of modern hardware).  You can determine whether high-resolution
       timers are supported by checking the resolution returned by a
       call to clock_getres(2) or looking at the "resolution" entries in
       /proc/timer_list.

       HRTs are not supported on all hardware architectures.  (Support
       is provided on x86, ARM, and PowerPC, among others.)
...
   Timer slack
       Since Linux 2.6.28, it is possible to control the "timer slack"
       value for a thread.  The timer slack is the length of time by
       which the kernel may delay the wake-up of certain system calls
       that block with a timeout.  Permitting this delay allows the
       kernel to coalesce wake-up events, thus possibly reducing the
       number of system wake-ups and saving power.  For more details,
       see the description of PR_SET_TIMERSLACK in prctl(2).

...
https://man7.org/linux/man-pages/man2/prctl.2.html


       PR_SET_TIMERSLACK (since Linux 2.6.28)
              Each thread has two associated timer slack values: a
              "default" value, and a "current" value.  This operation
              sets the "current" timer slack value for the calling
              thread.  arg2 is an unsigned long value, then maximum
              "current" value is ULONG_MAX and the minimum "current"
              value is 1.  If the nanosecond value supplied in arg2 is
              greater than zero, then the "current" value is set to this
              value.  If arg2 is equal to zero, the "current" timer
              slack is reset to the thread's "default" timer slack
              value.

              The "current" timer slack is used by the kernel to group
              timer expirations for the calling thread that are close to
              one another; as a consequence, timer expirations for the
              thread may be up to the specified number of nanoseconds
              late (but will never expire early).  Grouping timer
              expirations can help reduce system power consumption by
              minimizing CPU wake-ups.

              The timer expirations affected by timer slack are those
              set by select(2), pselect(2), poll(2), ppoll(2),
              epoll_wait(2), epoll_pwait(2), clock_nanosleep(2),
              nanosleep(2), and futex(2) (and thus the library functions
              implemented via futexes, including
              pthread_cond_timedwait(3), pthread_mutex_timedlock(3),
              pthread_rwlock_timedrdlock(3),
              pthread_rwlock_timedwrlock(3), and sem_timedwait(3)).

              Timer slack is not applied to threads that are scheduled
              under a real-time scheduling policy (see
              sched_setscheduler(2)).

              When a new thread is created, the two timer slack values
              are made the same as the "current" value of the creating
              thread.  Thereafter, a thread can adjust its "current"
              timer slack value via PR_SET_TIMERSLACK.  The "default"
              value can't be changed.  The timer slack values of init
              (PID 1), the ancestor of all processes, are 50,000
              nanoseconds (50 microseconds).  The timer slack value is
              inherited by a child created via fork(2), and is preserved
              across execve(2).

              Since Linux 4.6, the "current" timer slack value of any
              process can be examined and changed via the file
              /proc/pid/timerslack_ns.  See proc(5).
  
Tyler Retzlaff Oct. 26, 2023, 11:35 p.m. UTC | #27
On Thu, Oct 26, 2023 at 10:00:33PM +0200, Thomas Monjalon wrote:
> 26/10/2023 16:19, Thomas Monjalon:
> > Real-time thread priority was been forbidden on Unix
> > because of problems they can cause.
> > Warnings and helpers are added to avoid deadlocks,
> > so real-time can be allowed on all systems.
> > 
> > Thomas Monjalon (2):
> >   eal: add thread yield functions
> >   eal/unix: allow creating thread with real-time priority
> > 
> > v1: no yield at all
> > v2: more comments, sched_yield() and Sleep(0) on Windows
> > v3: 2 yield functions with sleep in realtime version
> > v4: runtime warning, longer sleep on Unix and lighter yield on Windows
> > v5: fix build and increase Unix sleep to 1 ms
> > 
> > Thomas Monjalon (2):
> >   eal: add thread yield functions
> >   eal/unix: allow creating thread with real-time priority
> 
> Now there is a test failing on Windows:
> http://mails.dpdk.org/archives/test-report/2023-October/491475.html

40/84 DPDK:fast-tests / lcores_autotest                TIMEOUT
626.38s

i don't think this test is reliable. if the failure is being reported
due to the lcores_autotest timeout.

i also see this test fail on linux for me occasionally when run in my
local lab.

ty
  
Morten Brørup Oct. 27, 2023, 7:19 a.m. UTC | #28
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, 26 October 2023 21.51
> 
> 26/10/2023 18:50, Morten Brørup:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, 26 October 2023 18.07

[...]

> > > For average realtime thread, I suggest the API
> > > rte_thread_yield_realtime()
> > > which could wait for 1 ms or less by default.
> >
> > If we introduce a yield() function, it should yield according to the
> O/S scheduling policy, e.g. the rest of the time slice allocated to the
> thread by the O/S scheduler (although that might not be available for
> real-time prioritized threads in Linux). I don't think we can make it
> O/S agnostic.
> >
> > I don't think it should wait a fixed amount of time - we already have
> rte_delay_us_sleep() for that.
> >
> > In my experiments with power saving, I ended up with a varying sleep
> duration, depending on traffic load. The l3fwd-power example also uses
> a varying sleep duration depending on traffic load.
> 
> I feel you lost the goal of this: schedule other threads (especially
> kernel threads).
> So we don't care about the sleep duration at all,
> except we want it minimal while allowing scheduling.

I do understand the goal: We want to relinquish the CPU core briefly, so the RT-priority thread gives way for other threads to run on the same CPU core.

But I don't think it is possible for an RT-priority thread to yield without waiting in a blocking call or waiting for an explicit timeout. This is the reason why Stephen wants the big fat warning sign when assigning RT-priority to a thread.

This will happen if we don't provide something for the RT-priority thread to wait for:

The application thread calls "yield()", e.g. sleep(0), to pass control to the kernel scheduler.
The kernel scheduler looks at all tasks available. The application thread has RT-priority and is ready to run, so the kernel scheduler will pick this thread for running. Only if other RT-priority threads are ready to run, perhaps one of those will be picked. No other threads will be picked, not even kernel threads (because RT-priority is higher than the priority of the kernel threads).

[...]

> Windows should be fine with Sleep(0).

Reading the Win32 API documentation [1] carefully, it think it will generally behave like the Linux scheduler. It says:

"A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution. This behavior changed starting with Windows Server 2003."

How I interpret this is: The Windows scheduler will not relinquish the CPU core to lower priority threads when calling Sleep(0). If the application thread has RT-priority, the scheduler will immediately pick the same thread. Only if other RT-priority threads are ready to run, one of those will be picked.

[1]: https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-sleep
  

Patch

diff --git a/app/test/test_threads.c b/app/test/test_threads.c
index 4ac3f2671a..4f7d10e074 100644
--- a/app/test/test_threads.c
+++ b/app/test/test_threads.c
@@ -97,21 +97,12 @@  test_thread_priority(void)
 		"Priority set mismatches priority get");
 
 	priority = RTE_THREAD_PRIORITY_REALTIME_CRITICAL;
-#ifndef RTE_EXEC_ENV_WINDOWS
-	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == ENOTSUP,
-		"Priority set to critical should fail");
-	RTE_TEST_ASSERT(rte_thread_get_priority(thread_id, &priority) == 0,
-		"Failed to get thread priority");
-	RTE_TEST_ASSERT(priority == RTE_THREAD_PRIORITY_NORMAL,
-		"Failed set to critical should have retained normal");
-#else
 	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == 0,
 		"Priority set to critical should succeed");
 	RTE_TEST_ASSERT(rte_thread_get_priority(thread_id, &priority) == 0,
 		"Failed to get thread priority");
 	RTE_TEST_ASSERT(priority == RTE_THREAD_PRIORITY_REALTIME_CRITICAL,
 		"Priority set mismatches priority get");
-#endif
 
 	priority = RTE_THREAD_PRIORITY_NORMAL;
 	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == 0,
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6debf54efb..d1f7cae7cd 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -815,7 +815,9 @@  Known Issues
 
   4. It MAY be used by preemptible multi-producer and/or preemptible multi-consumer pthreads whose scheduling policy are all SCHED_OTHER(cfs), SCHED_IDLE or SCHED_BATCH. User SHOULD be aware of the performance penalty before using it.
 
-  5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
+  5. It MUST not be used by multi-producer/consumer pthreads
+     whose scheduling policies are ``SCHED_FIFO``
+     or ``SCHED_RR`` (``RTE_THREAD_PRIORITY_REALTIME_CRITICAL``).
 
   Alternatively, applications can use the lock-free stack mempool handler. When
   considering this handler, note that:
diff --git a/lib/eal/include/rte_thread.h b/lib/eal/include/rte_thread.h
index 8da9d4d3fb..5e624015e4 100644
--- a/lib/eal/include/rte_thread.h
+++ b/lib/eal/include/rte_thread.h
@@ -59,7 +59,7 @@  enum rte_thread_priority {
 	RTE_THREAD_PRIORITY_NORMAL            = 0,
 	/**< normal thread priority, the default */
 	RTE_THREAD_PRIORITY_REALTIME_CRITICAL = 1,
-	/**< highest thread priority allowed */
+	/**< highest thread priority, use with caution */
 };
 
 /**
diff --git a/lib/eal/unix/rte_thread.c b/lib/eal/unix/rte_thread.c
index 36a21ab2f9..0ff291c6f1 100644
--- a/lib/eal/unix/rte_thread.c
+++ b/lib/eal/unix/rte_thread.c
@@ -153,11 +153,6 @@  rte_thread_create(rte_thread_t *thread_id,
 			goto cleanup;
 		}
 
-		if (thread_attr->priority ==
-				RTE_THREAD_PRIORITY_REALTIME_CRITICAL) {
-			ret = ENOTSUP;
-			goto cleanup;
-		}
 		ret = thread_map_priority_to_os_value(thread_attr->priority,
 				&param.sched_priority, &policy);
 		if (ret != 0)
@@ -275,10 +270,6 @@  rte_thread_set_priority(rte_thread_t thread_id,
 	int policy;
 	int ret;
 
-	/* Realtime priority can cause crashes on non-Windows platforms. */
-	if (priority == RTE_THREAD_PRIORITY_REALTIME_CRITICAL)
-		return ENOTSUP;
-
 	ret = thread_map_priority_to_os_value(priority, &param.sched_priority,
 		&policy);
 	if (ret != 0)