[v6] eal: allow worker lcore stacks to be allocated from hugepage memory

Message ID 20220524195138.4963-1-donw@xsightlabs.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Series [v6] eal: allow worker lcore stacks to be allocated from hugepage memory

Checks

Context                          Check    Description
ci/checkpatch                    success  coding style OK
ci/Intel-compilation             success  Compilation OK
ci/iol-mellanox-Performance      success  Performance Testing PASS
ci/intel-Testing                 success  Testing PASS
ci/iol-intel-Performance         success  Performance Testing PASS
ci/iol-intel-Functional          success  Functional Testing PASS
ci/github-robot: build           success  github build: passed
ci/iol-aarch64-compile-testing   success  Testing PASS
ci/iol-aarch64-unit-testing      success  Testing PASS
ci/iol-x86_64-unit-testing       success  Testing PASS
ci/iol-abi-testing               success  Testing PASS
ci/iol-x86_64-compile-testing    success  Testing PASS

Commit Message

Don Wallwork May 24, 2022, 7:51 p.m. UTC
  Add support for using hugepages for worker lcore stack memory.  The
intent is to improve performance by reducing stack memory related TLB
misses and also by using memory local to the NUMA node of each lcore.

EAL option '--huge-worker-stack[=size]' (size in kbytes) is added to allow
the feature to be enabled at runtime.  If the size is not specified,
the system pthread stack size will be used.

Signed-off-by: Don Wallwork <donw@xsightlabs.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
---
 doc/guides/linux_gsg/eal_args.include.rst     |  6 ++
 .../prog_guide/env_abstraction_layer.rst      | 21 +++++++
 lib/eal/common/eal_common_options.c           | 35 +++++++++++
 lib/eal/common/eal_internal_cfg.h             |  4 ++
 lib/eal/common/eal_options.h                  |  2 +
 lib/eal/freebsd/eal.c                         |  6 ++
 lib/eal/linux/eal.c                           | 61 ++++++++++++++++++-
 lib/eal/windows/eal.c                         |  6 ++
 8 files changed, 139 insertions(+), 2 deletions(-)
  

Comments

Kathleen Capella June 1, 2022, 12:05 a.m. UTC | #1
Tested okay on N1SDP platform with testpmd (mac and io fwd) and l3fwd.

Tested-by: Kathleen Capella <kathleen.capella@arm.com>

> -----Original Message-----
> From: Don Wallwork <donw@xsightlabs.com>
> Sent: Tuesday, May 24, 2022 2:52 PM
> To: dev@dpdk.org
> Cc: donw@xsightlabs.com; stephen@networkplumber.org;
> fengchengwen@huawei.com; mb@smartsharesystems.com;
> anatoly.burakov@intel.com; dmitry.kozliuk@gmail.com;
> bruce.richardson@intel.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>;
> haiyue.wang@intel.com; Kathleen Capella <Kathleen.Capella@arm.com>
> Subject: [PATCH v6] eal: allow worker lcore stacks to be allocated from
> hugepage memory
> 
> Add support for using hugepages for worker lcore stack memory.  The intent is
> to improve performance by reducing stack memory related TLB misses and also
> by using memory local to the NUMA node of each lcore.
> 
> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow the
> feature to be enabled at runtime.  If the size is not specified, the system pthread
> stack size will be used.
> 
> Signed-off-by: Don Wallwork <donw@xsightlabs.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
>  doc/guides/linux_gsg/eal_args.include.rst     |  6 ++
>  .../prog_guide/env_abstraction_layer.rst      | 21 +++++++
>  lib/eal/common/eal_common_options.c           | 35 +++++++++++
>  lib/eal/common/eal_internal_cfg.h             |  4 ++
>  lib/eal/common/eal_options.h                  |  2 +
>  lib/eal/freebsd/eal.c                         |  6 ++
>  lib/eal/linux/eal.c                           | 61 ++++++++++++++++++-
>  lib/eal/windows/eal.c                         |  6 ++
>  8 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/linux_gsg/eal_args.include.rst b/doc/guides/linux_gsg/eal_args.include.rst
> index 3549a0cf56..9cfbf7de84 100644
> --- a/doc/guides/linux_gsg/eal_args.include.rst
> +++ b/doc/guides/linux_gsg/eal_args.include.rst
> @@ -116,6 +116,12 @@ Memory-related options
> 
>      Force IOVA mode to a specific value.
> 
> +*   ``--huge-worker-stack[=size]``
> +
> +    Allocate worker stack memory from hugepage memory. Stack size defaults
> +    to system pthread stack size unless the optional size (in kbytes) is
> +    specified.
> +
>  Debugging options
>  ~~~~~~~~~~~~~~~~~
> 
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 5f0748fba1..e74516f0cf 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -329,6 +329,27 @@ Another option is to use bigger page sizes. Since fewer pages are required to
>  cover the same memory area, fewer file descriptors will be stored internally
>  by EAL.
> 
> +.. _huge-worker-stack:
> +
> +Hugepage Worker Stacks
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +When the ``--huge-worker-stack[=size]`` EAL option is specified, worker
> +thread stacks are allocated from hugepage memory local to the NUMA node
> +of the thread. Worker stack size defaults to system pthread stack size
> +if the optional size parameter is not specified.
> +
> +.. warning::
> +    Stacks allocated from hugepage memory are not protected by guard
> +    pages. Worker stacks must be sufficiently sized to prevent stack
> +    overflow when this option is used.
> +
> +    As with normal thread stacks, hugepage worker thread stack size is
> +    fixed and is not dynamically resized. Therefore, an application that
> +    is free of stack page faults under a given load should be safe with
> +    hugepage worker thread stacks given the same thread stack size and
> +    loading conditions.
> +
>  Support for Externally Allocated Memory
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
> index f247a42455..02e59051e8 100644
> --- a/lib/eal/common/eal_common_options.c
> +++ b/lib/eal/common/eal_common_options.c
> @@ -103,6 +103,7 @@ eal_long_options[] = {
>  	{OPT_TELEMETRY,         0, NULL, OPT_TELEMETRY_NUM        },
>  	{OPT_NO_TELEMETRY,      0, NULL, OPT_NO_TELEMETRY_NUM     },
>  	{OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
> +	{OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM     },
> 
>  	{0,                     0, NULL, 0                        }
>  };
> @@ -1618,6 +1619,26 @@ eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
>  	return -1;
>  }
> 
> +static int
> +eal_parse_huge_worker_stack(const char *arg, size_t *huge_worker_stack_size)
> +{
> +	size_t worker_stack_size;
> +	char *end;
> +
> +	if (arg == NULL || arg[0] == '\0') {
> +		*huge_worker_stack_size = WORKER_STACK_SIZE_FROM_OS;
> +		return 0;
> +	}
> +	errno = 0;
> +	worker_stack_size = strtoul(arg, &end, 10);
> +	if (errno || end == NULL || worker_stack_size == 0 ||
> +	    worker_stack_size >= (size_t)-1 / 1024)
> +		return -1;
> +
> +	*huge_worker_stack_size = worker_stack_size * 1024;
> +	return 0;
> +}
> +
>  int
>  eal_parse_common_option(int opt, const char *optarg,
>  			struct internal_config *conf)
> @@ -1921,6 +1942,15 @@ eal_parse_common_option(int opt, const char *optarg,
>  		}
>  		break;
> 
> +	case OPT_HUGE_WORKER_STACK_NUM:
> +		if (eal_parse_huge_worker_stack(optarg,
> +						&conf->huge_worker_stack_size) < 0) {
> +			RTE_LOG(ERR, EAL, "invalid parameter for --"
> +				OPT_HUGE_WORKER_STACK"\n");
> +			return -1;
> +		}
> +		break;
> +
>  	/* don't know what to do, leave this to caller */
>  	default:
>  		return 1;
> @@ -2235,5 +2265,10 @@ eal_common_usage(void)
>  	       "  --"OPT_NO_PCI"            Disable PCI\n"
>  	       "  --"OPT_NO_HPET"           Disable HPET\n"
>  	       "  --"OPT_NO_SHCONF"         No shared config (mmap'd files)\n"
> +	       "  --"OPT_HUGE_WORKER_STACK"[=size]\n"
> +	       "                      Allocate worker thread stacks from\n"
> +	       "                      hugepage memory. Size is in units of\n"
> +	       "                      kbytes and defaults to system thread\n"
> +	       "                      stack size if not specified.\n"
>  	       "\n", RTE_MAX_LCORE);
>  }
> diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
> index b71faadd18..5e154967e4 100644
> --- a/lib/eal/common/eal_internal_cfg.h
> +++ b/lib/eal/common/eal_internal_cfg.h
> @@ -48,6 +48,9 @@ struct hugepage_file_discipline {
>  	bool unlink_existing;
>  };
> 
> +/** Worker hugepage stack size should default to OS value. */
> +#define WORKER_STACK_SIZE_FROM_OS ((size_t)~0)
> +
>  /**
>   * internal configuration
>   */
> @@ -102,6 +105,7 @@ struct internal_config {
>  	unsigned int no_telemetry; /**< true to disable Telemetry */
>  	struct simd_bitwidth max_simd_bitwidth;
>  	/**< max simd bitwidth path to use */
> +	size_t huge_worker_stack_size; /**< worker thread stack size */
>  };
> 
>  void eal_reset_internal_config(struct internal_config *internal_cfg);
> diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
> index 8e4f7202a2..3cc9cb6412 100644
> --- a/lib/eal/common/eal_options.h
> +++ b/lib/eal/common/eal_options.h
> @@ -87,6 +87,8 @@ enum {
>  	OPT_NO_TELEMETRY_NUM,
>  #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
>  	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
> +#define OPT_HUGE_WORKER_STACK  "huge-worker-stack"
> +	OPT_HUGE_WORKER_STACK_NUM,
> 
>  	OPT_LONG_MAX_NUM
>  };
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index a6b20960f2..7368956649 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -795,6 +795,12 @@ rte_eal_init(int argc, char **argv)
>  		config->main_lcore, (uintptr_t)pthread_self(), cpuset,
>  		ret == 0 ? "" : "...");
> 
> +	if (internal_conf->huge_worker_stack_size != 0) {
> +		rte_eal_init_alert("Hugepage worker stacks not supported");
> +		rte_errno = ENOTSUP;
> +		return -1;
> +	}
> +
>  	RTE_LCORE_FOREACH_WORKER(i) {
> 
>  		/*
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 1ef263434a..d28a0fdb78 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -857,6 +857,64 @@ is_iommu_enabled(void)
>  	return n > 2;
>  }
> 
> +static int
> +eal_worker_thread_create(struct internal_config *internal_conf,
> +			 int lcore_id)
> +{
> +	pthread_attr_t attr;
> +	size_t stack_size;
> +	void *stack_ptr;
> +	int ret;
> +
> +	if (internal_conf->huge_worker_stack_size == 0)
> +		return pthread_create(&lcore_config[lcore_id].thread_id,
> +				      NULL,
> +				      eal_thread_loop,
> +				      (void *)(uintptr_t)lcore_id);
> +
> +	/* Allocate NUMA aware stack memory and set pthread attributes */
> +	if (pthread_attr_init(&attr) != 0) {
> +		rte_eal_init_alert("Cannot init pthread attributes");
> +		rte_errno = EFAULT;
> +		return -1;
> +	}
> +	if (internal_conf->huge_worker_stack_size == WORKER_STACK_SIZE_FROM_OS) {
> +		if (pthread_attr_getstacksize(&attr, &stack_size) != 0) {
> +			rte_errno = EFAULT;
> +			return -1;
> +		}
> +	} else {
> +		stack_size = internal_conf->huge_worker_stack_size;
> +	}
> +	stack_ptr = rte_zmalloc_socket("lcore_stack",
> +				       stack_size,
> +				       RTE_CACHE_LINE_SIZE,
> +				       rte_lcore_to_socket_id(lcore_id));
> +
> +	if (stack_ptr == NULL) {
> +		rte_eal_init_alert("Cannot allocate worker lcore stack memory");
> +		rte_errno = ENOMEM;
> +		return -1;
> +	}
> +
> +	if (pthread_attr_setstack(&attr, stack_ptr, stack_size) != 0) {
> +		rte_eal_init_alert("Cannot set pthread stack attributes");
> +		rte_errno = EFAULT;
> +		return -1;
> +	}
> +
> +	ret = pthread_create(&lcore_config[lcore_id].thread_id, &attr,
> +			     eal_thread_loop,
> +			     (void *)(uintptr_t)lcore_id);
> +
> +	if (pthread_attr_destroy(&attr) != 0) {
> +		rte_eal_init_alert("Cannot destroy pthread attributes");
> +		rte_errno = EFAULT;
> +		return -1;
> +	}
> +	return ret;
> +}
> +
>  /* Launch threads, called at application init(). */
>  int
>  rte_eal_init(int argc, char **argv)
> @@ -1144,8 +1202,7 @@ rte_eal_init(int argc, char **argv)
>  		lcore_config[i].state = WAIT;
> 
>  		/* create a thread for each lcore */
> -		ret = pthread_create(&lcore_config[i].thread_id, NULL,
> -				     eal_thread_loop, (void *)(uintptr_t)i);
> +		ret = eal_worker_thread_create(internal_conf, i);
>  		if (ret != 0)
>  			rte_panic("Cannot create thread\n");
> 
> diff --git a/lib/eal/windows/eal.c b/lib/eal/windows/eal.c
> index 122de2a319..5cd4a45872 100644
> --- a/lib/eal/windows/eal.c
> +++ b/lib/eal/windows/eal.c
> @@ -416,6 +416,12 @@ rte_eal_init(int argc, char **argv)
>  		config->main_lcore, (uintptr_t)pthread_self(), cpuset,
>  		ret == 0 ? "" : "...");
> 
> +	if (internal_conf->huge_worker_stack_size != 0) {
> +		rte_eal_init_alert("Hugepage worker stacks not supported");
> +		rte_errno = ENOTSUP;
> +		return -1;
> +	}
> +
>  	RTE_LCORE_FOREACH_WORKER(i) {
> 
>  		/*
> --
> 2.17.1
  
David Marchand June 20, 2022, 8:35 a.m. UTC | #2
On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
>
> Add support for using hugepages for worker lcore stack memory.  The
> intent is to improve performance by reducing stack memory related TLB
> misses and also by using memory local to the NUMA node of each lcore.
> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
> the feature to be enabled at runtime.  If the size is not specified,
> the system pthread stack size will be used.

- About the name of the option... I don't have a better name.

Just want to highlight, that what this patch does is use the DPDK
memory allocator for the stack memory.
It happens that DPDK memory allocator is primarily used with
hugepages, but this is not systematic for example with the "no-huge"
mode of the DPDK memory allocator.

IOW, in this patch current form, you can still run as:

# dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...

Opinions?


- This patch adds one more EAL flag, we need some unit test (even if basic).


Comments below:

>
> Signed-off-by: Don Wallwork <donw@xsightlabs.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
>  doc/guides/linux_gsg/eal_args.include.rst     |  6 ++
>  .../prog_guide/env_abstraction_layer.rst      | 21 +++++++
>  lib/eal/common/eal_common_options.c           | 35 +++++++++++
>  lib/eal/common/eal_internal_cfg.h             |  4 ++
>  lib/eal/common/eal_options.h                  |  2 +
>  lib/eal/freebsd/eal.c                         |  6 ++
>  lib/eal/linux/eal.c                           | 61 ++++++++++++++++++-
>  lib/eal/windows/eal.c                         |  6 ++
>  8 files changed, 139 insertions(+), 2 deletions(-)
>
> diff --git a/doc/guides/linux_gsg/eal_args.include.rst b/doc/guides/linux_gsg/eal_args.include.rst
> index 3549a0cf56..9cfbf7de84 100644
> --- a/doc/guides/linux_gsg/eal_args.include.rst
> +++ b/doc/guides/linux_gsg/eal_args.include.rst
> @@ -116,6 +116,12 @@ Memory-related options
>
>      Force IOVA mode to a specific value.
>
> +*   ``--huge-worker-stack[=size]``
> +
> +    Allocate worker stack memory from hugepage memory. Stack size defaults
> +    to system pthread stack size unless the optional size (in kbytes) is
> +    specified.
> +
>  Debugging options
>  ~~~~~~~~~~~~~~~~~
>
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 5f0748fba1..e74516f0cf 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -329,6 +329,27 @@ Another option is to use bigger page sizes. Since fewer pages are required to
>  cover the same memory area, fewer file descriptors will be stored internally
>  by EAL.
>
> +.. _huge-worker-stack:

There is nothing pointing to this reference.
It can be removed.


> +
> +Hugepage Worker Stacks
> +^^^^^^^^^^^^^^^^^^^^^^
> +
> +When the ``--huge-worker-stack[=size]`` EAL option is specified, worker
> +thread stacks are allocated from hugepage memory local to the NUMA node
> +of the thread. Worker stack size defaults to system pthread stack size
> +if the optional size parameter is not specified.
> +
> +.. warning::
> +    Stacks allocated from hugepage memory are not protected by guard
> +    pages. Worker stacks must be sufficiently sized to prevent stack
> +    overflow when this option is used.
> +
> +    As with normal thread stacks, hugepage worker thread stack size is
> +    fixed and is not dynamically resized. Therefore, an application that
> +    is free of stack page faults under a given load should be safe with
> +    hugepage worker thread stacks given the same thread stack size and
> +    loading conditions.
> +
>  Support for Externally Allocated Memory
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
> index f247a42455..02e59051e8 100644
> --- a/lib/eal/common/eal_common_options.c
> +++ b/lib/eal/common/eal_common_options.c
> @@ -103,6 +103,7 @@ eal_long_options[] = {
>         {OPT_TELEMETRY,         0, NULL, OPT_TELEMETRY_NUM        },
>         {OPT_NO_TELEMETRY,      0, NULL, OPT_NO_TELEMETRY_NUM     },
>         {OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
> +       {OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM     },
>
>         {0,                     0, NULL, 0                        }
>  };
> @@ -1618,6 +1619,26 @@ eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
>         return -1;
>  }
>
> +static int
> +eal_parse_huge_worker_stack(const char *arg, size_t *huge_worker_stack_size)
> +{
> +       size_t worker_stack_size;

Nit: strtoul returns an unsigned long int.
POSIX defines size_t as an unsigned integer.
It does not specify though that size_t can handle a long unsigned integer.
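
A minimal sketch of one way to address this nit, assuming the size is given in kB as in the patch: parse into an unsigned long, range-check against SIZE_MAX, and only then narrow to size_t. The helper name parse_stack_size_kb is hypothetical and is not part of the patch. Unlike the patch, this sketch also rejects trailing characters and a leading sign or whitespace, which is an added assumption about the desired behaviour.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse a size given in kB
 * and store it in bytes in *out. Returns 0 on success, -1 on error.
 */
static int
parse_stack_size_kb(const char *arg, size_t *out)
{
	unsigned long kb;
	char *end;

	if (arg == NULL || arg[0] == '\0')
		return -1;

	/* strtoul() accepts leading whitespace and '-', so require a digit. */
	if (arg[0] < '0' || arg[0] > '9')
		return -1;

	errno = 0;
	kb = strtoul(arg, &end, 10);
	/* Reject overflow (ERANGE), trailing garbage and a zero size. */
	if (errno != 0 || end == arg || *end != '\0' || kb == 0)
		return -1;

	/* Range-check in the unsigned long domain before narrowing. */
	if (kb > SIZE_MAX / 1024)
		return -1;

	*out = (size_t)kb * 1024;
	return 0;
}
```

With this helper, "16" yields 16384 bytes, while "0", "16k" and "-1" are rejected.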


> +       char *end;
> +
> +       if (arg == NULL || arg[0] == '\0') {
> +               *huge_worker_stack_size = WORKER_STACK_SIZE_FROM_OS;

We should resolve (in theory, via a OS-specific helper, but this is
overkill if we move this to Linux EAL) the default stack size here,
once and for all.
That simplifies the EAL worker thread create helper.
WORKER_STACK_SIZE_FROM_OS is then unneeded.

And this parser needs some debug level EAL log, so that users know the
feature is engaged.


> +               return 0;
> +       }
> +       errno = 0;
> +       worker_stack_size = strtoul(arg, &end, 10);
> +       if (errno || end == NULL || worker_stack_size == 0 ||
> +           worker_stack_size >= (size_t)-1 / 1024)
> +               return -1;
> +
> +       *huge_worker_stack_size = worker_stack_size * 1024;

With previous comments applied, this could look like:

+static int
+eal_parse_huge_worker_stack(const char *arg)
+{
+       struct internal_config *cfg = eal_get_internal_configuration();
+
+       if (arg == NULL || arg[0] == '\0') {
+               pthread_attr_t attr;
+               int ret;
+
+               if (pthread_attr_init(&attr) != 0) {
+                       RTE_LOG(ERR, EAL, "Could not retrieve default stack size\n");
+                       return -1;
+               }
+               ret = pthread_attr_getstacksize(&attr, &cfg->huge_worker_stack_size);
+               pthread_attr_destroy(&attr);
+               if (ret != 0) {
+                       RTE_LOG(ERR, EAL, "Could not retrieve default stack size\n");
+                       return -1;
+               }
+       } else {
+               unsigned long int stack_size;
+               char *end;
+
+               errno = 0;
+               stack_size = strtoul(arg, &end, 10);
+               if (errno || end == NULL || stack_size == 0 ||
+                               stack_size >= (size_t)-1 / 1024)
+                       return -1;
+
+               cfg->huge_worker_stack_size = stack_size * 1024;
+       }
+
+       RTE_LOG(DEBUG, EAL, "EAL worker threads will use %zu kB of DPDK memory as stack.\n",
+               cfg->huge_worker_stack_size / 1024);
+       return 0;
+}



> +       return 0;
> +}
> +
>  int
>  eal_parse_common_option(int opt, const char *optarg,
>                         struct internal_config *conf)
> @@ -1921,6 +1942,15 @@ eal_parse_common_option(int opt, const char *optarg,
>                 }
>                 break;
>
> +       case OPT_HUGE_WORKER_STACK_NUM:
> +               if (eal_parse_huge_worker_stack(optarg,
> +                                               &conf->huge_worker_stack_size) < 0) {
> +                       RTE_LOG(ERR, EAL, "invalid parameter for --"
> +                               OPT_HUGE_WORKER_STACK"\n");
> +                       return -1;
> +               }
> +               break;
> +

This parser and calling it should be moved out of the common options,
and moved to the Linux implementation (around
https://git.dpdk.org/dpdk/tree/lib/eal/linux/eal.c#n715) as it is a
Linux-only option at the moment.
Doing so, there is nothing to add to FreeBSD and Windows EAL.


>         /* don't know what to do, leave this to caller */
>         default:
>                 return 1;
> @@ -2235,5 +2265,10 @@ eal_common_usage(void)
>                "  --"OPT_NO_PCI"            Disable PCI\n"
>                "  --"OPT_NO_HPET"           Disable HPET\n"
>                "  --"OPT_NO_SHCONF"         No shared config (mmap'd files)\n"
> +              "  --"OPT_HUGE_WORKER_STACK"[=size]\n"
> +              "                      Allocate worker thread stacks from\n"
> +              "                      hugepage memory. Size is in units of\n"
> +              "                      kbytes and defaults to system thread\n"
> +              "                      stack size if not specified.\n"
>                "\n", RTE_MAX_LCORE);
>  }
> diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
> index b71faadd18..5e154967e4 100644
> --- a/lib/eal/common/eal_internal_cfg.h
> +++ b/lib/eal/common/eal_internal_cfg.h
> @@ -48,6 +48,9 @@ struct hugepage_file_discipline {
>         bool unlink_existing;
>  };
>
> +/** Worker hugepage stack size should default to OS value. */
> +#define WORKER_STACK_SIZE_FROM_OS ((size_t)~0)

No need for this limit, as per previous comment.


> +
>  /**
>   * internal configuration
>   */
> @@ -102,6 +105,7 @@ struct internal_config {
>         unsigned int no_telemetry; /**< true to disable Telemetry */
>         struct simd_bitwidth max_simd_bitwidth;
>         /**< max simd bitwidth path to use */
> +       size_t huge_worker_stack_size; /**< worker thread stack size */
>  };
>
>  void eal_reset_internal_config(struct internal_config *internal_cfg);
> diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
> index 8e4f7202a2..3cc9cb6412 100644
> --- a/lib/eal/common/eal_options.h
> +++ b/lib/eal/common/eal_options.h
> @@ -87,6 +87,8 @@ enum {
>         OPT_NO_TELEMETRY_NUM,
>  #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
>         OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
> +#define OPT_HUGE_WORKER_STACK  "huge-worker-stack"
> +       OPT_HUGE_WORKER_STACK_NUM,
>
>         OPT_LONG_MAX_NUM
>  };
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index a6b20960f2..7368956649 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -795,6 +795,12 @@ rte_eal_init(int argc, char **argv)
>                 config->main_lcore, (uintptr_t)pthread_self(), cpuset,
>                 ret == 0 ? "" : "...");
>
> +       if (internal_conf->huge_worker_stack_size != 0) {
> +               rte_eal_init_alert("Hugepage worker stacks not supported");
> +               rte_errno = ENOTSUP;
> +               return -1;
> +       }
> +

As previously mentioned, this is unneeded if the option is parsed only in
lib/eal/linux/eal.c.

>         RTE_LCORE_FOREACH_WORKER(i) {
>
>                 /*
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 1ef263434a..d28a0fdb78 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -857,6 +857,64 @@ is_iommu_enabled(void)
>         return n > 2;
>  }
>
> +static int
> +eal_worker_thread_create(struct internal_config *internal_conf,
> +                        int lcore_id)
> +{
> +       pthread_attr_t attr;
> +       size_t stack_size;
> +       void *stack_ptr;
> +       int ret;
> +
> +       if (internal_conf->huge_worker_stack_size == 0)
> +               return pthread_create(&lcore_config[lcore_id].thread_id,
> +                                     NULL,
> +                                     eal_thread_loop,
> +                                     (void *)(uintptr_t)lcore_id);

If you invert the branch here (checking that stack_size != 0), all the
stack setup can then be put under a branch and we have a more readable
unified call to pthread_create.
If more pthread attributes were to be added in the future, the code
would be ready too.


> +
> +       /* Allocate NUMA aware stack memory and set pthread attributes */
> +       if (pthread_attr_init(&attr) != 0) {
> +               rte_eal_init_alert("Cannot init pthread attributes");
> +               rte_errno = EFAULT;
> +               return -1;
> +       }
> +       if (internal_conf->huge_worker_stack_size == WORKER_STACK_SIZE_FROM_OS) {
> +               if (pthread_attr_getstacksize(&attr, &stack_size) != 0) {
> +                       rte_errno = EFAULT;
> +                       return -1;
> +               }
> +       } else {
> +               stack_size = internal_conf->huge_worker_stack_size;
> +       }
> +       stack_ptr = rte_zmalloc_socket("lcore_stack",
> +                                      stack_size,
> +                                      RTE_CACHE_LINE_SIZE,
> +                                      rte_lcore_to_socket_id(lcore_id));

This stack_ptr is "leaked" if any later branch in this function ends
up in failure.


> +
> +       if (stack_ptr == NULL) {
> +               rte_eal_init_alert("Cannot allocate worker lcore stack memory");
> +               rte_errno = ENOMEM;
> +               return -1;
> +       }
> +
> +       if (pthread_attr_setstack(&attr, stack_ptr, stack_size) != 0) {
> +               rte_eal_init_alert("Cannot set pthread stack attributes");
> +               rte_errno = EFAULT;
> +               return -1;
> +       }
> +
> +       ret = pthread_create(&lcore_config[lcore_id].thread_id, &attr,
> +                            eal_thread_loop,
> +                            (void *)(uintptr_t)lcore_id);
> +
> +       if (pthread_attr_destroy(&attr) != 0) {
> +               rte_eal_init_alert("Cannot destroy pthread attributes");
> +               rte_errno = EFAULT;
> +               return -1;
> +       }
> +       return ret;
> +}

With previous comments applied, this could look like:

+static int
+eal_worker_thread_create(unsigned int lcore_id)
+{
+       pthread_attr_t *attrp = NULL;
+       void *stack_ptr = NULL;
+       pthread_attr_t attr;
+       size_t stack_size;
+       int ret = -1;
+
+       stack_size = eal_get_internal_configuration()->huge_worker_stack_size;
+       if (stack_size != 0) {
+
+               /* Allocate NUMA aware stack memory and set pthread attributes */
+               stack_ptr = rte_zmalloc_socket("lcore_stack", stack_size,
+                       RTE_CACHE_LINE_SIZE, rte_lcore_to_socket_id(lcore_id));
+               if (stack_ptr == NULL) {
+                       rte_eal_init_alert("Cannot allocate worker lcore stack memory");
+                       rte_errno = ENOMEM;
+                       goto out;
+               }
+
+               if (pthread_attr_init(&attr) != 0) {
+                       rte_eal_init_alert("Cannot init pthread attributes");
+                       rte_errno = EFAULT;
+                       goto out;
+               }
+               attrp = &attr;
+
+               if (pthread_attr_setstack(attrp, stack_ptr, stack_size) != 0) {
+                       rte_eal_init_alert("Cannot set pthread stack attributes");
+                       rte_errno = EFAULT;
+                       goto out;
+               }
+       }
+
+       if (pthread_create(&lcore_config[lcore_id].thread_id, attrp,
+                       eal_thread_loop, (void *)(uintptr_t)lcore_id) == 0)
+               ret = 0;
+
+out:
+       if (ret != 0)
+               rte_free(stack_ptr);
+       if (attrp != NULL)
+               pthread_attr_destroy(attrp);
+       return ret;
+}
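
For readers who want to try this single-exit cleanup pattern outside of DPDK, here is a standalone sketch under stated assumptions: aligned_alloc() stands in for rte_zmalloc_socket(), so the memory is neither hugepage-backed nor NUMA-aware; the thread is joined before returning, which EAL workers are not; and the names run_on_private_stack, set_flag and STACK_ALIGN are hypothetical. It only illustrates that the stack buffer and the attribute object are released on every path.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_ALIGN 64 /* assumed alignment, mirrors a typical cache line */

/* Example thread body: sets the int pointed to by arg to 1. */
static void *
set_flag(void *arg)
{
	*(int *)arg = 1;
	return NULL;
}

/*
 * Run fn(arg) on a thread whose stack is a caller-allocated buffer.
 * Returns 0 on success, -1 on any failure. All resources are released
 * at the single exit label, whichever step failed.
 */
static int
run_on_private_stack(size_t stack_size, void *(*fn)(void *), void *arg)
{
	pthread_attr_t *attrp = NULL;
	pthread_attr_t attr;
	void *stack = NULL;
	pthread_t tid;
	int ret = -1;

	/* aligned_alloc() requires a size that is a multiple of the alignment. */
	if (stack_size == 0 || stack_size % STACK_ALIGN != 0)
		goto out;

	stack = aligned_alloc(STACK_ALIGN, stack_size);
	if (stack == NULL)
		goto out;

	if (pthread_attr_init(&attr) != 0)
		goto out;
	attrp = &attr;

	/* Fails with EINVAL when stack_size is below PTHREAD_STACK_MIN. */
	if (pthread_attr_setstack(attrp, stack, stack_size) != 0)
		goto out;

	if (pthread_create(&tid, attrp, fn, arg) != 0)
		goto out;

	/* The stack must stay valid until the thread has finished. */
	pthread_join(tid, NULL);
	ret = 0;

out:
	if (attrp != NULL)
		pthread_attr_destroy(attrp);
	free(stack);
	return ret;
}
```

Calling run_on_private_stack(256 * 1024, set_flag, &flag) should return 0 and leave flag set to 1, while a 1024-byte request fails in pthread_attr_setstack() and frees the buffer on the way out.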


> +
>  /* Launch threads, called at application init(). */
>  int
>  rte_eal_init(int argc, char **argv)
> @@ -1144,8 +1202,7 @@ rte_eal_init(int argc, char **argv)
>                 lcore_config[i].state = WAIT;
>
>                 /* create a thread for each lcore */
> -               ret = pthread_create(&lcore_config[i].thread_id, NULL,
> -                                    eal_thread_loop, (void *)(uintptr_t)i);
> +               ret = eal_worker_thread_create(internal_conf, i);
>                 if (ret != 0)
>                         rte_panic("Cannot create thread\n");
>
> diff --git a/lib/eal/windows/eal.c b/lib/eal/windows/eal.c
> index 122de2a319..5cd4a45872 100644
> --- a/lib/eal/windows/eal.c
> +++ b/lib/eal/windows/eal.c
> @@ -416,6 +416,12 @@ rte_eal_init(int argc, char **argv)
>                 config->main_lcore, (uintptr_t)pthread_self(), cpuset,
>                 ret == 0 ? "" : "...");
>
> +       if (internal_conf->huge_worker_stack_size != 0) {
> +               rte_eal_init_alert("Hugepage worker stacks not supported");
> +               rte_errno = ENOTSUP;
> +               return -1;
> +       }
> +

Same as FreeBSD, this is unneeded if option is parsed only in
lib/eal/linux/eal.c.

>         RTE_LCORE_FOREACH_WORKER(i) {
>
>                 /*
> --
> 2.17.1
>
  
Thomas Monjalon June 21, 2022, 10:37 a.m. UTC | #3
20/06/2022 10:35, David Marchand:
> On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
> >
> > Add support for using hugepages for worker lcore stack memory.  The
> > intent is to improve performance by reducing stack memory related TLB
> > misses and also by using memory local to the NUMA node of each lcore.
> > EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
> > the feature to be enabled at runtime.  If the size is not specified,
> > the system pthread stack size will be used.
> 
> - About the name of the option... I don't have a better name.
> 
> Just want to highlight, that what this patch does is use the DPDK
> memory allocator for the stack memory.
> It happens that DPDK memory allocator is primarily used with
> hugepages, but this is not systematic for example with the "no-huge"
> mode of the DPDK memory allocator.
> 
> IOW, in this patch current form, you can still run as:
> 
> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
> 
> Opinions?

The name of the option should not include "huge".
What about "--worker-stack" ?
If disabled (equal zero), the workers should use the default stack memory.
  
Don Wallwork June 21, 2022, 12:31 p.m. UTC | #4
On 6/21/2022 6:37 AM, Thomas Monjalon wrote:
> 20/06/2022 10:35, David Marchand:
>> On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
>>> Add support for using hugepages for worker lcore stack memory.  The
>>> intent is to improve performance by reducing stack memory related TLB
>>> misses and also by using memory local to the NUMA node of each lcore.
>>> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
>>> the feature to be enabled at runtime.  If the size is not specified,
>>> the system pthread stack size will be used.
>> - About the name of the option... I don't have a better name.
>>
>> Just want to highlight, that what this patch does is use the DPDK
>> memory allocator for the stack memory.
>> It happens that DPDK memory allocator is primarily used with
>> hugepages, but this is not systematic for example with the "no-huge"
>> mode of the DPDK memory allocator.
>>
>> IOW, in this patch current form, you can still run as:
>>
>> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
>>
>> Opinions?
> The name of the option should not include "huge".
> What about "--worker-stack" ?
> If disabled (equal zero), the workers should use the default stack memory.
>
>
Wouldn't that have the potential to create confusion?  The point of this
change is to allocate worker stacks from hugepages.  Removing huge
from the option name could give the impression that the command is
simply to control worker stack size.

Regarding your other comments, I'm working on another patch that will
address those.
  
Thomas Monjalon June 21, 2022, 2:42 p.m. UTC | #5
21/06/2022 14:31, Don Wallwork:
> On 6/21/2022 6:37 AM, Thomas Monjalon wrote:
> > 20/06/2022 10:35, David Marchand:
> >> On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
> >>> Add support for using hugepages for worker lcore stack memory.  The
> >>> intent is to improve performance by reducing stack memory related TLB
> >>> misses and also by using memory local to the NUMA node of each lcore.
> >>> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
> >>> the feature to be enabled at runtime.  If the size is not specified,
> >>> the system pthread stack size will be used.
> >> - About the name of the option... I don't have a better name.
> >>
> >> Just want to highlight, that what this patch does is use the DPDK
> >> memory allocator for the stack memory.
> >> It happens that DPDK memory allocator is primarily used with
> >> hugepages, but this is not systematic for example with the "no-huge"
> >> mode of the DPDK memory allocator.
> >>
> >> IOW, in this patch current form, you can still run as:
> >>
> >> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
> >>
> >> Opinions?
> > The name of the option should not include "huge".
> > What about "--worker-stack" ?
> > If disabled (equal zero), the workers should use the default stack memory.
> >
> >
> Wouldn't that have the potential to create confusion?  The point of this
> change is to allocate worker stacks from hugepages.  Removing huge
> from the option name could give the impression that the command is
> simply to control worker stack size.

It means if we control the worker stack size with a DPDK option,
DPDK memory will be used.
But we cannot force hugepage with this option.
Hugepage is not always available and it can be disabled in DPDK.

> Regarding your other comments, I'm working on another patch that will
> address those.
  
Don Wallwork June 21, 2022, 2:52 p.m. UTC | #6
On 6/21/2022 10:42 AM, Thomas Monjalon wrote:
> 21/06/2022 14:31, Don Wallwork:
>> On 6/21/2022 6:37 AM, Thomas Monjalon wrote:
>>> 20/06/2022 10:35, David Marchand:
>>>> On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
>>>>> Add support for using hugepages for worker lcore stack memory.  The
>>>>> intent is to improve performance by reducing stack memory related TLB
>>>>> misses and also by using memory local to the NUMA node of each lcore.
>>>>> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
>>>>> the feature to be enabled at runtime.  If the size is not specified,
>>>>> the system pthread stack size will be used.
>>>> - About the name of the option... I don't have a better name.
>>>>
>>>> Just want to highlight, that what this patch does is use the DPDK
>>>> memory allocator for the stack memory.
>>>> It happens that DPDK memory allocator is primarily used with
>>>> hugepages, but this is not systematic for example with the "no-huge"
>>>> mode of the DPDK memory allocator.
>>>>
>>>> IOW, in this patch current form, you can still run as:
>>>>
>>>> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
>>>>
>>>> Opinions?
>>> The name of the option should not include "huge".
>>> What about "--worker-stack" ?
>>> If disabled (equal zero), the workers should use the default stack memory.
>>>
>>>
>> Wouldn't that have the potential to create confusion?  The point of this
>> change is to allocate worker stacks from hugepages.  Removing huge
>> from the option name could give the impression that the command is
>> simply to control worker stack size.
> It means if we control the worker stack size with a DPDK option,
> DPDK memory will be used.
> But we cannot force hugepage with this option.
> Hugepage is not always available and it can be disabled in DPDK.

The command could be rejected if hugepages are not available.
That's not in the patch currently, but can be added.
  
Thomas Monjalon June 21, 2022, 3 p.m. UTC | #7
21/06/2022 16:52, Don Wallwork:
> On 6/21/2022 10:42 AM, Thomas Monjalon wrote:
> > 21/06/2022 14:31, Don Wallwork:
> >> On 6/21/2022 6:37 AM, Thomas Monjalon wrote:
> >>> 20/06/2022 10:35, David Marchand:
> >>>> On Tue, May 24, 2022 at 9:52 PM Don Wallwork <donw@xsightlabs.com> wrote:
> >>>>> Add support for using hugepages for worker lcore stack memory.  The
> >>>>> intent is to improve performance by reducing stack memory related TLB
> >>>>> misses and also by using memory local to the NUMA node of each lcore.
> >>>>> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added to allow
> >>>>> the feature to be enabled at runtime.  If the size is not specified,
> >>>>> the system pthread stack size will be used.
> >>>> - About the name of the option... I don't have a better name.
> >>>>
> >>>> Just want to highlight, that what this patch does is use the DPDK
> >>>> memory allocator for the stack memory.
> >>>> It happens that DPDK memory allocator is primarily used with
> >>>> hugepages, but this is not systematic for example with the "no-huge"
> >>>> mode of the DPDK memory allocator.
> >>>>
> >>>> IOW, in this patch current form, you can still run as:
> >>>>
> >>>> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
> >>>>
> >>>> Opinions?
> >>> The name of the option should not include "huge".
> >>> What about "--worker-stack" ?
> >>> If disabled (equal zero), the workers should use the default stack memory.
> >>>
> >>>
> >> Wouldn't that have the potential to create confusion?  The point of this
> >> change is to allocate worker stacks from hugepages.  Removing huge
> >> from the option name could give the impression that the command is
> >> simply to control worker stack size.
> > It means if we control the worker stack size with a DPDK option,
> > DPDK memory will be used.
> > But we cannot force hugepage with this option.
> > Hugepage is not always available and it can be disabled in DPDK.
> 
> The command could be rejected if hugepages are not available.
> That's not in the patch currently, but can be added.

David, Anatoly, Dmitry, what do you think?
  
Honnappa Nagarahalli June 21, 2022, 4:32 p.m. UTC | #8
<snip>
> 
> 21/06/2022 16:52, Don Wallwork:
> > On 6/21/2022 10:42 AM, Thomas Monjalon wrote:
> > > 21/06/2022 14:31, Don Wallwork:
> > >> On 6/21/2022 6:37 AM, Thomas Monjalon wrote:
> > >>> 20/06/2022 10:35, David Marchand:
> > >>>> On Tue, May 24, 2022 at 9:52 PM Don Wallwork
> <donw@xsightlabs.com> wrote:
> > >>>>> Add support for using hugepages for worker lcore stack memory.
> > >>>>> The intent is to improve performance by reducing stack memory
> > >>>>> related TLB misses and also by using memory local to the NUMA node
> of each lcore.
> > >>>>> EAL option '--huge-worker-stack [stack-size-in-kbytes]' is added
> > >>>>> to allow the feature to be enabled at runtime.  If the size is
> > >>>>> not specified, the system pthread stack size will be used.
> > >>>> - About the name of the option... I don't have a better name.
> > >>>>
> > >>>> Just want to highlight, that what this patch does is use the DPDK
> > >>>> memory allocator for the stack memory.
> > >>>> It happens that DPDK memory allocator is primarily used with
> > >>>> hugepages, but this is not systematic for example with the "no-huge"
> > >>>> mode of the DPDK memory allocator.
> > >>>>
> > >>>> IOW, in this patch current form, you can still run as:
> > >>>>
> > >>>> # dpdk-testpmd -c 3 --no-huge --huge-worker-stack=16 -m 40 -- etc...
> > >>>>
> > >>>> Opinions?
> > >>> The name of the option should not include "huge".
> > >>> What about "--worker-stack" ?
> > >>> If disabled (equal zero), the workers should use the default stack memory.
> > >>>
> > >>>
> > >> Wouldn't that have the potential to create confusion?  The point of
> > >> this change is to allocate worker stacks from hugepages.  Removing
> > >> huge from the option name could give the impression that the
> > >> command is simply to control worker stack size.
> > > It means if we control the worker stack size with a DPDK option,
> > > DPDK memory will be used.
> > > But we cannot force hugepage with this option.
> > > Hugepage is not always available and it can be disabled in DPDK.
> >
> > The command could be rejected if hugepages are not available.
> > That's not in the patch currently, but can be added.
> 
> David, Anatoly, Dmitry, what do you think?
> 
It should be a warning, but the application can continue to run.

  
David Marchand June 21, 2022, 7:33 p.m. UTC | #9
On Tue, Jun 21, 2022 at 5:00 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > >>> The name of the option should not include "huge".
> > >>> What about "--worker-stack" ?
> > >>> If disabled (equal zero), the workers should use the default stack memory.
> > >>>
> > >>>
> > >> Wouldn't that have the potential to create confusion?  The point of this
> > >> change is to allocate worker stacks from hugepages.  Removing huge
> > >> from the option name could give the impression that the command is
> > >> simply to control worker stack size.
> > > It means if we control the worker stack size with a DPDK option,
> > > DPDK memory will be used.
> > > But we cannot force hugepage with this option.
> > > Hugepage is not always available and it can be disabled in DPDK.
> >
> > The command could be rejected if hugepages are not available.
> > That's not in the patch currently, but can be added.
>
> David, Anatoly, Dmitry, what do you think?

We have other incompatible EAL options whose combination triggers an
init failure.
This is acceptable to me.
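
For illustration, the rejection discussed in this thread (failing initialization when hugepage worker stacks are requested but hugepages are disabled) could look roughly like the sketch below. The struct and function names are hypothetical stand-ins, not code from the patch, and the message text is an assumption. The real EAL would report the error through its own logging helpers rather than fprintf().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the two relevant EAL configuration fields. */
struct sketch_config {
	bool no_hugetlbfs;             /* set when --no-huge is given */
	size_t huge_worker_stack_size; /* 0 means the option was not given */
};

/* Return 0 if the options are compatible, -1 if they conflict. */
static int
sketch_check_worker_stack(const struct sketch_config *conf)
{
	if (conf->huge_worker_stack_size != 0 && conf->no_hugetlbfs) {
		fprintf(stderr,
			"--huge-worker-stack cannot be combined with --no-huge\n");
		return -1;
	}
	return 0;
}
```

The warning-only alternative suggested earlier in the thread would print the same message and return 0 instead, leaving the workers on stacks from whatever memory the allocator provides.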
  

Patch

diff --git a/doc/guides/linux_gsg/eal_args.include.rst b/doc/guides/linux_gsg/eal_args.include.rst
index 3549a0cf56..9cfbf7de84 100644
--- a/doc/guides/linux_gsg/eal_args.include.rst
+++ b/doc/guides/linux_gsg/eal_args.include.rst
@@ -116,6 +116,12 @@  Memory-related options
 
     Force IOVA mode to a specific value.
 
+*   ``--huge-worker-stack[=size]``
+
+    Allocate worker stack memory from hugepage memory. Stack size defaults
+    to system pthread stack size unless the optional size (in kbytes) is
+    specified.
+
 Debugging options
 ~~~~~~~~~~~~~~~~~
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 5f0748fba1..e74516f0cf 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -329,6 +329,27 @@  Another option is to use bigger page sizes. Since fewer pages are required to
 cover the same memory area, fewer file descriptors will be stored internally
 by EAL.
 
+.. _huge-worker-stack:
+
+Hugepage Worker Stacks
+^^^^^^^^^^^^^^^^^^^^^^
+
+When the ``--huge-worker-stack[=size]`` EAL option is specified, worker
+thread stacks are allocated from hugepage memory local to the NUMA node
+of the thread. Worker stack size defaults to system pthread stack size
+if the optional size parameter is not specified.
+
+.. warning::
+    Stacks allocated from hugepage memory are not protected by guard
+    pages. Worker stacks must be sufficiently sized to prevent stack
+    overflow when this option is used.
+
+    As with normal thread stacks, hugepage worker thread stack size is
+    fixed and is not dynamically resized. Therefore, an application that
+    is free of stack page faults under a given load should be safe with
+    hugepage worker thread stacks given the same thread stack size and
+    loading conditions.
+
 Support for Externally Allocated Memory
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index f247a42455..02e59051e8 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -103,6 +103,7 @@  eal_long_options[] = {
 	{OPT_TELEMETRY,         0, NULL, OPT_TELEMETRY_NUM        },
 	{OPT_NO_TELEMETRY,      0, NULL, OPT_NO_TELEMETRY_NUM     },
 	{OPT_FORCE_MAX_SIMD_BITWIDTH, 1, NULL, OPT_FORCE_MAX_SIMD_BITWIDTH_NUM},
+	{OPT_HUGE_WORKER_STACK, 2, NULL, OPT_HUGE_WORKER_STACK_NUM     },
 
 	{0,                     0, NULL, 0                        }
 };
@@ -1618,6 +1619,26 @@  eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
 	return -1;
 }
 
+static int
+eal_parse_huge_worker_stack(const char *arg, size_t *huge_worker_stack_size)
+{
+	size_t worker_stack_size;
+	char *end;
+
+	if (arg == NULL || arg[0] == '\0') {
+		*huge_worker_stack_size = WORKER_STACK_SIZE_FROM_OS;
+		return 0;
+	}
+	errno = 0;
+	worker_stack_size = strtoul(arg, &end, 10);
+	if (errno || end == NULL || worker_stack_size == 0 ||
+	    worker_stack_size >= (size_t)-1 / 1024)
+		return -1;
+
+	*huge_worker_stack_size = worker_stack_size * 1024;
+	return 0;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 			struct internal_config *conf)
@@ -1921,6 +1942,15 @@  eal_parse_common_option(int opt, const char *optarg,
 		}
 		break;
 
+	case OPT_HUGE_WORKER_STACK_NUM:
+		if (eal_parse_huge_worker_stack(optarg,
+						&conf->huge_worker_stack_size) < 0) {
+			RTE_LOG(ERR, EAL, "invalid parameter for --"
+				OPT_HUGE_WORKER_STACK"\n");
+			return -1;
+		}
+		break;
+
 	/* don't know what to do, leave this to caller */
 	default:
 		return 1;
@@ -2235,5 +2265,10 @@  eal_common_usage(void)
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
 	       "  --"OPT_NO_SHCONF"         No shared config (mmap'd files)\n"
+	       "  --"OPT_HUGE_WORKER_STACK"[=size]\n"
+	       "                      Allocate worker thread stacks from\n"
+	       "                      hugepage memory. Size is in units of\n"
+	       "                      kbytes and defaults to system thread\n"
+	       "                      stack size if not specified.\n"
 	       "\n", RTE_MAX_LCORE);
 }
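
One detail of eal_parse_huge_worker_stack() above: strtoul() always stores a non-NULL pointer through its second argument, so the `end == NULL` test cannot fire, and an argument such as "16abc" is accepted as 16. The sketch below shows a stricter variant. It is an illustration rather than a proposed replacement, and the function name and sentinel macro are hypothetical.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sentinel mirroring WORKER_STACK_SIZE_FROM_OS in the patch. */
#define SKETCH_STACK_SIZE_FROM_OS SIZE_MAX

/* Parse an optional size in kbytes and store the size in bytes in *out. */
static int
sketch_parse_stack_kb(const char *arg, size_t *out)
{
	unsigned long kb;
	char *end;

	/* No argument: fall back to the system thread stack size. */
	if (arg == NULL || arg[0] == '\0') {
		*out = SKETCH_STACK_SIZE_FROM_OS;
		return 0;
	}
	/* Reject signs and leading whitespace, which strtoul() would accept. */
	if (!isdigit((unsigned char)arg[0]))
		return -1;

	errno = 0;
	kb = strtoul(arg, &end, 10);
	/* *end != '\0' catches trailing characters such as "16abc". */
	if (errno != 0 || *end != '\0' || kb == 0 || kb >= SIZE_MAX / 1024)
		return -1;

	*out = (size_t)kb * 1024;
	return 0;
}
```

The leading-digit check also closes a corner case in which strtoul() negates a value written with a minus sign, which could otherwise wrap around to a small positive size.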
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b71faadd18..5e154967e4 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -48,6 +48,9 @@  struct hugepage_file_discipline {
 	bool unlink_existing;
 };
 
+/** Worker hugepage stack size should default to OS value. */
+#define WORKER_STACK_SIZE_FROM_OS ((size_t)~0)
+
 /**
  * internal configuration
  */
@@ -102,6 +105,7 @@  struct internal_config {
 	unsigned int no_telemetry; /**< true to disable Telemetry */
 	struct simd_bitwidth max_simd_bitwidth;
 	/**< max simd bitwidth path to use */
+	size_t huge_worker_stack_size; /**< worker thread stack size */
 };
 
 void eal_reset_internal_config(struct internal_config *internal_cfg);
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 8e4f7202a2..3cc9cb6412 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -87,6 +87,8 @@  enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_HUGE_WORKER_STACK  "huge-worker-stack"
+	OPT_HUGE_WORKER_STACK_NUM,
 
 	OPT_LONG_MAX_NUM
 };
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index a6b20960f2..7368956649 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -795,6 +795,12 @@  rte_eal_init(int argc, char **argv)
 		config->main_lcore, (uintptr_t)pthread_self(), cpuset,
 		ret == 0 ? "" : "...");
 
+	if (internal_conf->huge_worker_stack_size != 0) {
+		rte_eal_init_alert("Hugepage worker stacks not supported");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
 	RTE_LCORE_FOREACH_WORKER(i) {
 
 		/*
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 1ef263434a..d28a0fdb78 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -857,6 +857,64 @@  is_iommu_enabled(void)
 	return n > 2;
 }
 
+static int
+eal_worker_thread_create(struct internal_config *internal_conf,
+			 int lcore_id)
+{
+	pthread_attr_t attr;
+	size_t stack_size;
+	void *stack_ptr;
+	int ret;
+
+	if (internal_conf->huge_worker_stack_size == 0)
+		return pthread_create(&lcore_config[lcore_id].thread_id,
+				      NULL,
+				      eal_thread_loop,
+				      (void *)(uintptr_t)lcore_id);
+
+	/* Allocate NUMA aware stack memory and set pthread attributes */
+	if (pthread_attr_init(&attr) != 0) {
+		rte_eal_init_alert("Cannot init pthread attributes");
+		rte_errno = EFAULT;
+		return -1;
+	}
+	if (internal_conf->huge_worker_stack_size == WORKER_STACK_SIZE_FROM_OS) {
+		if (pthread_attr_getstacksize(&attr, &stack_size) != 0) {
+			rte_errno = EFAULT;
+			return -1;
+		}
+	} else {
+		stack_size = internal_conf->huge_worker_stack_size;
+	}
+	stack_ptr = rte_zmalloc_socket("lcore_stack",
+				       stack_size,
+				       RTE_CACHE_LINE_SIZE,
+				       rte_lcore_to_socket_id(lcore_id));
+
+	if (stack_ptr == NULL) {
+		rte_eal_init_alert("Cannot allocate worker lcore stack memory");
+		rte_errno = ENOMEM;
+		return -1;
+	}
+
+	if (pthread_attr_setstack(&attr, stack_ptr, stack_size) != 0) {
+		rte_eal_init_alert("Cannot set pthread stack attributes");
+		rte_errno = EFAULT;
+		return -1;
+	}
+
+	ret = pthread_create(&lcore_config[lcore_id].thread_id, &attr,
+			     eal_thread_loop,
+			     (void *)(uintptr_t)lcore_id);
+
+	if (pthread_attr_destroy(&attr) != 0) {
+		rte_eal_init_alert("Cannot destroy pthread attributes");
+		rte_errno = EFAULT;
+		return -1;
+	}
+	return ret;
+}
+
 /* Launch threads, called at application init(). */
 int
 rte_eal_init(int argc, char **argv)
@@ -1144,8 +1202,7 @@  rte_eal_init(int argc, char **argv)
 		lcore_config[i].state = WAIT;
 
 		/* create a thread for each lcore */
-		ret = pthread_create(&lcore_config[i].thread_id, NULL,
-				     eal_thread_loop, (void *)(uintptr_t)i);
+		ret = eal_worker_thread_create(internal_conf, i);
 		if (ret != 0)
 			rte_panic("Cannot create thread\n");
 
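
In eal_worker_thread_create() above, the error paths after pthread_attr_init() return without destroying the attribute object, and a failure in pthread_attr_setstack() or pthread_create() leaves the allocated stack in place. Since the caller panics on failure this has little practical effect, but a version with cleanup might look like the sketch below. It assumes posix_memalign() as a stand-in for rte_zmalloc_socket(), so it is neither NUMA-aware nor hugepage-backed, and all names are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Example worker: writes through its argument to show that it ran. */
static void *
sketch_worker(void *arg)
{
	*(int *)arg = 42;
	return NULL;
}

/*
 * Start fn on a separately allocated stack. A stack_size of 0 selects the
 * default pthread stack size. On success, *stack_out receives the stack so
 * the caller can free() it after pthread_join().
 */
static int
sketch_thread_create(pthread_t *tid, size_t stack_size,
		     void *(*fn)(void *), void *arg, void **stack_out)
{
	long page = sysconf(_SC_PAGESIZE);
	pthread_attr_t attr;
	void *stack = NULL;
	int ret;

	if (page <= 0)
		return EINVAL;
	ret = pthread_attr_init(&attr);
	if (ret != 0)
		return ret;

	if (stack_size == 0) {
		ret = pthread_attr_getstacksize(&attr, &stack_size);
		if (ret != 0)
			goto out_attr;
	}

	/* Page-aligned and zeroed, standing in for rte_zmalloc_socket(). */
	ret = posix_memalign(&stack, (size_t)page, stack_size);
	if (ret != 0)
		goto out_attr;
	memset(stack, 0, stack_size);

	ret = pthread_attr_setstack(&attr, stack, stack_size);
	if (ret == 0)
		ret = pthread_create(tid, &attr, fn, arg);
	if (ret != 0)
		free(stack); /* the thread never started, so release the stack */
	else
		*stack_out = stack;

out_attr:
	pthread_attr_destroy(&attr); /* runs on every path after init */
	return ret;
}
```

A single exit label keeps pthread_attr_destroy() on every path after initialization, and the stack is released only when the thread did not start.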
diff --git a/lib/eal/windows/eal.c b/lib/eal/windows/eal.c
index 122de2a319..5cd4a45872 100644
--- a/lib/eal/windows/eal.c
+++ b/lib/eal/windows/eal.c
@@ -416,6 +416,12 @@  rte_eal_init(int argc, char **argv)
 		config->main_lcore, (uintptr_t)pthread_self(), cpuset,
 		ret == 0 ? "" : "...");
 
+	if (internal_conf->huge_worker_stack_size != 0) {
+		rte_eal_init_alert("Hugepage worker stacks not supported");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
 	RTE_LCORE_FOREACH_WORKER(i) {
 
 		/*