From patchwork Tue Sep 4 15:15:42 2018
X-Patchwork-Id: 44276
From: Anatoly Burakov
To: dev@dpdk.org
Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com,
    maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com,
    stable@dpdk.org
Date: Tue, 4 Sep 2018 16:15:42 +0100
Message-Id: <3ab2d0a7cf61d800e1b4bc8571d79e5a113e0dd7.1536073997.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [PATCH v3 1/9] fbarray: fix detach in noshconf mode

In noshconf mode, no shared files are created, but we're still trying to
unlink them, resulting in detach/destroy failure even though it should
have succeeded. Fix it by exiting early in noshconf mode.
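To make the failure mode concrete, here is a standalone editorial sketch
(not DPDK code; the path below is made up): unlink() of a file that was
never created fails with ENOENT, which is what turned a logically
successful detach/destroy into a reported failure before the early exit
added below.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* nothing was ever created at this (made-up) path... */
	if (unlink("/tmp/fbarray_never_created") < 0)
		/* ...so unlink() fails with ENOENT, as destroy did */
		printf("unlink: %s\n", strerror(errno));
	return 0;
}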
Fixes: 3ee2cde248a7 ("fbarray: support --no-shconf mode")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov
Reviewed-by: Maxime Coquelin
---
 lib/librte_eal/common/eal_common_fbarray.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 43caf3ced..ba6c4ae39 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -878,6 +878,10 @@ rte_fbarray_destroy(struct rte_fbarray *arr)
 	if (ret)
 		return ret;
 
+	/* with no shconf, there were never any files to begin with */
+	if (internal_config.no_shconf)
+		return 0;
+
 	/* try deleting the file */
 	eal_get_fbarray_path(path, sizeof(path), arr->name);

From patchwork Tue Sep 4 15:15:43 2018
X-Patchwork-Id: 44281
From: Anatoly Burakov
To: dev@dpdk.org
Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com,
    maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com,
    stable@dpdk.org
Date: Tue, 4 Sep 2018 16:15:43 +0100
Subject: [dpdk-dev] [PATCH v3 2/9] eal: don't allow legacy mode with in-memory mode

In-memory mode was never meant to support legacy mode, because we cannot
sort anonymous pages anyway.
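Some background on the sorting constraint, as an editorial sketch rather
than part of the patch: legacy mode resolves each page's physical address
through /proc/self/pagemap, roughly as below, and then remaps hugepage
files so that physically contiguous runs become virtually contiguous; an
anonymous mapping has no backing file to remap, so no such rearrangement
is possible. Linux-only, assumes 4K pages, and needs root for the PFN to
be visible.

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PAGE_SZ 4096ULL

/* read a virtual-to-physical translation from /proc/self/pagemap;
 * the PFN sits in bits 0-54 of each 8-byte entry
 */
static uint64_t virt2phys(const void *va)
{
	uint64_t entry = 0;
	off_t off = ((uintptr_t)va / PAGE_SZ) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
		entry = 0;
	close(fd);
	return (entry & ((1ULL << 55) - 1)) * PAGE_SZ +
			((uintptr_t)va % PAGE_SZ);
}

int main(void)
{
	int x = 42; /* touch the stack page so it is populated */

	printf("virt %p -> phys 0x%" PRIx64 "\n", (void *)&x, virt2phys(&x));
	return 0;
}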
Fixes: 72b49ff623c4 ("mem: support --in-memory mode")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov
Reviewed-by: Maxime Coquelin
---
 lib/librte_eal/common/eal_common_options.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index dd5f97402..873099acc 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -1390,6 +1390,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"--"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (internal_cfg->legacy_mem &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
 
 	return 0;
 }

From patchwork Tue Sep 4 15:15:44 2018
X-Patchwork-Id: 44280
From: Anatoly Burakov
To: dev@dpdk.org
Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com,
    maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com
Date: Tue, 4 Sep 2018 16:15:44 +0100
Message-Id: <53a660ffd746c81f59fc3e1f35ac2222a7237641.1536073997.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [PATCH v3 3/9] mem: raise maximum fd limit unconditionally

Previously, when we allocated hugepages, we closed the fd's corresponding
to them after we've done our mappings. Since we did mmap(), we didn't
actually lose the reference, but file descriptors used for mmap() do not
count against the fd limit.

Since we are going to store all of our fd's, we will hit the fd limit
much more often when using smaller page sizes. Fix this to raise the fd
limit to maximum unconditionally.
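A small self-contained illustration of the premise above (the file path
is made up): a mapping keeps the file referenced after close(), and the
mapping itself does not occupy an RLIMIT_NOFILE slot; it is only once
fd's are kept open, as the following patches start doing, that the limit
comes into play.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/mmap_refcount_demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	close(fd); /* the fd is gone, but the mapping keeps its reference */
	strcpy(p, "still mapped after close()");
	puts(p);
	munmap(p, 4096);
	return 0;
}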
Signed-off-by: Anatoly Burakov
Reviewed-by: Maxime Coquelin
---
 lib/librte_eal/linuxapp/eal/eal_memory.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index dbf19499e..dfb537f59 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include <sys/resource.h>
 #include
 #include
 #include
@@ -2204,6 +2205,25 @@ memseg_secondary_init(void)
 int
 rte_eal_memseg_init(void)
 {
+	/* increase rlimit to maximum */
+	struct rlimit lim;
+
+	if (getrlimit(RLIMIT_NOFILE, &lim) == 0) {
+		/* set limit to maximum */
+		lim.rlim_cur = lim.rlim_max;
+
+		if (setrlimit(RLIMIT_NOFILE, &lim) < 0) {
+			RTE_LOG(DEBUG, EAL, "Setting maximum number of open files failed: %s\n",
+					strerror(errno));
+		} else {
+			RTE_LOG(DEBUG, EAL, "Setting maximum number of open files to %"
+					PRIu64 "\n",
+					(uint64_t)lim.rlim_cur);
+		}
+	} else {
+		RTE_LOG(ERR, EAL, "Cannot get current resource limits\n");
+	}
+
 	return rte_eal_process_type() == RTE_PROC_PRIMARY ?
 #ifndef RTE_ARCH_64
 			memseg_primary_init_32() :

From patchwork Tue Sep 4 15:15:45 2018
X-Patchwork-Id: 44284
From: Anatoly Burakov
To: dev@dpdk.org
Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com,
    maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com
Date: Tue, 4 Sep 2018 16:15:45 +0100
Message-Id: <50837ea5ba781e6ec382d7159a1fcb83b8e5e124.1536073997.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [PATCH v3 4/9] memalloc: rename lock list to fd list

Previously, we were only using lock lists to store per-page lock fd's
because we cannot use modern fcntl() file description locks to lock
parts of the page in single file segments mode.
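An editorial sketch of the locking-granularity issue being described
(helper names are invented, and the file path is made up): flock() can
only take whole-file locks, while fcntl() byte-range locks could cover a
single page within a shared segment file but carry the problematic
semantics the code comments below mention, which is why a separate small
lockfile per page is used instead.

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* whole-file granularity only: cannot express "lock just page N" */
static int
lock_whole_file(int fd)
{
	return flock(fd, LOCK_SH | LOCK_NB);
}

/* byte-range granularity: covers one page inside a shared file, but
 * classic F_SETLK locks are per-process and are dropped as soon as the
 * process closes any fd referring to the file, making them unsuitable
 */
static int
lock_one_page(int fd, off_t page_off, off_t page_sz)
{
	struct flock fl = {
		.l_type = F_RDLCK,
		.l_whence = SEEK_SET,
		.l_start = page_off,
		.l_len = page_sz,
	};
	return fcntl(fd, F_SETLK, &fl);
}

int main(void)
{
	int fd = open("/tmp/lock_granularity_demo", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	lock_whole_file(fd);        /* all pages or nothing */
	lock_one_page(fd, 0, 4096); /* only bytes 0..4095 */
	close(fd);
	return 0;
}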
Now, we will be using this list to store either lock fd's (along with memseg list fd) in single file segments mode, or per-page fd's (and set memseg list fd to -1), so rename the list accordingly. Signed-off-by: Anatoly Burakov Reviewed-by: Maxime Coquelin --- lib/librte_eal/linuxapp/eal/eal_memalloc.c | 66 ++++++++++++---------- 1 file changed, 37 insertions(+), 29 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index aa95551a8..14bc5dce9 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -57,25 +57,33 @@ const int anonymous_hugepages_supported = */ static int fallocate_supported = -1; /* unknown */ -/* for single-file segments, we need some kind of mechanism to keep track of +/* + * we have two modes - single file segments, and file-per-page mode. + * + * for single-file segments, we need some kind of mechanism to keep track of * which hugepages can be freed back to the system, and which cannot. we cannot * use flock() because they don't allow locking parts of a file, and we cannot * use fcntl() due to issues with their semantics, so we will have to rely on a - * bunch of lockfiles for each page. + * bunch of lockfiles for each page. so, we will use 'fds' array to keep track + * of per-page lockfiles. we will store the actual segment list fd in the + * 'memseg_list_fd' field. + * + * for file-per-page mode, each page will have its own fd, so 'memseg_list_fd' + * will be invalid (set to -1), and we'll use 'fds' to keep track of page fd's. * * we cannot know how many pages a system will have in advance, but we do know * that they come in lists, and we know lengths of these lists. so, simply store * a malloc'd array of fd's indexed by list and segment index. * * they will be initialized at startup, and filled as we allocate/deallocate - * segments. also, use this to track memseg list proper fd. + * segments. */ static struct { int *fds; /**< dynamically allocated array of segment lock fd's */ int memseg_list_fd; /**< memseg list fd */ int len; /**< total length of the array */ int count; /**< entries used in an array */ -} lock_fds[RTE_MAX_MEMSEG_LISTS]; +} fd_list[RTE_MAX_MEMSEG_LISTS]; /** local copy of a memory map, used to synchronize memory hotplug in MP */ static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS]; @@ -209,12 +217,12 @@ static int get_segment_lock_fd(int list_idx, int seg_idx) char path[PATH_MAX] = {0}; int fd; - if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds)) + if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) return -1; - if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len) + if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) return -1; - fd = lock_fds[list_idx].fds[seg_idx]; + fd = fd_list[list_idx].fds[seg_idx]; /* does this lock already exist? 
*/ if (fd >= 0) return fd; @@ -236,8 +244,8 @@ static int get_segment_lock_fd(int list_idx, int seg_idx) return -1; } /* store it for future reference */ - lock_fds[list_idx].fds[seg_idx] = fd; - lock_fds[list_idx].count++; + fd_list[list_idx].fds[seg_idx] = fd; + fd_list[list_idx].count++; return fd; } @@ -245,12 +253,12 @@ static int unlock_segment(int list_idx, int seg_idx) { int fd, ret; - if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds)) + if (list_idx < 0 || list_idx >= (int)RTE_DIM(fd_list)) return -1; - if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len) + if (seg_idx < 0 || seg_idx >= fd_list[list_idx].len) return -1; - fd = lock_fds[list_idx].fds[seg_idx]; + fd = fd_list[list_idx].fds[seg_idx]; /* upgrade lock to exclusive to see if we can remove the lockfile */ ret = lock(fd, LOCK_EX); @@ -270,8 +278,8 @@ static int unlock_segment(int list_idx, int seg_idx) * and remove it from list anyway. */ close(fd); - lock_fds[list_idx].fds[seg_idx] = -1; - lock_fds[list_idx].count--; + fd_list[list_idx].fds[seg_idx] = -1; + fd_list[list_idx].count--; if (ret < 0) return -1; @@ -288,7 +296,7 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, /* create a hugepage file path */ eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx); - fd = lock_fds[list_idx].memseg_list_fd; + fd = fd_list[list_idx].memseg_list_fd; if (fd < 0) { fd = open(path, O_CREAT | O_RDWR, 0600); @@ -304,7 +312,7 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, close(fd); return -1; } - lock_fds[list_idx].memseg_list_fd = fd; + fd_list[list_idx].memseg_list_fd = fd; } } else { /* create a hugepage file path */ @@ -410,9 +418,9 @@ resize_hugefile(int fd, char *path, int list_idx, int seg_idx, * page file fd, so that one of the processes * could then delete the file after shrinking. */ - if (ret < 1 && lock_fds[list_idx].count == 0) { + if (ret < 1 && fd_list[list_idx].count == 0) { close(fd); - lock_fds[list_idx].memseg_list_fd = -1; + fd_list[list_idx].memseg_list_fd = -1; } if (ret < 0) { @@ -448,13 +456,13 @@ resize_hugefile(int fd, char *path, int list_idx, int seg_idx, * more segments active in this segment list, * and remove the file if there aren't. 
*/ - if (lock_fds[list_idx].count == 0) { + if (fd_list[list_idx].count == 0) { if (unlink(path)) RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n", __func__, path, strerror(errno)); close(fd); - lock_fds[list_idx].memseg_list_fd = -1; + fd_list[list_idx].memseg_list_fd = -1; } } } @@ -1319,7 +1327,7 @@ secondary_msl_create_walk(const struct rte_memseg_list *msl, } static int -secondary_lock_list_create_walk(const struct rte_memseg_list *msl, +fd_list_create_walk(const struct rte_memseg_list *msl, void *arg __rte_unused) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; @@ -1330,20 +1338,20 @@ secondary_lock_list_create_walk(const struct rte_memseg_list *msl, msl_idx = msl - mcfg->memsegs; len = msl->memseg_arr.len; - /* ensure we have space to store lock fd per each possible segment */ + /* ensure we have space to store fd per each possible segment */ data = malloc(sizeof(int) * len); if (data == NULL) { - RTE_LOG(ERR, EAL, "Unable to allocate space for lock descriptors\n"); + RTE_LOG(ERR, EAL, "Unable to allocate space for file descriptors\n"); return -1; } /* set all fd's as invalid */ for (i = 0; i < len; i++) data[i] = -1; - lock_fds[msl_idx].fds = data; - lock_fds[msl_idx].len = len; - lock_fds[msl_idx].count = 0; - lock_fds[msl_idx].memseg_list_fd = -1; + fd_list[msl_idx].fds = data; + fd_list[msl_idx].len = len; + fd_list[msl_idx].count = 0; + fd_list[msl_idx].memseg_list_fd = -1; return 0; } @@ -1355,9 +1363,9 @@ eal_memalloc_init(void) if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0) return -1; - /* initialize all of the lock fd lists */ + /* initialize all of the fd lists */ if (internal_config.single_file_segments) - if (rte_memseg_list_walk(secondary_lock_list_create_walk, NULL)) + if (rte_memseg_list_walk(fd_list_create_walk, NULL)) return -1; return 0; } From patchwork Tue Sep 4 15:15:46 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 44277 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 5B4564CB3; Tue, 4 Sep 2018 17:16:00 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id 8DB1E29CB for ; Tue, 4 Sep 2018 17:15:54 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 04 Sep 2018 08:15:53 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,329,1531810800"; d="scan'208";a="77880035" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by FMSMGA003.fm.intel.com with ESMTP; 04 Sep 2018 08:15:51 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w84FFp7Z025478; Tue, 4 Sep 2018 16:15:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w84FFpMZ022279; Tue, 4 Sep 2018 16:15:51 +0100 Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id w84FFpgY022273; Tue, 4 Sep 2018 16:15:51 +0100 From: Anatoly Burakov To: dev@dpdk.org Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com, maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com Date: Tue, 4 Sep 
2018 16:15:46 +0100 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [PATCH v3 5/9] memalloc: track page fd's in non-single file mode X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Previously, we were only tracking lock file fd's in single-file segments mode, but did not track fd's in non-single file mode because we didn't need to (mmap() call still kept the lock). Now that we are going to expose these fd's to the world, we need to have access to them, so track them even in non-single file segments mode. We don't need to close fd's after mmap() because we're still tracking them in an fd list. Also, for anonymous hugepages mode, fd will always be -1 so exit early on error. Signed-off-by: Anatoly Burakov Reviewed-by: Maxime Coquelin --- lib/librte_eal/linuxapp/eal/eal_memalloc.c | 44 ++++++++++++---------- 1 file changed, 25 insertions(+), 19 deletions(-) diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 14bc5dce9..7d536350e 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -318,18 +318,24 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi, /* create a hugepage file path */ eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx); - fd = open(path, O_CREAT | O_RDWR, 0600); + + fd = fd_list[list_idx].fds[seg_idx]; + if (fd < 0) { - RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__, - strerror(errno)); - return -1; - } - /* take out a read lock */ - if (lock(fd, LOCK_SH) < 0) { - RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n", - __func__, strerror(errno)); - close(fd); - return -1; + fd = open(path, O_CREAT | O_RDWR, 0600); + if (fd < 0) { + RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", + __func__, strerror(errno)); + return -1; + } + /* take out a read lock */ + if (lock(fd, LOCK_SH) < 0) { + RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n", + __func__, strerror(errno)); + close(fd); + return -1; + } + fd_list[list_idx].fds[seg_idx] = fd; } } return fd; @@ -601,10 +607,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, goto mapped; } #endif - /* for non-single file segments that aren't in-memory, we can close fd - * here */ - if (!internal_config.single_file_segments && !internal_config.in_memory) - close(fd); ms->addr = addr; ms->hugepage_sz = alloc_sz; @@ -634,7 +636,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, RTE_LOG(CRIT, EAL, "Can't mmap holes in our virtual address space\n"); } resized: - /* in-memory mode will never be single-file-segments mode */ + /* some codepaths will return negative fd, so exit early */ + if (fd < 0) + return -1; + if (internal_config.single_file_segments) { resize_hugefile(fd, path, list_idx, seg_idx, map_offset, alloc_sz, false); @@ -646,6 +651,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, lock(fd, LOCK_EX) == 1) unlink(path); close(fd); + fd_list[list_idx].fds[seg_idx] = -1; } return -1; } @@ -700,6 +706,7 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi, } /* closing fd will drop the lock */ close(fd); + fd_list[list_idx].fds[seg_idx] = -1; } memset(ms, 0, sizeof(*ms)); @@ -1364,8 +1371,7 @@ eal_memalloc_init(void) return -1; /* initialize all of the fd lists */ - if 
(internal_config.single_file_segments) - if (rte_memseg_list_walk(fd_list_create_walk, NULL)) - return -1; + if (rte_memseg_list_walk(fd_list_create_walk, NULL)) + return -1; return 0; } From patchwork Tue Sep 4 15:15:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 44278 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C7AF24CBD; Tue, 4 Sep 2018 17:16:01 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id EADC62B92 for ; Tue, 4 Sep 2018 17:15:54 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 04 Sep 2018 08:15:53 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,329,1531810800"; d="scan'208";a="77880039" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by FMSMGA003.fm.intel.com with ESMTP; 04 Sep 2018 08:15:52 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w84FFpgO025481; Tue, 4 Sep 2018 16:15:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w84FFpoL022293; Tue, 4 Sep 2018 16:15:51 +0100 Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id w84FFphU022285; Tue, 4 Sep 2018 16:15:51 +0100 From: Anatoly Burakov To: dev@dpdk.org Cc: Bruce Richardson , tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com, maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com Date: Tue, 4 Sep 2018 16:15:47 +0100 Message-Id: <6a24b2f23baccc78b7bab7ecfee84d3ed477ab80.1536073997.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [PATCH v3 6/9] memalloc: add EAL-internal API to get and set segment fd's X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Enable setting and retrieving segment fd's internally. For now, retrieving fd's will not be used anywhere until we get an external API, but it will be useful for things like virtio, where we wish to share segment fd's. Setting segment fd's will not be available as a public API at this time, but internally it is needed for legacy mode, because we're not allocating our hugepages in memalloc in legacy mode case, and we still need to store the fd. Another user of get segment fd API is memseg info dump, to show which pages use which fd's. Not supported on FreeBSD. 
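A hedged sketch of how these internal accessors are addressed (the
helper name is invented; eal_memalloc.h is an EAL-internal header): a
segment is identified by its memseg list index plus its index within
that list's fbarray, which is exactly how the dump_memseg() change in
the diff below computes them.

#include <rte_eal.h>
#include <rte_eal_memconfig.h>
#include <rte_fbarray.h>
#include <rte_memory.h>

#include "eal_memalloc.h" /* EAL-internal header */

/* illustrative only: translate (msl, ms) into the (list_idx, seg_idx)
 * pair that eal_memalloc_get_seg_fd()/eal_memalloc_set_seg_fd() expect
 */
static int
seg_fd_of(const struct rte_memseg_list *msl, const struct rte_memseg *ms)
{
	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
	int msl_idx = msl - mcfg->memsegs;
	int ms_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);

	if (ms_idx < 0)
		return -1;
	return eal_memalloc_get_seg_fd(msl_idx, ms_idx); /* -1 if no fd */
}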
Signed-off-by: Anatoly Burakov --- lib/librte_eal/bsdapp/eal/eal_memalloc.c | 12 +++++ lib/librte_eal/common/eal_common_memory.c | 8 +-- lib/librte_eal/common/eal_memalloc.h | 6 +++ lib/librte_eal/linuxapp/eal/eal_memalloc.c | 60 +++++++++++++++++----- lib/librte_eal/linuxapp/eal/eal_memory.c | 44 +++++++++++++--- 5 files changed, 109 insertions(+), 21 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c index f7f07abd6..a5fb09f71 100644 --- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c +++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c @@ -47,6 +47,18 @@ eal_memalloc_sync_with_primary(void) return -1; } +int +eal_memalloc_get_seg_fd(int list_idx, int seg_idx) +{ + return -1; +} + +int +eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) +{ + return -1; +} + int eal_memalloc_init(void) { diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index fbfb1b055..034c2026a 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -294,7 +294,7 @@ dump_memseg(const struct rte_memseg_list *msl, const struct rte_memseg *ms, void *arg) { struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - int msl_idx, ms_idx; + int msl_idx, ms_idx, fd; FILE *f = arg; msl_idx = msl - mcfg->memsegs; @@ -305,10 +305,11 @@ dump_memseg(const struct rte_memseg_list *msl, const struct rte_memseg *ms, if (ms_idx < 0) return -1; + fd = eal_memalloc_get_seg_fd(msl_idx, ms_idx); fprintf(f, "Segment %i-%i: IOVA:0x%"PRIx64", len:%zu, " "virt:%p, socket_id:%"PRId32", " "hugepage_sz:%"PRIu64", nchannel:%"PRIx32", " - "nrank:%"PRIx32"\n", + "nrank:%"PRIx32" fd:%i\n", msl_idx, ms_idx, ms->iova, ms->len, @@ -316,7 +317,8 @@ dump_memseg(const struct rte_memseg_list *msl, const struct rte_memseg *ms, ms->socket_id, ms->hugepage_sz, ms->nchannel, - ms->nrank); + ms->nrank, + fd); return 0; } diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index 36bb1a027..a46c69c72 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -76,6 +76,12 @@ eal_memalloc_mem_alloc_validator_unregister(const char *name, int socket_id); int eal_memalloc_mem_alloc_validate(int socket_id, size_t new_len); +int +eal_memalloc_get_seg_fd(int list_idx, int seg_idx); + +int +eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd); + int eal_memalloc_init(void); diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 7d536350e..b820989e9 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -1334,16 +1334,10 @@ secondary_msl_create_walk(const struct rte_memseg_list *msl, } static int -fd_list_create_walk(const struct rte_memseg_list *msl, - void *arg __rte_unused) +alloc_list(int list_idx, int len) { - struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; - unsigned int i, len; - int msl_idx; int *data; - - msl_idx = msl - mcfg->memsegs; - len = msl->memseg_arr.len; + int i; /* ensure we have space to store fd per each possible segment */ data = malloc(sizeof(int) * len); @@ -1355,14 +1349,56 @@ fd_list_create_walk(const struct rte_memseg_list *msl, for (i = 0; i < len; i++) data[i] = -1; - fd_list[msl_idx].fds = data; - fd_list[msl_idx].len = len; - fd_list[msl_idx].count = 0; - fd_list[msl_idx].memseg_list_fd = -1; + fd_list[list_idx].fds = data; + fd_list[list_idx].len = len; + 
fd_list[list_idx].count = 0; + fd_list[list_idx].memseg_list_fd = -1; return 0; } +static int +fd_list_create_walk(const struct rte_memseg_list *msl, + void *arg __rte_unused) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + unsigned int len; + int msl_idx; + + msl_idx = msl - mcfg->memsegs; + len = msl->memseg_arr.len; + + return alloc_list(msl_idx, len); +} + +int +eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + + /* if list is not allocated, allocate it */ + if (fd_list[list_idx].len == 0) { + int len = mcfg->memsegs[list_idx].memseg_arr.len; + + if (alloc_list(list_idx, len) < 0) + return -1; + } + fd_list[list_idx].fds[seg_idx] = fd; + + return 0; +} + +int +eal_memalloc_get_seg_fd(int list_idx, int seg_idx) +{ + if (internal_config.single_file_segments) + return fd_list[list_idx].memseg_list_fd; + /* list not initialized */ + if (fd_list[list_idx].len == 0) + return -1; + return fd_list[list_idx].fds[seg_idx]; +} + int eal_memalloc_init(void) { diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c index dfb537f59..e3ac24815 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memory.c +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c @@ -772,7 +772,10 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end) rte_fbarray_set_used(arr, ms_idx); - close(fd); + /* store segment fd internally */ + if (eal_memalloc_set_seg_fd(msl_idx, ms_idx, fd) < 0) + RTE_LOG(ERR, EAL, "Could not store segment fd: %s\n", + rte_strerror(rte_errno)); } RTE_LOG(DEBUG, EAL, "Allocated %" PRIu64 "M on socket %i\n", (seg_len * page_sz) >> 20, socket_id); @@ -1771,6 +1774,7 @@ getFileSize(int fd) static int eal_legacy_hugepage_attach(void) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; struct hugepage_file *hp = NULL; unsigned int num_hp = 0; unsigned int i = 0; @@ -1814,6 +1818,9 @@ eal_legacy_hugepage_attach(void) struct hugepage_file *hf = &hp[i]; size_t map_sz = hf->size; void *map_addr = hf->final_va; + int msl_idx, ms_idx; + struct rte_memseg_list *msl; + struct rte_memseg *ms; /* if size is zero, no more pages left */ if (map_sz == 0) @@ -1831,25 +1838,50 @@ eal_legacy_hugepage_attach(void) if (map_addr == MAP_FAILED) { RTE_LOG(ERR, EAL, "Could not map %s: %s\n", hf->filepath, strerror(errno)); - close(fd); - goto error; + goto fd_error; } /* set shared lock on the file. 
*/ if (flock(fd, LOCK_SH) < 0) { RTE_LOG(DEBUG, EAL, "%s(): Locking file failed: %s\n", __func__, strerror(errno)); - close(fd); - goto error; + goto fd_error; } - close(fd); + /* find segment data */ + msl = rte_mem_virt2memseg_list(map_addr); + if (msl == NULL) { + RTE_LOG(DEBUG, EAL, "%s(): Cannot find memseg list\n", + __func__); + goto fd_error; + } + ms = rte_mem_virt2memseg(map_addr, msl); + if (ms == NULL) { + RTE_LOG(DEBUG, EAL, "%s(): Cannot find memseg\n", + __func__); + goto fd_error; + } + + msl_idx = msl - mcfg->memsegs; + ms_idx = rte_fbarray_find_idx(&msl->memseg_arr, ms); + if (ms_idx < 0) { + RTE_LOG(DEBUG, EAL, "%s(): Cannot find memseg idx\n", + __func__); + goto fd_error; + } + + /* store segment fd internally */ + if (eal_memalloc_set_seg_fd(msl_idx, ms_idx, fd) < 0) + RTE_LOG(ERR, EAL, "Could not store segment fd: %s\n", + rte_strerror(rte_errno)); } /* unmap the hugepage config file, since we are done using it */ munmap(hp, size); close(fd_hugepage); return 0; +fd_error: + close(fd); error: /* map all segments into memory to make sure we get the addrs */ cur_seg = 0; From patchwork Tue Sep 4 15:15:48 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 44282 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id C39015B1C; Tue, 4 Sep 2018 17:16:07 +0200 (CEST) Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 46EA5326D for ; Tue, 4 Sep 2018 17:15:58 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 04 Sep 2018 08:15:57 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,329,1531810800"; d="scan'208";a="70483517" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by orsmga008.jf.intel.com with ESMTP; 04 Sep 2018 08:15:52 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w84FFp4f025484; Tue, 4 Sep 2018 16:15:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w84FFp36022303; Tue, 4 Sep 2018 16:15:51 +0100 Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id w84FFpd1022299; Tue, 4 Sep 2018 16:15:51 +0100 From: Anatoly Burakov To: dev@dpdk.org Cc: Bruce Richardson , tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com, maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com Date: Tue, 4 Sep 2018 16:15:48 +0100 Message-Id: <2d926d5508a98457acf7a4b2b004284bdbd22097.1536073997.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [PATCH v3 7/9] mem: add external API to retrieve page fd from EAL X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Now that we can retrieve page fd's internally, we can expose it as an external API. This will add two flavors of API - thread-safe and non-thread-safe. 
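For API readers, a hedged usage sketch of the two flavors (helper names
are invented, error handling is minimal): rte_memseg_walk() takes the
hotplug read-lock itself, so the thread-unsafe flavor is the one to call
from inside the walk callback, and the returned fd must be treated as
read-only per the documentation added below.

#include <stdio.h>

#include <rte_common.h>
#include <rte_errno.h>
#include <rte_memory.h>

/* print every segment's fd; do not close or modify it */
static int
print_seg_fd(const struct rte_memseg_list *msl __rte_unused,
		const struct rte_memseg *ms, void *arg __rte_unused)
{
	int fd = rte_memseg_get_fd_thread_unsafe(ms);

	if (fd < 0)
		printf("seg %p: no fd (rte_errno %d)\n", ms->addr, rte_errno);
	else
		printf("seg %p: fd %d\n", ms->addr, fd);
	return 0;
}

/* call after rte_eal_init(), outside any memory callback */
static void
dump_all_seg_fds(void)
{
	rte_memseg_walk(print_seg_fd, NULL);
}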
Fix up internal API's to return values we need without modifying rte_errno internally if called from within EAL. We do not want calling code to accidentally close an internal fd, so we make a duplicate of it before we return it to the user. Caller is therefore responsible for closing this fd. Signed-off-by: Anatoly Burakov Reviewed-by: Maxime Coquelin --- lib/librte_eal/bsdapp/eal/eal_memalloc.c | 5 ++- lib/librte_eal/common/eal_common_memory.c | 49 ++++++++++++++++++++++ lib/librte_eal/common/eal_memalloc.h | 2 + lib/librte_eal/common/include/rte_memory.h | 48 +++++++++++++++++++++ lib/librte_eal/linuxapp/eal/eal_memalloc.c | 21 ++++++---- lib/librte_eal/rte_eal_version.map | 2 + 6 files changed, 118 insertions(+), 9 deletions(-) diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c index a5fb09f71..80e4c3d4f 100644 --- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c +++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c @@ -4,6 +4,7 @@ #include +#include #include #include @@ -50,13 +51,13 @@ eal_memalloc_sync_with_primary(void) int eal_memalloc_get_seg_fd(int list_idx, int seg_idx) { - return -1; + return -ENOTSUP; } int eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) { - return -1; + return -ENOTSUP; } int diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 034c2026a..4a80deaf5 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -550,6 +550,55 @@ rte_memseg_list_walk(rte_memseg_list_walk_t func, void *arg) return ret; } +int __rte_experimental +rte_memseg_get_fd_thread_unsafe(const struct rte_memseg *ms) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + struct rte_fbarray *arr; + int msl_idx, seg_idx, ret; + + if (ms == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl = rte_mem_virt2memseg_list(ms->addr); + if (msl == NULL) { + rte_errno = EINVAL; + return -1; + } + arr = &msl->memseg_arr; + + msl_idx = msl - mcfg->memsegs; + seg_idx = rte_fbarray_find_idx(arr, ms); + + if (!rte_fbarray_is_used(arr, seg_idx)) { + rte_errno = ENOENT; + return -1; + } + + ret = eal_memalloc_get_seg_fd(msl_idx, seg_idx); + if (ret < 0) { + rte_errno = -ret; + ret = -1; + } + return ret; +} + +int __rte_experimental +rte_memseg_get_fd(const struct rte_memseg *ms) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int ret; + + rte_rwlock_read_lock(&mcfg->memory_hotplug_lock); + ret = rte_memseg_get_fd_thread_unsafe(ms); + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); + + return ret; +} + /* init memory subsystem */ int rte_eal_memory_init(void) diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index a46c69c72..70a214de4 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -76,9 +76,11 @@ eal_memalloc_mem_alloc_validator_unregister(const char *name, int socket_id); int eal_memalloc_mem_alloc_validate(int socket_id, size_t new_len); +/* returns fd or -errno */ int eal_memalloc_get_seg_fd(int list_idx, int seg_idx); +/* returns 0 or -errno */ int eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd); diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h index c4b7f4cff..0d2a30056 100644 --- a/lib/librte_eal/common/include/rte_memory.h +++ b/lib/librte_eal/common/include/rte_memory.h @@ -317,6 +317,54 @@ 
rte_memseg_contig_walk_thread_unsafe(rte_memseg_contig_walk_t func, void *arg); int __rte_experimental rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); +/** + * Return file descriptor associated with a particular memseg (if available). + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. + * + * @note This returns an internal file descriptor. Performing any operations on + * this file descriptor is inherently dangerous, so it should be treated + * as read-only for all intents and purposes. + * + * @param ms + * A pointer to memseg for which to get file descriptor. + * + * @return + * Valid file descriptor in case of success. + * -1 in case of error, with ``rte_errno`` set to the following values: + * - EINVAL - ``ms`` pointer was NULL or did not point to a valid memseg + * - ENODEV - ``ms`` fd is not available + * - ENOENT - ``ms`` is an unused segment + * - ENOTSUP - segment fd's are not supported + */ +int __rte_experimental +rte_memseg_get_fd(const struct rte_memseg *ms); + +/** + * Return file descriptor associated with a particular memseg (if available). + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @note This returns an internal file descriptor. Performing any operations on + * this file descriptor is inherently dangerous, so it should be treated + * as read-only for all intents and purposes. + * + * @param ms + * A pointer to memseg for which to get file descriptor. + * + * @return + * Valid file descriptor in case of success. + * -1 in case of error, with ``rte_errno`` set to the following values: + * - EINVAL - ``ms`` pointer was NULL or did not point to a valid memseg + * - ENODEV - ``ms`` fd is not available + * - ENOENT - ``ms`` is an unused segment + * - ENOTSUP - segment fd's are not supported + */ +int __rte_experimental +rte_memseg_get_fd_thread_unsafe(const struct rte_memseg *ms); + /** * Dump the physical memory layout to a file. 
* diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index b820989e9..21f842753 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -1381,7 +1382,7 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) int len = mcfg->memsegs[list_idx].memseg_arr.len; if (alloc_list(list_idx, len) < 0) - return -1; + return -ENOMEM; } fd_list[list_idx].fds[seg_idx] = fd; @@ -1391,12 +1392,18 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) int eal_memalloc_get_seg_fd(int list_idx, int seg_idx) { - if (internal_config.single_file_segments) - return fd_list[list_idx].memseg_list_fd; - /* list not initialized */ - if (fd_list[list_idx].len == 0) - return -1; - return fd_list[list_idx].fds[seg_idx]; + int fd; + if (internal_config.single_file_segments) { + fd = fd_list[list_idx].memseg_list_fd; + } else if (fd_list[list_idx].len == 0) { + /* list not initialized */ + fd = -1; + } else { + fd = fd_list[list_idx].fds[seg_idx]; + } + if (fd < 0) + return -ENODEV; + return fd; } int diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map index 344a43d32..e659e19d6 100644 --- a/lib/librte_eal/rte_eal_version.map +++ b/lib/librte_eal/rte_eal_version.map @@ -320,6 +320,8 @@ EXPERIMENTAL { rte_mem_virt2memseg_list; rte_memseg_contig_walk; rte_memseg_contig_walk_thread_unsafe; + rte_memseg_get_fd; + rte_memseg_get_fd_thread_unsafe; rte_memseg_list_walk; rte_memseg_list_walk_thread_unsafe; rte_memseg_walk; From patchwork Tue Sep 4 15:15:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 44283 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id EE1545B2E; Tue, 4 Sep 2018 17:16:08 +0200 (CEST) Received: from mga01.intel.com (mga01.intel.com [192.55.52.88]) by dpdk.org (Postfix) with ESMTP id 91709326D for ; Tue, 4 Sep 2018 17:15:59 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga002.fm.intel.com ([10.253.24.26]) by fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 04 Sep 2018 08:15:58 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,329,1531810800"; d="scan'208";a="82907296" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga002.fm.intel.com with ESMTP; 04 Sep 2018 08:15:52 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w84FFpew025487; Tue, 4 Sep 2018 16:15:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w84FFpRY022325; Tue, 4 Sep 2018 16:15:51 +0100 Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id w84FFpJf022317; Tue, 4 Sep 2018 16:15:51 +0100 From: Anatoly Burakov To: dev@dpdk.org Cc: Bruce Richardson , tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com, maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com Date: Tue, 4 Sep 2018 16:15:49 +0100 Message-Id: <155676c123bd8b8770dcd966c7aa347177a70660.1536073997.git.anatoly.burakov@intel.com> X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: 
References: Subject: [dpdk-dev] [PATCH v3 8/9] mem: allow querying offset into segment fd X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" In a few cases, user may need to query offset into fd for a particular memory segment (for example, to selectively map pages). This commit adds a new API to do that. Signed-off-by: Anatoly Burakov Reviewed-by: Maxime Coquelin --- Notes: v3: - Fix single file segments mode not working lib/librte_eal/bsdapp/eal/eal_memalloc.c | 6 +++ lib/librte_eal/common/eal_common_memory.c | 50 ++++++++++++++++++++++ lib/librte_eal/common/eal_memalloc.h | 3 ++ lib/librte_eal/common/include/rte_memory.h | 49 +++++++++++++++++++++ lib/librte_eal/linuxapp/eal/eal_memalloc.c | 24 +++++++++++ lib/librte_eal/rte_eal_version.map | 2 + 6 files changed, 134 insertions(+) diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c index 80e4c3d4f..06afbcc99 100644 --- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c +++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c @@ -60,6 +60,12 @@ eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd) return -ENOTSUP; } +int +eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) +{ + return -ENOTSUP; +} + int eal_memalloc_init(void) { diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c index 4a80deaf5..0b69804ff 100644 --- a/lib/librte_eal/common/eal_common_memory.c +++ b/lib/librte_eal/common/eal_common_memory.c @@ -599,6 +599,56 @@ rte_memseg_get_fd(const struct rte_memseg *ms) return ret; } +int __rte_experimental +rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms, + size_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + struct rte_memseg_list *msl; + struct rte_fbarray *arr; + int msl_idx, seg_idx, ret; + + if (ms == NULL || offset == NULL) { + rte_errno = EINVAL; + return -1; + } + + msl = rte_mem_virt2memseg_list(ms->addr); + if (msl == NULL) { + rte_errno = EINVAL; + return -1; + } + arr = &msl->memseg_arr; + + msl_idx = msl - mcfg->memsegs; + seg_idx = rte_fbarray_find_idx(arr, ms); + + if (!rte_fbarray_is_used(arr, seg_idx)) { + rte_errno = ENOENT; + return -1; + } + + ret = eal_memalloc_get_seg_fd_offset(msl_idx, seg_idx, offset); + if (ret < 0) { + rte_errno = -ret; + ret = -1; + } + return ret; +} + +int __rte_experimental +rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + int ret; + + rte_rwlock_read_lock(&mcfg->memory_hotplug_lock); + ret = rte_memseg_get_fd_offset_thread_unsafe(ms, offset); + rte_rwlock_read_unlock(&mcfg->memory_hotplug_lock); + + return ret; +} + /* init memory subsystem */ int rte_eal_memory_init(void) diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h index 70a214de4..af917c2f9 100644 --- a/lib/librte_eal/common/eal_memalloc.h +++ b/lib/librte_eal/common/eal_memalloc.h @@ -84,6 +84,9 @@ eal_memalloc_get_seg_fd(int list_idx, int seg_idx); int eal_memalloc_set_seg_fd(int list_idx, int seg_idx, int fd); +int +eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset); + int eal_memalloc_init(void); diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h index 0d2a30056..14bd277a4 100644 --- 
a/lib/librte_eal/common/include/rte_memory.h +++ b/lib/librte_eal/common/include/rte_memory.h @@ -365,6 +365,55 @@ rte_memseg_get_fd(const struct rte_memseg *ms); int __rte_experimental rte_memseg_get_fd_thread_unsafe(const struct rte_memseg *ms); +/** + * Get offset into segment file descriptor associated with a particular memseg + * (if available). + * + * @note This function read-locks the memory hotplug subsystem, and thus cannot + * be used within memory-related callback functions. + * + * @param ms + * A pointer to memseg for which to get file descriptor. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * Valid file descriptor in case of success. + * -1 in case of error, with ``rte_errno`` set to the following values: + * - EINVAL - ``ms`` pointer was NULL or did not point to a valid memseg + * - EINVAL - ``offset`` pointer was NULL + * - ENODEV - ``ms`` fd is not available + * - ENOENT - ``ms`` is an unused segment + * - ENOTSUP - segment fd's are not supported + */ +int __rte_experimental +rte_memseg_get_fd_offset(const struct rte_memseg *ms, size_t *offset); + +/** + * Get offset into segment file descriptor associated with a particular memseg + * (if available). + * + * @note This function does not perform any locking, and is only safe to call + * from within memory-related callback functions. + * + * @param ms + * A pointer to memseg for which to get file descriptor. + * @param offset + * A pointer to offset value where the result will be stored. + * + * @return + * Valid file descriptor in case of success. + * -1 in case of error, with ``rte_errno`` set to the following values: + * - EINVAL - ``ms`` pointer was NULL or did not point to a valid memseg + * - EINVAL - ``offset`` pointer was NULL + * - ENODEV - ``ms`` fd is not available + * - ENOENT - ``ms`` is an unused segment + * - ENOTSUP - segment fd's are not supported + */ +int __rte_experimental +rte_memseg_get_fd_offset_thread_unsafe(const struct rte_memseg *ms, + size_t *offset); + /** * Dump the physical memory layout to a file. * diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 21f842753..6fc9dc0ca 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -1406,6 +1406,30 @@ eal_memalloc_get_seg_fd(int list_idx, int seg_idx) return fd; } +int +eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset) +{ + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; + + /* fd_list not initialized? */ + if (fd_list[list_idx].len == 0) + return -ENODEV; + if (internal_config.single_file_segments) { + size_t pgsz = mcfg->memsegs[list_idx].page_sz; + + /* segment not active? */ + if (fd_list[list_idx].memseg_list_fd < 0) + return -ENOENT; + *offset = pgsz * seg_idx; + } else { + /* segment not active? 
*/ + if (fd_list[list_idx].fds[seg_idx] < 0) + return -ENOENT; + *offset = 0; + } + return 0; +} + int eal_memalloc_init(void) { diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map index e659e19d6..ba159f7cd 100644 --- a/lib/librte_eal/rte_eal_version.map +++ b/lib/librte_eal/rte_eal_version.map @@ -321,7 +321,9 @@ EXPERIMENTAL { rte_memseg_contig_walk; rte_memseg_contig_walk_thread_unsafe; rte_memseg_get_fd; + rte_memseg_get_fd_offset; rte_memseg_get_fd_thread_unsafe; + rte_memseg_get_fd_offset_thread_unsafe; rte_memseg_list_walk; rte_memseg_list_walk_thread_unsafe; rte_memseg_walk; From patchwork Tue Sep 4 15:15:50 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Burakov, Anatoly" X-Patchwork-Id: 44279 X-Patchwork-Delegate: thomas@monjalon.net Return-Path: X-Original-To: patchwork@dpdk.org Delivered-To: patchwork@dpdk.org Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 78A874F91; Tue, 4 Sep 2018 17:16:03 +0200 (CEST) Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by dpdk.org (Postfix) with ESMTP id 31CEE29CB for ; Tue, 4 Sep 2018 17:15:55 +0200 (CEST) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 04 Sep 2018 08:15:54 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,329,1531810800"; d="scan'208";a="68296158" Received: from irvmail001.ir.intel.com ([163.33.26.43]) by fmsmga008.fm.intel.com with ESMTP; 04 Sep 2018 08:15:52 -0700 Received: from sivswdev01.ir.intel.com (sivswdev01.ir.intel.com [10.237.217.45]) by irvmail001.ir.intel.com (8.14.3/8.13.6/MailSET/Hub) with ESMTP id w84FFpTV025490; Tue, 4 Sep 2018 16:15:51 +0100 Received: from sivswdev01.ir.intel.com (localhost [127.0.0.1]) by sivswdev01.ir.intel.com with ESMTP id w84FFppV022376; Tue, 4 Sep 2018 16:15:51 +0100 Received: (from aburakov@localhost) by sivswdev01.ir.intel.com with LOCAL id w84FFp9t022362; Tue, 4 Sep 2018 16:15:51 +0100 From: Anatoly Burakov To: dev@dpdk.org Cc: tiwei.bie@intel.com, ray.kinsella@intel.com, zhihong.wang@intel.com, maxime.coquelin@redhat.com, kuralamudhan.ramakrishnan@intel.com Date: Tue, 4 Sep 2018 16:15:50 +0100 Message-Id: X-Mailer: git-send-email 1.7.0.7 In-Reply-To: References: In-Reply-To: References: Subject: [dpdk-dev] [PATCH v3 9/9] mem: support using memfd segments for in-memory mode X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Enable using memfd-created segments if supported by the system. This will allow having real fd's for pages but without hugetlbfs mounts, which will enable in-memory mode to be used with virtio. The implementation is mostly piggy-backing on existing real-fd code, except that we no longer need to unlink any files or track per-page locks in single-file segments mode, because in-memory mode does not support secondary processes anyway. We move some checks from EAL command-line parsing code to memalloc because it is now possible to use single-file segments mode with in-memory mode, but only if memfd is supported. 
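A minimal standalone sketch of the mechanism this patch builds on
(assumes a kernel with memfd support and glibc 2.27+ for the
memfd_create() wrapper; the patch itself additionally passes MFD_HUGETLB
plus a log2-of-page-size flag, while plain 4K backing is used here so
the sketch runs anywhere): a real fd backed by anonymous memory, with no
hugetlbfs mount required.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char *p;
	int fd = memfd_create("example_seg", 0); /* no filesystem path */

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* a real fd that can be passed over a unix socket, e.g. to virtio */
	strcpy(p, "fd-backed anonymous memory");
	printf("%s (fd %d)\n", p, fd);
	munmap(p, 4096);
	close(fd);
	return 0;
}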
Signed-off-by: Anatoly Burakov Reviewed-by: Maxime Coquelin --- lib/librte_eal/common/eal_common_options.c | 6 +- lib/librte_eal/linuxapp/eal/eal_memalloc.c | 265 ++++++++++++++++++--- 2 files changed, 235 insertions(+), 36 deletions(-) diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c index 873099acc..ddd624110 100644 --- a/lib/librte_eal/common/eal_common_options.c +++ b/lib/librte_eal/common/eal_common_options.c @@ -1384,10 +1384,10 @@ eal_check_common_options(struct internal_config *internal_cfg) " is only supported in non-legacy memory mode\n"); } if (internal_cfg->single_file_segments && - internal_cfg->hugepage_unlink) { + internal_cfg->hugepage_unlink && + !internal_cfg->in_memory) { RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is " - "not compatible with neither --"OPT_IN_MEMORY" nor " - "--"OPT_HUGE_UNLINK"\n"); + "not compatible with --"OPT_HUGE_UNLINK"\n"); return -1; } if (internal_cfg->legacy_mem && diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c index 6fc9dc0ca..b2e2a9599 100644 --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c @@ -52,6 +52,23 @@ const int anonymous_hugepages_supported = #define RTE_MAP_HUGE_SHIFT 26 #endif +/* + * we don't actually care if memfd itself is supported - we only need to check + * if memfd supports hugetlbfs, as that already implies memfd support. + * + * also, this is not a constant, because while we may be *compiled* with memfd + * hugetlbfs support, we might not be *running* on a system that supports memfd + * and/or memfd with hugetlbfs, so we need to be able to adjust this flag at + * runtime, and fall back to anonymous memory. + */ +int memfd_create_supported = +#ifdef MFD_HUGETLB +#define MEMFD_SUPPORTED + 1; +#else + 0; +#endif + /* * not all kernel version support fallocate on hugetlbfs, so fall back to * ftruncate and disallow deallocation if fallocate is not supported. 
@@ -191,6 +208,31 @@ get_file_size(int fd)
 	return st.st_size;
 }
 
+static inline uint32_t
+bsf64(uint64_t v)
+{
+	return (uint32_t)__builtin_ctzll(v);
+}
+
+static inline uint32_t
+log2_u64(uint64_t v)
+{
+	if (v == 0)
+		return 0;
+	v = rte_align64pow2(v);
+	return bsf64(v);
+}
+
+static int
+pagesz_flags(uint64_t page_sz)
+{
+	/* as per mmap() manpage, all page sizes are log2 of page size
+	 * shifted by MAP_HUGE_SHIFT
+	 */
+	int log2 = log2_u64(page_sz);
+	return log2 << RTE_MAP_HUGE_SHIFT;
+}
+
 /* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
 static int lock(int fd, int type)
 {
@@ -287,12 +329,64 @@ static int unlock_segment(int list_idx, int seg_idx)
 	return 0;
 }
 
+static int
+get_seg_memfd(struct hugepage_info *hi __rte_unused,
+		unsigned int list_idx __rte_unused,
+		unsigned int seg_idx __rte_unused)
+{
+#ifdef MEMFD_SUPPORTED
+	int fd;
+	char segname[250]; /* as per manpage, limit is 249 bytes plus null */
+
+	if (internal_config.single_file_segments) {
+		fd = fd_list[list_idx].memseg_list_fd;
+
+		if (fd < 0) {
+			int flags = MFD_HUGETLB | pagesz_flags(hi->hugepage_sz);
+
+			snprintf(segname, sizeof(segname), "seg_%i", list_idx);
+			fd = memfd_create(segname, flags);
+			if (fd < 0) {
+				RTE_LOG(DEBUG, EAL, "%s(): memfd create failed: %s\n",
+					__func__, strerror(errno));
+				return -1;
+			}
+			fd_list[list_idx].memseg_list_fd = fd;
+		}
+	} else {
+		fd = fd_list[list_idx].fds[seg_idx];
+
+		if (fd < 0) {
+			int flags = MFD_HUGETLB | pagesz_flags(hi->hugepage_sz);
+
+			snprintf(segname, sizeof(segname), "seg_%i-%i",
+					list_idx, seg_idx);
+			fd = memfd_create(segname, flags);
+			if (fd < 0) {
+				RTE_LOG(DEBUG, EAL, "%s(): memfd create failed: %s\n",
+					__func__, strerror(errno));
+				return -1;
+			}
+			fd_list[list_idx].fds[seg_idx] = fd;
+		}
+	}
+	return fd;
+#endif
+	return -1;
+}
+
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		unsigned int list_idx, unsigned int seg_idx)
 {
 	int fd;
 
+	/* for in-memory mode, we only make it here when we're sure we support
+	 * memfd, and this is a special case.
+	 */
+	if (internal_config.in_memory)
+		return get_seg_memfd(hi, list_idx, seg_idx);
+
 	if (internal_config.single_file_segments) {
 		/* create a hugepage file path */
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
@@ -347,6 +441,33 @@ resize_hugefile(int fd, char *path, int list_idx, int seg_idx,
 		uint64_t fa_offset, uint64_t page_sz, bool grow)
 {
 	bool again = false;
+
+	/* in-memory mode is a special case, because we don't need to perform
+	 * any locking, and we can be sure that fallocate() is supported.
+	 */
+	if (internal_config.in_memory) {
+		int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE |
+				FALLOC_FL_KEEP_SIZE;
+		int ret;
+
+		/* grow or shrink the file */
+		ret = fallocate(fd, flags, fa_offset, page_sz);
+
+		if (ret < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): fallocate() failed: %s\n",
+					__func__,
+					strerror(errno));
+			return -1;
+		}
+		/* increase/decrease total segment count */
+		fd_list[list_idx].count += (grow ? 1 : -1);
+		if (!grow && fd_list[list_idx].count == 0) {
+			close(fd_list[list_idx].memseg_list_fd);
+			fd_list[list_idx].memseg_list_fd = -1;
+		}
+		return 0;
+	}
+
 	do {
 		if (fallocate_supported == 0) {
 			/* we cannot deallocate memory if fallocate() is not
@@ -496,26 +617,34 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	void *new_addr;
 
 	alloc_sz = hi->hugepage_sz;
-	if (!internal_config.single_file_segments &&
-			internal_config.in_memory &&
-			anonymous_hugepages_supported) {
-		int log2, flags;
-
-		log2 = rte_log2_u32(alloc_sz);
-		/* as per mmap() manpage, all page sizes are log2 of page size
-		 * shifted by MAP_HUGE_SHIFT
-		 */
-		flags = (log2 << RTE_MAP_HUGE_SHIFT) | MAP_HUGETLB | MAP_FIXED |
+
+	/* these are checked at init, but code analyzers don't know that */
+	if (internal_config.in_memory && !anonymous_hugepages_supported) {
+		RTE_LOG(ERR, EAL, "Anonymous hugepages not supported, in-memory mode cannot allocate memory\n");
+		return -1;
+	}
+	if (internal_config.in_memory && !memfd_create_supported &&
+			internal_config.single_file_segments) {
+		RTE_LOG(ERR, EAL, "Single-file segments are not supported without memfd support\n");
+		return -1;
+	}
+
+	/* in-memory without memfd is a special case */
+	int mmap_flags;
+
+	if (internal_config.in_memory && !memfd_create_supported) {
+		int pagesz_flag, flags;
+
+		pagesz_flag = pagesz_flags(alloc_sz);
+		flags = pagesz_flag | MAP_HUGETLB | MAP_FIXED |
 				MAP_PRIVATE | MAP_ANONYMOUS;
 		fd = -1;
-		va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE, flags, -1, 0);
+		mmap_flags = flags;
 
-		/* single-file segments codepath will never be active because
-		 * in-memory mode is incompatible with it and it's stopped at
-		 * EAL initialization stage, however the compiler doesn't know
-		 * that and complains about map_offset being used uninitialized
-		 * on failure codepaths while having in-memory mode enabled. so,
-		 * assign a value here.
+		/* single-file segments codepath will never be active
+		 * here because in-memory mode is incompatible with the
+		 * fallback path, and it's stopped at EAL initialization
+		 * stage.
 		 */
 		map_offset = 0;
 	} else {
@@ -539,7 +668,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 					__func__, strerror(errno));
 				goto resized;
 			}
-			if (internal_config.hugepage_unlink) {
+			if (internal_config.hugepage_unlink &&
+					!internal_config.in_memory) {
 				if (unlink(path)) {
 					RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
 						__func__, strerror(errno));
@@ -547,16 +677,16 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			}
 		}
 	}
-
-	/*
-	 * map the segment, and populate page tables, the kernel fills
-	 * this segment with zeros if it's a new page.
-	 */
-	va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE,
-			MAP_SHARED | MAP_POPULATE | MAP_FIXED, fd,
-			map_offset);
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
 	}
 
+	/*
+	 * map the segment, and populate page tables, the kernel fills
+	 * this segment with zeros if it's a new page.
+	 */
+	va = mmap(addr, alloc_sz, PROT_READ | PROT_WRITE, mmap_flags, fd,
+			map_offset);
+
 	if (va == MAP_FAILED) {
 		RTE_LOG(DEBUG, EAL, "%s(): mmap() failed: %s\n", __func__,
 			strerror(errno));
@@ -663,7 +793,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 {
 	uint64_t map_offset;
 	char path[PATH_MAX];
-	int fd, ret;
+	int fd, ret = 0;
+	bool exit_early;
 
 	/* erase page data */
 	memset(ms->addr, 0, ms->len);
@@ -675,8 +806,17 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
+	exit_early = false;
+
+	/* if we're using anonymous hugepages, nothing to be done */
+	if (internal_config.in_memory && !memfd_create_supported)
+		exit_early = true;
+
 	/* if we've already unlinked the page, nothing needs to be done */
-	if (internal_config.hugepage_unlink) {
+	if (!internal_config.in_memory && internal_config.hugepage_unlink)
+		exit_early = true;
+
+	if (exit_early) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -699,11 +839,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	/* if we're able to take out a write lock, we're the last one
 	 * holding onto this page.
 	 */
-	ret = lock(fd, LOCK_EX);
-	if (ret >= 0) {
-		/* no one else is using this page */
-		if (ret == 1)
-			unlink(path);
+	if (!internal_config.in_memory) {
+		ret = lock(fd, LOCK_EX);
+		if (ret >= 0) {
+			/* no one else is using this page */
+			if (ret == 1)
+				unlink(path);
+		}
 	}
 	/* closing fd will drop the lock */
 	close(fd);
@@ -1406,6 +1548,35 @@ eal_memalloc_get_seg_fd(int list_idx, int seg_idx)
 	return fd;
 }
 
+static int
+test_memfd_create(void)
+{
+#ifdef MEMFD_SUPPORTED
+	unsigned int i;
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		uint64_t pagesz = internal_config.hugepage_info[i].hugepage_sz;
+		int pagesz_flag = pagesz_flags(pagesz);
+		int flags;
+
+		flags = pagesz_flag | MFD_HUGETLB;
+		int fd = memfd_create("test", flags);
+		if (fd < 0) {
+			/* we failed - let memalloc know this isn't working */
+			if (errno == EINVAL) {
+				memfd_create_supported = 0;
+				return 0; /* not supported */
+			}
+
+			/* we got other error - something's wrong */
+			return -1; /* error */
+		}
+		close(fd);
+		return 1; /* supported */
+	}
+#endif
+	return 0; /* not supported */
+}
+
 int
 eal_memalloc_get_seg_fd_offset(int list_idx, int seg_idx, size_t *offset)
 {
@@ -1436,6 +1607,34 @@ eal_memalloc_init(void)
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
 		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
 			return -1;
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			internal_config.in_memory) {
+		int mfd_res = test_memfd_create();
+
+		if (mfd_res < 0) {
+			RTE_LOG(ERR, EAL, "Unable to check if memfd is supported\n");
+			return -1;
+		}
+		if (mfd_res == 1)
+			RTE_LOG(DEBUG, EAL, "Using memfd for anonymous memory\n");
+		else
+			RTE_LOG(INFO, EAL, "Using memfd is not supported, falling back to anonymous hugepages\n");
+
+		/* we only support single-file segments mode with in-memory mode
+		 * if we support hugetlbfs with memfd_create. this code will
+		 * test if we do.
+		 */
+		if (internal_config.single_file_segments &&
+				mfd_res != 1) {
+			RTE_LOG(ERR, EAL, "Single-file segments mode cannot be used without memfd support\n");
+			return -1;
+		}
+		/* this cannot ever happen but better safe than sorry */
+		if (!anonymous_hugepages_supported) {
+			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
+			return -1;
+		}
+	}
 
 	/* initialize all of the fd lists */
 	if (rte_memseg_list_walk(fd_list_create_walk, NULL))