From patchwork Fri Jan 17 04:25:55 2020
X-Patchwork-Submitter: Takeshi Yoshimura
X-Patchwork-Id: 64822
X-Patchwork-Delegate: david.marchand@redhat.com
From: Takeshi Yoshimura
To: dev@dpdk.org
Cc: drc@ibm.com, Takeshi Yoshimura
Date: Fri, 17 Jan 2020 13:25:55 +0900
Message-Id: <20200117042555.22567-1-tyos@jp.ibm.com>
X-Mailer: git-send-email 2.20.1 (Apple Git-117)
Subject: [dpdk-dev] [PATCH] vfio: fix VFIO mapping failures in ppc64le
List-Id: DPDK patches and discussions

ppc64le failed when using large physical memory. I found problems in
two of my past commits.

In commit e072d16f8920 ("vfio: fix expanding DMA area in ppc64le"), I
added a sanity check using a mapped address to resolve an issue around
expanding the IOMMU window, but this was not enough: memory allocation
can return memory anywhere, depending on memory fragmentation. DPDK may
therefore still skip a DMA mapping, or attempt to unmap non-mapped DMA,
while expanding the IOMMU window. As a result, SPDK applications using
large physical memory frequently failed to communicate with NVMe
devices and/or went into an infinite loop.
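To illustrate the failure mode, here is a minimal, self-contained
sketch (not DPDK code; malloc stands in for DPDK's allocator, and the
names are hypothetical) of why comparing virtual addresses cannot tell
whether a segment is DMA-mapped:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* "tracked" remembers the address of a segment assumed mapped */
	void *seg = malloc(1 << 20);
	uintptr_t tracked = (uintptr_t)seg;

	free(seg);

	/* a different, never-mapped segment; depending on heap
	 * fragmentation the allocator may or may not reuse the address */
	void *other = malloc(1 << 20);

	if ((uintptr_t)other == tracked)
		printf("address reused: an addr_64 equality check would "
		       "mistake this unmapped segment for the mapped one\n");
	else
		printf("address not reused this run: whether the check "
		       "misfires depends entirely on fragmentation\n");

	free(other);
	return 0;
}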
The root cause of the bug was a gap between the memory segments managed
by DPDK and the firmware-level DMA mappings: DPDK's memory segments do
not record DMA-mapping state, so memseg_walk cannot determine whether
an iterated memory segment is mapped or not. This resulted in incorrect
DMA maps and unmaps.

This patch adds code to avoid iterating over non-mapped memory segments
during DMA mapping. memseg_walk only visits memory segments marked as
"used", so the code transiently marks the segments that are about to be
mapped or unmapped as "free".

Commit db90b4969e2e ("vfio: retry creating sPAPR DMA window") allowed
retrying different page levels and page sizes when creating the DMA
window. However, it can pick a page size that differs from the hugepage
size; this inconsistency caused failures during DMA mapping after the
window was created. This patch changes the retry loop to vary only the
page levels.

Fixes: e072d16f8920 ("vfio: fix expanding DMA area in ppc64le")
Fixes: db90b4969e2e ("vfio: retry creating sPAPR DMA window")

Signed-off-by: Takeshi Yoshimura
Reviewed-by: David Christensen
---
 lib/librte_eal/linux/eal/eal_vfio.c | 76 +++++++++++++----------------
 1 file changed, 33 insertions(+), 43 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
index 95f615c2e..01b5ef3f4 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -532,6 +532,17 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		return;
 	}
 
+#ifdef RTE_ARCH_PPC_64
+	ms = rte_mem_virt2memseg(addr, msl);
+	while (cur_len < len) {
+		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+
+		rte_fbarray_set_free(&msl->memseg_arr, idx);
+		cur_len += ms->len;
+		++ms;
+	}
+	cur_len = 0;
+#endif
 	/* memsegs are contiguous in memory */
 	ms = rte_mem_virt2memseg(addr, msl);
 	while (cur_len < len) {
@@ -551,6 +562,17 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 		cur_len += ms->len;
 		++ms;
 	}
+#ifdef RTE_ARCH_PPC_64
+	cur_len = 0;
+	ms = rte_mem_virt2memseg(addr, msl);
+	while (cur_len < len) {
+		int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+
+		rte_fbarray_set_used(&msl->memseg_arr, idx);
+		cur_len += ms->len;
+		++ms;
+	}
+#endif
 }
 
 static int
@@ -1416,16 +1438,11 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	return 0;
 }
 
-struct spapr_remap_walk_param {
-	int vfio_container_fd;
-	uint64_t addr_64;
-};
-
 static int
 vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 		const struct rte_memseg *ms, void *arg)
 {
-	struct spapr_remap_walk_param *param = arg;
+	int *vfio_container_fd = arg;
 
 	/* skip external memory that isn't a heap */
 	if (msl->external && !msl->heap)
@@ -1435,10 +1452,7 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	if (ms->addr_64 == param->addr_64)
-		return 0;
-
-	return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, ms->iova,
+	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
 			ms->len, 1);
 }
 
@@ -1446,7 +1460,7 @@ static int
 vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
 		const struct rte_memseg *ms, void *arg)
 {
-	struct spapr_remap_walk_param *param = arg;
+	int *vfio_container_fd = arg;
 
 	/* skip external memory that isn't a heap */
 	if (msl->external && !msl->heap)
@@ -1456,17 +1470,13 @@ vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	if (ms->addr_64 == param->addr_64)
-		return 0;
-
-	return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, ms->iova,
+	return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
 			ms->len, 0);
 }
 
 struct spapr_walk_param {
 	uint64_t window_size;
 	uint64_t hugepage_sz;
-	uint64_t addr_64;
 };
 
 static int
@@ -1484,10 +1494,6 @@ vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	/* do not iterate ms we haven't mapped yet */
-	if (param->addr_64 && ms->addr_64 == param->addr_64)
-		return 0;
-
 	if (max > param->window_size) {
 		param->hugepage_sz = ms->hugepage_sz;
 		param->window_size = max;
@@ -1531,20 +1537,11 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
 		/* try possible page_shift and levels for workaround */
 		uint32_t levels;
 
-		for (levels = 1; levels <= info.ddw.levels; levels++) {
-			uint32_t pgsizes = info.ddw.pgsizes;
-
-			while (pgsizes != 0) {
-				create->page_shift = 31 - __builtin_clz(pgsizes);
-				create->levels = levels;
-				ret = ioctl(vfio_container_fd,
-					VFIO_IOMMU_SPAPR_TCE_CREATE, create);
-				if (!ret)
-					break;
-				pgsizes &= ~(1 << create->page_shift);
-			}
-			if (!ret)
-				break;
+		for (levels = create->levels + 1;
+			ret && levels <= info.ddw.levels; levels++) {
+			create->levels = levels;
+			ret = ioctl(vfio_container_fd,
+				VFIO_IOMMU_SPAPR_TCE_CREATE, create);
 		}
 #endif
 	if (ret) {
@@ -1585,7 +1582,6 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 	/* check if window size needs to be adjusted */
 	memset(&param, 0, sizeof(param));
-	param.addr_64 = vaddr;
 
 	/* we're inside a callback so use thread-unsafe version */
 	if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
@@ -1610,14 +1606,9 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 	if (do_map) {
 		/* re-create window and remap the entire memory */
 		if (iova + len > create.window_size) {
-			struct spapr_remap_walk_param remap_param = {
-				.vfio_container_fd = vfio_container_fd,
-				.addr_64 = vaddr,
-			};
-
 			/* release all maps before recreating the window */
 			if (rte_memseg_walk_thread_unsafe(vfio_spapr_unmap_walk,
-					&remap_param) < 0) {
+					&vfio_container_fd) < 0) {
 				RTE_LOG(ERR, EAL, "Could not release DMA maps\n");
 				ret = -1;
 				goto out;
@@ -1644,7 +1635,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 
 			/* we're inside a callback, so use thread-unsafe version */
 			if (rte_memseg_walk_thread_unsafe(vfio_spapr_map_walk,
-					&remap_param) < 0) {
+					&vfio_container_fd) < 0) {
 				RTE_LOG(ERR, EAL, "Could not recreate DMA maps\n");
 				ret = -1;
 				goto out;
@@ -1691,7 +1682,6 @@ vfio_spapr_dma_map(int vfio_container_fd)
 	struct spapr_walk_param param;
 
 	memset(&param, 0, sizeof(param));
-	param.addr_64 = 0UL;
 
 	/* create DMA window from 0 to max(phys_addr + len) */
 	rte_memseg_walk(vfio_spapr_window_size_walk, &param);
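
For reference, a minimal sketch of the retry strategy after this fix.
This is not the DPDK code itself: try_create() is a hypothetical mock
standing in for ioctl(fd, VFIO_IOMMU_SPAPR_TCE_CREATE, ...), and the
success condition is invented. The point it illustrates is that only
the number of TCE levels is retried, while page_shift stays pinned to
the hugepage size so later DMA mappings remain consistent:

#include <stdint.h>
#include <stdio.h>

/* mock: pretend the platform needs at least 3 TCE levels */
static int
try_create(uint32_t page_shift, uint32_t levels)
{
	(void)page_shift;		/* fixed for every attempt */
	return (levels >= 3) ? 0 : -1;	/* hypothetical hardware limit */
}

int
main(void)
{
	uint32_t page_shift = 24;	/* log2 of a 16 MB hugepage */
	uint32_t max_levels = 5;	/* info.ddw.levels in the real code */
	uint32_t levels = 1;
	int ret = try_create(page_shift, levels);

	/* retry with more levels only; varying page_shift as the old
	 * code did could pick a size no hugepage actually uses */
	for (levels = 2; ret && levels <= max_levels; levels++)
		ret = try_create(page_shift, levels);

	printf("window created: ret=%d levels=%u page_shift=%u\n",
	       ret, levels - 1, page_shift);
	return 0;
}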