[2/2] vfio: fix partial DMA unmapping for VFIO type1

Message ID 20201012081106.10610-3-ndabilpuram@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Series: fix issue with partial DMA unmap

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/travis-robot success Travis build: passed
ci/Intel-compilation success Compilation OK

Commit Message

Nithin Dabilpuram Oct. 12, 2020, 8:11 a.m. UTC
  Partial unmapping is not supported for VFIO IOMMU type1
by the kernel. Though the kernel returns zero, the unmapped
size reported will not match the requested size. So check the
returned unmap size and return an error.

For DMA map/unmap triggered by heap allocations, maintain the
granularity of the memseg page size so that heap expansion
and contraction do not hit this issue.

For user-requested DMA map/unmap, disallow partial unmapping
for VFIO type1.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")
Cc: anatoly.burakov@intel.com
Cc: stable@dpdk.org

Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
 lib/librte_eal/linux/eal_vfio.h |  1 +
 2 files changed, 29 insertions(+), 6 deletions(-)
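
For illustration, a minimal standalone sketch of the kind of check this patch
adds, assuming a plain type1 container fd (the helper name and surrounding
setup are hypothetical; the real change lives in vfio_type1_dma_mem_map(), see
the diff quoted in the discussion below):

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Unmap [iova, iova + len) from a VFIO type1 container and treat a
 * short unmap as an error, mirroring the check added by this patch.
 */
static int
type1_dma_unmap_checked(int container_fd, uint64_t iova, uint64_t len)
{
	struct vfio_iommu_type1_dma_unmap dma_unmap;

	memset(&dma_unmap, 0, sizeof(dma_unmap));
	dma_unmap.argsz = sizeof(dma_unmap);
	dma_unmap.iova = iova;
	dma_unmap.size = len;

	if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) != 0)
		return -1;	/* genuine kernel error */

	/* Type1 does not split existing mappings: the ioctl can return 0
	 * while rewriting .size to the number of bytes actually unmapped,
	 * which is how a partial unmap request silently "succeeds".
	 */
	if (dma_unmap.size != len) {
		errno = EIO;
		return -1;
	}
	return 0;
}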
  

Comments

Burakov, Anatoly Oct. 14, 2020, 3:07 p.m. UTC | #1
On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> Partial unmapping is not supported for VFIO IOMMU type1
> by kernel. Though kernel gives return as zero, the unmapped size
> returned will not be same as expected. So check for
> returned unmap size and return error.
> 
> For case of DMA map/unmap triggered by heap allocations,
> maintain granularity of memseg page size so that heap
> expansion and contraction does not have this issue.

This is quite unfortunate, because there was a different bug that had to 
do with the kernel having a very limited number of mappings available [1], 
as a result of which the page concatenation code was added.

It should therefore be documented that the dma_entry_limit parameter 
should be adjusted should the user run out of DMA entries.

[1] 
https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/

> 
> For user requested DMA map/unmap disallow partial unmapping
> for VFIO type1.
> 
> Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> Cc: anatoly.burakov@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> ---
>   lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
>   lib/librte_eal/linux/eal_vfio.h |  1 +
>   2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> index d26e164..ef95259 100644
> --- a/lib/librte_eal/linux/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal_vfio.c
> @@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_TYPE1,
>   		.name = "Type 1",
> +		.partial_unmap = false,
>   		.dma_map_func = &vfio_type1_dma_map,
>   		.dma_user_map_func = &vfio_type1_dma_mem_map
>   	},
> @@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_SPAPR,
>   		.name = "sPAPR",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_spapr_dma_map,
>   		.dma_user_map_func = &vfio_spapr_dma_mem_map
>   	},
> @@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
>   	{
>   		.type_id = RTE_VFIO_NOIOMMU,
>   		.name = "No-IOMMU",
> +		.partial_unmap = true,
>   		.dma_map_func = &vfio_noiommu_dma_map,
>   		.dma_user_map_func = &vfio_noiommu_dma_mem_map
>   	},
> @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>   	/* for IOVA as VA mode, no need to care for IOVA addresses */
>   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
>   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> -					len, 0);
> +		uint64_t page_sz = msl->page_sz;
> +
> +		/* Maintain granularity of DMA map/unmap to memseg size */
> +		for (; cur_len < len; cur_len += page_sz) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> +						 vfio_va, page_sz, 0);
> +			vfio_va += page_sz;
> +		}
> +

You'd also have to revert d1c7c0cdf7bac5eb40d3a2a690453aefeee5887b 
because currently the PA path will opportunistically concatenate 
contiguous segments into a single mapping too.

>   		return;
>   	}
>   
> @@ -1383,6 +1393,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>   			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>   					errno, strerror(errno));
>   			return -1;
> +		} else if (dma_unmap.size != len) {
> +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> +				"remapping cleared instead of %"PRIu64"\n",
> +				(uint64_t)dma_unmap.size, len);
> +			rte_errno = EIO;
> +			return -1;
>   		}
>   	}
>   
> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>   		/* we're partially unmapping a previously mapped region, so we
>   		 * need to split entry into two.
>   		 */
> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> +			rte_errno = ENOTSUP;
> +			ret = -1;
> +			goto out;
> +		}

How would we ever arrive here if we never do more than 1 page worth of 
memory anyway? I don't think this is needed.

>   		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
>   			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
>   			rte_errno = ENOMEM;
> diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
> index cb2d35f..6ebaca6 100644
> --- a/lib/librte_eal/linux/eal_vfio.h
> +++ b/lib/librte_eal/linux/eal_vfio.h
> @@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
>   struct vfio_iommu_type {
>   	int type_id;
>   	const char *name;
> +	bool partial_unmap;
>   	vfio_dma_user_func_t dma_user_map_func;
>   	vfio_dma_func_t dma_map_func;
>   };
>
  
Nithin Dabilpuram Oct. 15, 2020, 6:09 a.m. UTC | #2
On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> External Email
> 
> ----------------------------------------------------------------------
> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > Partial unmapping is not supported for VFIO IOMMU type1
> > by kernel. Though kernel gives return as zero, the unmapped size
> > returned will not be same as expected. So check for
> > returned unmap size and return error.
> > 
> > For case of DMA map/unmap triggered by heap allocations,
> > maintain granularity of memseg page size so that heap
> > expansion and contraction does not have this issue.
> 
> This is quite unfortunate, because there was a different bug that had to do
> with kernel having a very limited number of mappings available [1], as a
> result of which the page concatenation code was added.
> 
> It should therefore be documented that the dma_entry_limit parameter should
> be adjusted should the user run out of the DMA entries.
> 
> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/

Ack, I'll document it in guides/linux_gsg/linux_drivers.rst in the VFIO section.
> 
> > 
> > For user requested DMA map/unmap disallow partial unmapping
> > for VFIO type1.
> > 
> > Fixes: 73a639085938 ("vfio: allow to map other memory regions")
> > Cc: anatoly.burakov@intel.com
> > Cc: stable@dpdk.org
> > 
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > ---
> >   lib/librte_eal/linux/eal_vfio.c | 34 ++++++++++++++++++++++++++++------
> >   lib/librte_eal/linux/eal_vfio.h |  1 +
> >   2 files changed, 29 insertions(+), 6 deletions(-)
> > 
> > diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
> > index d26e164..ef95259 100644
> > --- a/lib/librte_eal/linux/eal_vfio.c
> > +++ b/lib/librte_eal/linux/eal_vfio.c
> > @@ -69,6 +69,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_TYPE1,
> >   		.name = "Type 1",
> > +		.partial_unmap = false,
> >   		.dma_map_func = &vfio_type1_dma_map,
> >   		.dma_user_map_func = &vfio_type1_dma_mem_map
> >   	},
> > @@ -76,6 +77,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_SPAPR,
> >   		.name = "sPAPR",
> > +		.partial_unmap = true,
> >   		.dma_map_func = &vfio_spapr_dma_map,
> >   		.dma_user_map_func = &vfio_spapr_dma_mem_map
> >   	},
> > @@ -83,6 +85,7 @@ static const struct vfio_iommu_type iommu_types[] = {
> >   	{
> >   		.type_id = RTE_VFIO_NOIOMMU,
> >   		.name = "No-IOMMU",
> > +		.partial_unmap = true,
> >   		.dma_map_func = &vfio_noiommu_dma_map,
> >   		.dma_user_map_func = &vfio_noiommu_dma_mem_map
> >   	},
> > @@ -525,12 +528,19 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
> >   	/* for IOVA as VA mode, no need to care for IOVA addresses */
> >   	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
> >   		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
> > -		if (type == RTE_MEM_EVENT_ALLOC)
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 1);
> > -		else
> > -			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
> > -					len, 0);
> > +		uint64_t page_sz = msl->page_sz;
> > +
> > +		/* Maintain granularity of DMA map/unmap to memseg size */
> > +		for (; cur_len < len; cur_len += page_sz) {
> > +			if (type == RTE_MEM_EVENT_ALLOC)
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 1);
> > +			else
> > +				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
> > +						 vfio_va, page_sz, 0);
> > +			vfio_va += page_sz;
> > +		}
> > +
> 
> You'd also have to revert d1c7c0cdf7bac5eb40d3a2a690453aefeee5887b because
> currently the PA path will opportunistically concantenate contiguous
> segments into single mapping too.

Ack, I'll change it even for IOVA as PA mode. I missed that.
> 
> >   		return;
> >   	}
> > @@ -1383,6 +1393,12 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> >   			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> >   					errno, strerror(errno));
> >   			return -1;
> > +		} else if (dma_unmap.size != len) {
> > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > +				"remapping cleared instead of %"PRIu64"\n",
> > +				(uint64_t)dma_unmap.size, len);
> > +			rte_errno = EIO;
> > +			return -1;
> >   		}
> >   	}
> > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> >   		/* we're partially unmapping a previously mapped region, so we
> >   		 * need to split entry into two.
> >   		 */
> > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > +			rte_errno = ENOTSUP;
> > +			ret = -1;
> > +			goto out;
> > +		}
> 
> How would we ever arrive here if we never do more than 1 page worth of
> memory anyway? I don't think this is needed.

container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
and when he maps we don't split it, as we don't know anything about his memory.
So if he maps multiple pages and tries to unmap partially, then we should fail.
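
For example, a minimal sketch of the sequence in question (the sizes,
addresses and helper name are hypothetical; it assumes a type1 container and
the behaviour added by this patch):

#include <stdint.h>
#include <rte_vfio.h>

static int
user_partial_unmap_example(uint64_t va, uint64_t iova)
{
	/* The user maps one contiguous 4 MB region; EAL stores it as a
	 * single user_mem_map entry because it knows nothing about how
	 * the memory is backed, so it is never split on map.
	 */
	if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
				       va, iova, 4 * 1024 * 1024) != 0)
		return -1;

	/* Unmapping only the first 2 MB is a partial unmap of that entry.
	 * Type1 cannot split the kernel-side mapping, so with this patch
	 * the call fails and rte_errno is set to ENOTSUP.
	 */
	return rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
					    va, iova, 2 * 1024 * 1024);
}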

> 
> >   		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
> >   			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
> >   			rte_errno = ENOMEM;
> > diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
> > index cb2d35f..6ebaca6 100644
> > --- a/lib/librte_eal/linux/eal_vfio.h
> > +++ b/lib/librte_eal/linux/eal_vfio.h
> > @@ -113,6 +113,7 @@ typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
> >   struct vfio_iommu_type {
> >   	int type_id;
> >   	const char *name;
> > +	bool partial_unmap;
> >   	vfio_dma_user_func_t dma_user_map_func;
> >   	vfio_dma_func_t dma_map_func;
> >   };
> > 
> 
> 
> -- 
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 15, 2020, 10 a.m. UTC | #3
On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>> External Email
>>
>> ----------------------------------------------------------------------
>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>> Partial unmapping is not supported for VFIO IOMMU type1
>>> by kernel. Though kernel gives return as zero, the unmapped size
>>> returned will not be same as expected. So check for
>>> returned unmap size and return error.
>>>
>>> For case of DMA map/unmap triggered by heap allocations,
>>> maintain granularity of memseg page size so that heap
>>> expansion and contraction does not have this issue.
>>
>> This is quite unfortunate, because there was a different bug that had to do
>> with kernel having a very limited number of mappings available [1], as a
>> result of which the page concatenation code was added.
>>
>> It should therefore be documented that the dma_entry_limit parameter should
>> be adjusted should the user run out of the DMA entries.
>>
>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/

<snip>

>>>    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>    					errno, strerror(errno));
>>>    			return -1;
>>> +		} else if (dma_unmap.size != len) {
>>> +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>> +				"remapping cleared instead of %"PRIu64"\n",
>>> +				(uint64_t)dma_unmap.size, len);
>>> +			rte_errno = EIO;
>>> +			return -1;
>>>    		}
>>>    	}
>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>    		/* we're partially unmapping a previously mapped region, so we
>>>    		 * need to split entry into two.
>>>    		 */
>>> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>> +			rte_errno = ENOTSUP;
>>> +			ret = -1;
>>> +			goto out;
>>> +		}
>>
>> How would we ever arrive here if we never do more than 1 page worth of
>> memory anyway? I don't think this is needed.
> 
> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> and when he maps we don't split it as we don't about his memory.
> So if he maps multiple pages and tries to unmap partially, then we should fail.

Should we map it in page granularity then, instead of adding this 
discrepancy between EAL and user mapping? I.e. instead of adding a 
workaround, how about we just do the same thing for user mem mappings?
  
Nithin Dabilpuram Oct. 15, 2020, 11:38 a.m. UTC | #4
On Thu, Oct 15, 2020 at 11:00:59AM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > External Email
> > > 
> > > ----------------------------------------------------------------------
> > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > returned will not be same as expected. So check for
> > > > returned unmap size and return error.
> > > > 
> > > > For case of DMA map/unmap triggered by heap allocations,
> > > > maintain granularity of memseg page size so that heap
> > > > expansion and contraction does not have this issue.
> > > 
> > > This is quite unfortunate, because there was a different bug that had to do
> > > with kernel having a very limited number of mappings available [1], as a
> > > result of which the page concatenation code was added.
> > > 
> > > It should therefore be documented that the dma_entry_limit parameter should
> > > be adjusted should the user run out of the DMA entries.
> > > 
> > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> 
> <snip>
> 
> > > >    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > >    					errno, strerror(errno));
> > > >    			return -1;
> > > > +		} else if (dma_unmap.size != len) {
> > > > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > +				"remapping cleared instead of %"PRIu64"\n",
> > > > +				(uint64_t)dma_unmap.size, len);
> > > > +			rte_errno = EIO;
> > > > +			return -1;
> > > >    		}
> > > >    	}
> > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > >    		/* we're partially unmapping a previously mapped region, so we
> > > >    		 * need to split entry into two.
> > > >    		 */
> > > > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > +			rte_errno = ENOTSUP;
> > > > +			ret = -1;
> > > > +			goto out;
> > > > +		}
> > > 
> > > How would we ever arrive here if we never do more than 1 page worth of
> > > memory anyway? I don't think this is needed.
> > 
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
> 
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?

In heap mappings we map and unmap at huge page granularity, as we will always
maintain that.

But here I think we don't know if the user's allocation is a huge page or a
collection of system pages. The only thing we can do here is map it at system
page granularity, which could waste entries if he really is working with
hugepages. Isn't it?
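
To put rough numbers on the entry-count concern (assuming the type1 default
dma_entry_limit of 65535, which is my reading of the kernel default, so treat
it as an assumption):

    1 GB mapped at 4 KB granularity: 1 GB / 4 KB = 262144 DMA entries
    1 GB mapped as one entry:        1 DMA entry

So a single 1 GB user mapping done at system page granularity would already
need roughly four times the default entry limit.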

> 
> -- 
> Thanks,
> Anatoly
  
Nithin Dabilpuram Oct. 15, 2020, 11:50 a.m. UTC | #5
On Thu, Oct 15, 2020 at 11:00:59AM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > External Email
> > > 
> > > ----------------------------------------------------------------------
> > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > returned will not be same as expected. So check for
> > > > returned unmap size and return error.
> > > > 
> > > > For case of DMA map/unmap triggered by heap allocations,
> > > > maintain granularity of memseg page size so that heap
> > > > expansion and contraction does not have this issue.
> > > 
> > > This is quite unfortunate, because there was a different bug that had to do
> > > with kernel having a very limited number of mappings available [1], as a
> > > result of which the page concatenation code was added.
> > > 
> > > It should therefore be documented that the dma_entry_limit parameter should
> > > be adjusted should the user run out of the DMA entries.
> > > 
> > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> 
> <snip>
> 
> > > >    			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > >    					errno, strerror(errno));
> > > >    			return -1;
> > > > +		} else if (dma_unmap.size != len) {
> > > > +			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > +				"remapping cleared instead of %"PRIu64"\n",
> > > > +				(uint64_t)dma_unmap.size, len);
> > > > +			rte_errno = EIO;
> > > > +			return -1;
> > > >    		}
> > > >    	}
> > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > >    		/* we're partially unmapping a previously mapped region, so we
> > > >    		 * need to split entry into two.
> > > >    		 */
> > > > +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > +			rte_errno = ENOTSUP;
> > > > +			ret = -1;
> > > > +			goto out;
> > > > +		}
> > > 
> > > How would we ever arrive here if we never do more than 1 page worth of
> > > memory anyway? I don't think this is needed.
> > 
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
> 
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?

In heap mappings we map and unmap at huge page granularity, as we will always
maintain that.

But here I think we don't know if the user's allocation is a huge page or a
collection of system pages. The only thing we can do here is map it at system
page granularity, which could waste entries if he really is working with
hugepages. Isn't it?

> 
> -- 
> Thanks,
> Anatoly
  
Nithin Dabilpuram Oct. 15, 2020, 11:57 a.m. UTC | #6
On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> >> External Email
> >>
> >> ----------------------------------------------------------------------
> >> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> >>> Partial unmapping is not supported for VFIO IOMMU type1
> >>> by kernel. Though kernel gives return as zero, the unmapped size
> >>> returned will not be same as expected. So check for
> >>> returned unmap size and return error.
> >>>
> >>> For case of DMA map/unmap triggered by heap allocations,
> >>> maintain granularity of memseg page size so that heap
> >>> expansion and contraction does not have this issue.
> >>
> >> This is quite unfortunate, because there was a different bug that had to do
> >> with kernel having a very limited number of mappings available [1], as a
> >> result of which the page concatenation code was added.
> >>
> >> It should therefore be documented that the dma_entry_limit parameter should
> >> be adjusted should the user run out of the DMA entries.
> >>
> >> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>
> <snip>
>
> >>>                     RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> >>>                                     errno, strerror(errno));
> >>>                     return -1;
> >>> +           } else if (dma_unmap.size != len) {
> >>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> >>> +                           "remapping cleared instead of %"PRIu64"\n",
> >>> +                           (uint64_t)dma_unmap.size, len);
> >>> +                   rte_errno = EIO;
> >>> +                   return -1;
> >>>             }
> >>>     }
> >>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> >>>             /* we're partially unmapping a previously mapped region, so we
> >>>              * need to split entry into two.
> >>>              */
> >>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> >>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> >>> +                   rte_errno = ENOTSUP;
> >>> +                   ret = -1;
> >>> +                   goto out;
> >>> +           }
> >>
> >> How would we ever arrive here if we never do more than 1 page worth of
> >> memory anyway? I don't think this is needed.
> >
> > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > and when he maps we don't split it as we don't about his memory.
> > So if he maps multiple pages and tries to unmap partially, then we should fail.
>
> Should we map it in page granularity then, instead of adding this
> discrepancy between EAL and user mapping? I.e. instead of adding a
> workaround, how about we just do the same thing for user mem mappings?
>
In heap mappings we map and unmap at huge page granularity, as we will always
maintain that.

But here I think we don't know if the user's allocation is a huge page or a
collection of system pages. The only thing we can do here is map it at system
page granularity, which could waste entries if he really is working with
hugepages. Isn't it?


>
> --
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 15, 2020, 3:10 p.m. UTC | #7
On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
>>
>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>> External Email
>>>>
>>>> ----------------------------------------------------------------------
>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>> returned will not be same as expected. So check for
>>>>> returned unmap size and return error.
>>>>>
>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>> maintain granularity of memseg page size so that heap
>>>>> expansion and contraction does not have this issue.
>>>>
>>>> This is quite unfortunate, because there was a different bug that had to do
>>>> with kernel having a very limited number of mappings available [1], as a
>>>> result of which the page concatenation code was added.
>>>>
>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>> be adjusted should the user run out of the DMA entries.
>>>>
>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>
>> <snip>
>>
>>>>>                      RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>                                      errno, strerror(errno));
>>>>>                      return -1;
>>>>> +           } else if (dma_unmap.size != len) {
>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>> +                   rte_errno = EIO;
>>>>> +                   return -1;
>>>>>              }
>>>>>      }
>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>              /* we're partially unmapping a previously mapped region, so we
>>>>>               * need to split entry into two.
>>>>>               */
>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>> +                   rte_errno = ENOTSUP;
>>>>> +                   ret = -1;
>>>>> +                   goto out;
>>>>> +           }
>>>>
>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>> memory anyway? I don't think this is needed.
>>>
>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>> and when he maps we don't split it as we don't about his memory.
>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>
>> Should we map it in page granularity then, instead of adding this
>> discrepancy between EAL and user mapping? I.e. instead of adding a
>> workaround, how about we just do the same thing for user mem mappings?
>>
> In heap mapping's we map and unmap it at huge page granularity as we will always
> maintain that.
> 
> But here I think we don't know if user's allocation is huge page or
> collection of system
> pages. Only thing we can do here is map it at system page granularity which
> could waste entries if he say really is working with hugepages. Isn't ?
> 

Yeah, we do. The API mandates page granularity, and it will check 
against the page size and the number of IOVA entries, so yes, we do 
enforce the fact that the IOVA addresses supplied by the user have to 
be page addresses.
  
Nithin Dabilpuram Oct. 16, 2020, 7:10 a.m. UTC | #8
On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > <anatoly.burakov@intel.com> wrote:
> > > 
> > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > External Email
> > > > > 
> > > > > ----------------------------------------------------------------------
> > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > returned will not be same as expected. So check for
> > > > > > returned unmap size and return error.
> > > > > > 
> > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > maintain granularity of memseg page size so that heap
> > > > > > expansion and contraction does not have this issue.
> > > > > 
> > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > result of which the page concatenation code was added.
> > > > > 
> > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > be adjusted should the user run out of the DMA entries.
> > > > > 
> > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > 
> > > <snip>
> > > 
> > > > > >                      RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > >                                      errno, strerror(errno));
> > > > > >                      return -1;
> > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > +                   rte_errno = EIO;
> > > > > > +                   return -1;
> > > > > >              }
> > > > > >      }
> > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > >              /* we're partially unmapping a previously mapped region, so we
> > > > > >               * need to split entry into two.
> > > > > >               */
> > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > +                   rte_errno = ENOTSUP;
> > > > > > +                   ret = -1;
> > > > > > +                   goto out;
> > > > > > +           }
> > > > > 
> > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > memory anyway? I don't think this is needed.
> > > > 
> > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > and when he maps we don't split it as we don't about his memory.
> > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > 
> > > Should we map it in page granularity then, instead of adding this
> > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > workaround, how about we just do the same thing for user mem mappings?
> > > 
> > In heap mapping's we map and unmap it at huge page granularity as we will always
> > maintain that.
> > 
> > But here I think we don't know if user's allocation is huge page or
> > collection of system
> > pages. Only thing we can do here is map it at system page granularity which
> > could waste entries if he say really is working with hugepages. Isn't ?
> > 
> 
> Yeah we do. The API mandates the pages granularity, and it will check
> against page size and number of IOVA entries, so yes, we do enforce the fact
> that the IOVA addresses supplied by the user have to be page addresses.

Looking at rte_vfio_container_dma_map(), there is no mention of the huge page
size the user is providing, nor do we compute it. He can call
rte_vfio_container_dma_map() with a 1GB huge page or a 4K system page.

Am I missing something?
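
For reference, the prototypes in question as I read them in rte_vfio.h
(quoted from memory, so treat the exact declarations as an assumption):

int rte_vfio_container_dma_map(int container_fd, uint64_t vaddr,
			       uint64_t iova, uint64_t len);
int rte_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
				 uint64_t iova, uint64_t len);

Neither takes a page size, so the API itself does not tell us what
granularity the caller's memory has.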
> 
> -- 
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 17, 2020, 4:14 p.m. UTC | #9
On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>> <anatoly.burakov@intel.com> wrote:
>>>>
>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>> External Email
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>> returned will not be same as expected. So check for
>>>>>>> returned unmap size and return error.
>>>>>>>
>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>> expansion and contraction does not have this issue.
>>>>>>
>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>> result of which the page concatenation code was added.
>>>>>>
>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>
>>>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>>>
>>>> <snip>
>>>>
>>>>>>>                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>                                       errno, strerror(errno));
>>>>>>>                       return -1;
>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>> +                   rte_errno = EIO;
>>>>>>> +                   return -1;
>>>>>>>               }
>>>>>>>       }
>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>               /* we're partially unmapping a previously mapped region, so we
>>>>>>>                * need to split entry into two.
>>>>>>>                */
>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>> +                   ret = -1;
>>>>>>> +                   goto out;
>>>>>>> +           }
>>>>>>
>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>> memory anyway? I don't think this is needed.
>>>>>
>>>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>>>> and when he maps we don't split it as we don't about his memory.
>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>
>>>> Should we map it in page granularity then, instead of adding this
>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>
>>> In heap mapping's we map and unmap it at huge page granularity as we will always
>>> maintain that.
>>>
>>> But here I think we don't know if user's allocation is huge page or
>>> collection of system
>>> pages. Only thing we can do here is map it at system page granularity which
>>> could waste entries if he say really is working with hugepages. Isn't ?
>>>
>>
>> Yeah we do. The API mandates the pages granularity, and it will check
>> against page size and number of IOVA entries, so yes, we do enforce the fact
>> that the IOVA addresses supplied by the user have to be page addresses.
> 
> If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> user is providing or we computing. He can call rte_vfio_container_dma_map()
> with 1GB huge page or 4K system page.
> 
> Am I missing something ?

Are you suggesting that a DMA mapping for hugepage-backed memory will be 
made at system page size granularity? E.g. will a 1GB page-backed 
segment be mapped for DMA as a contiguous 4K-based block?

>>
>> -- 
>> Thanks,
>> Anatoly
  
Nithin Dabilpuram Oct. 19, 2020, 9:43 a.m. UTC | #10
On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > <anatoly.burakov@intel.com> wrote:
> > > > > 
> > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > External Email
> > > > > > > 
> > > > > > > ----------------------------------------------------------------------
> > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > returned will not be same as expected. So check for
> > > > > > > > returned unmap size and return error.
> > > > > > > > 
> > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > expansion and contraction does not have this issue.
> > > > > > > 
> > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > result of which the page concatenation code was added.
> > > > > > > 
> > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > > >                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > >                                       errno, strerror(errno));
> > > > > > > >                       return -1;
> > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > +                   rte_errno = EIO;
> > > > > > > > +                   return -1;
> > > > > > > >               }
> > > > > > > >       }
> > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > >               /* we're partially unmapping a previously mapped region, so we
> > > > > > > >                * need to split entry into two.
> > > > > > > >                */
> > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > +                   ret = -1;
> > > > > > > > +                   goto out;
> > > > > > > > +           }
> > > > > > > 
> > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > memory anyway? I don't think this is needed.
> > > > > > 
> > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > and when he maps we don't split it as we don't about his memory.
> > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > 
> > > > > Should we map it in page granularity then, instead of adding this
> > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > 
> > > > In heap mapping's we map and unmap it at huge page granularity as we will always
> > > > maintain that.
> > > > 
> > > > But here I think we don't know if user's allocation is huge page or
> > > > collection of system
> > > > pages. Only thing we can do here is map it at system page granularity which
> > > > could waste entries if he say really is working with hugepages. Isn't ?
> > > > 
> > > 
> > > Yeah we do. The API mandates the pages granularity, and it will check
> > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > that the IOVA addresses supplied by the user have to be page addresses.
> > 
> > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > with 1GB huge page or 4K system page.
> > 
> > Am I missing something ?
> 
> Are you suggesting that a DMA mapping for hugepage-backed memory will be
> made at system page size granularity? E.g. will a 1GB page-backed segment be
> mapped for DMA as a contiguous 4K-based block?

I'm not suggesting anything. My only thought is how to solve the problem below.
Say the application does the following.

#1 Allocate 1GB of memory from a huge page or some external mem.
#2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
   In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB, as we
   don't know where this memory is coming from or what it is backed by.
#3 After a while, call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)

Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as
shown below, with the VFIO type1 IOMMU #3 cannot be supported by the current
kernel interface. So how can we allow #3?


static int
container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
                uint64_t len) 
{
        struct user_mem_map *map, *new_map = NULL;
        struct user_mem_maps *user_mem_maps;
        int ret = 0; 

        user_mem_maps = &vfio_cfg->mem_maps;
        rte_spinlock_recursive_lock(&user_mem_maps->lock);

        /* find our mapping */
        map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
        if (!map) {
                RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
                rte_errno = EINVAL;
                ret = -1;
                goto out; 
        }
        if (map->addr != vaddr || map->iova != iova || map->len != len) {
                /* we're partially unmapping a previously mapped region, so we
                 * need to split entry into two.
                 */
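
Rendered as code, a minimal sketch of steps #1-#3 above (the buffer, addresses
and helper are hypothetical; IOVA == VA for simplicity):

#include <stdint.h>
#include <rte_vfio.h>

static void
partial_unmap_scenario(void *mem) /* #1: 1 GB hugepage-backed or external buffer */
{
	uint64_t va = (uint64_t)(uintptr_t)mem;

	/* #2: mapped as a single 1 GB VFIO DMA entry, since EAL does not
	 * know what backs this memory.
	 */
	rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
				   va, va, 1ULL << 30);

	/* #3: a 4 KB unmap inside that entry. EAL can split its own
	 * user_mem_map entry, but the type1 kernel interface cannot split
	 * the single 1 GB mapping, so the request cannot really be honoured.
	 */
	rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
				     va + 4096, va + 4096, 4096);
}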


> 
> > > 
> > > -- 
> > > Thanks,
> > > Anatoly
> 
> 
> -- 
> Thanks,
> Anatoly
  
Nithin Dabilpuram Oct. 22, 2020, 12:13 p.m. UTC | #11
Ping.

On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > 
> > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > External Email
> > > > > > > > 
> > > > > > > > ----------------------------------------------------------------------
> > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > > returned will not be same as expected. So check for
> > > > > > > > > returned unmap size and return error.
> > > > > > > > > 
> > > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > > expansion and contraction does not have this issue.
> > > > > > > > 
> > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > result of which the page concatenation code was added.
> > > > > > > > 
> > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > 
> > > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > > 
> > > > > > <snip>
> > > > > > 
> > > > > > > > >                       RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > >                                       errno, strerror(errno));
> > > > > > > > >                       return -1;
> > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > +                   return -1;
> > > > > > > > >               }
> > > > > > > > >       }
> > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > >               /* we're partially unmapping a previously mapped region, so we
> > > > > > > > >                * need to split entry into two.
> > > > > > > > >                */
> > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > +                   ret = -1;
> > > > > > > > > +                   goto out;
> > > > > > > > > +           }
> > > > > > > > 
> > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > 
> > > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > > and when he maps we don't split it as we don't about his memory.
> > > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > 
> > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > 
> > > > > In heap mapping's we map and unmap it at huge page granularity as we will always
> > > > > maintain that.
> > > > > 
> > > > > But here I think we don't know if user's allocation is huge page or
> > > > > collection of system
> > > > > pages. Only thing we can do here is map it at system page granularity which
> > > > > could waste entries if he say really is working with hugepages. Isn't ?
> > > > > 
> > > > 
> > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > 
> > > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > > with 1GB huge page or 4K system page.
> > > 
> > > Am I missing something ?
> > 
> > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > mapped for DMA as a contiguous 4K-based block?
> 
> I'm not suggesting anything. My only thought is how to solve below problem.
> Say application does the following.
> 
> #1 Allocate 1GB memory from huge page or some external mem.
> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>    In linux/eal_vfio.c, we map it is as single VFIO DMA entry of 1 GB as we
>    don't know where this memory is coming from or backed by what.
> #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>  
> Though rte_vfio_container_dma_unmap() supports #3 by splitting entry as shown below,
> In VFIO type1 iommu, #3 cannot be supported by current kernel interface. So how
> can we allow #3 ?
> 
> 
> static int
> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>                 uint64_t len) 
> {
>         struct user_mem_map *map, *new_map = NULL;
>         struct user_mem_maps *user_mem_maps;
>         int ret = 0; 
> 
>         user_mem_maps = &vfio_cfg->mem_maps;
>         rte_spinlock_recursive_lock(&user_mem_maps->lock);
> 
>         /* find our mapping */
>         map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>         if (!map) {
>                 RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>                 rte_errno = EINVAL;
>                 ret = -1;
>                 goto out; 
>         }
>         if (map->addr != vaddr || map->iova != iova || map->len != len) {
>                 /* we're partially unmapping a previously mapped region, so we
>                  * need to split entry into two.
>                  */
> 
> 
> > 
> > > > 
> > > > -- 
> > > > Thanks,
> > > > Anatoly
> > 
> > 
> > -- 
> > Thanks,
> > Anatoly
  
Burakov, Anatoly Oct. 28, 2020, 1:04 p.m. UTC | #12
On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> Ping.
> 
> On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
>> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
>>> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
>>>> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>>>>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>>>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>
>>>>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>>>>> External Email
>>>>>>>>>
>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>>>>> returned will not be same as expected. So check for
>>>>>>>>>> returned unmap size and return error.
>>>>>>>>>>
>>>>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>>>>> expansion and contraction does not have this issue.
>>>>>>>>>
>>>>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>>>>> result of which the page concatenation code was added.
>>>>>>>>>
>>>>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>>>>
>>>>>>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>>>                        RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>>>>                                        errno, strerror(errno));
>>>>>>>>>>                        return -1;
>>>>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>>>>> +                   rte_errno = EIO;
>>>>>>>>>> +                   return -1;
>>>>>>>>>>                }
>>>>>>>>>>        }
>>>>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>>>>                /* we're partially unmapping a previously mapped region, so we
>>>>>>>>>>                 * need to split entry into two.
>>>>>>>>>>                 */
>>>>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>>>>> +                   ret = -1;
>>>>>>>>>> +                   goto out;
>>>>>>>>>> +           }
>>>>>>>>>
>>>>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>>>>> memory anyway? I don't think this is needed.
>>>>>>>>
>>>>>>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>>>>>>> and when he maps we don't split it as we don't about his memory.
>>>>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>>>>
>>>>>>> Should we map it in page granularity then, instead of adding this
>>>>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>>>>
>>>>>> In heap mappings we map and unmap it at huge page granularity as we will always
>>>>>> maintain that.
>>>>>>
>>>>>> But here I think we don't know if user's allocation is huge page or
>>>>>> collection of system
>>>>>> pages. Only thing we can do here is map it at system page granularity which
>>>>>> could waste entries if he, say, really is working with hugepages. Isn't it?
>>>>>>
>>>>>
>>>>> Yeah we do. The API mandates the pages granularity, and it will check
>>>>> against page size and number of IOVA entries, so yes, we do enforce the fact
>>>>> that the IOVA addresses supplied by the user have to be page addresses.
>>>>
>>>> If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
>>>> user is providing or we computing. He can call rte_vfio_container_dma_map()
>>>> with 1GB huge page or 4K system page.
>>>>
>>>> Am I missing something ?
>>>
>>> Are you suggesting that a DMA mapping for hugepage-backed memory will be
>>> made at system page size granularity? E.g. will a 1GB page-backed segment be
>>> mapped for DMA as a contiguous 4K-based block?
>>
>> I'm not suggesting anything. My only thought is how to solve below problem.
>> Say application does the following.
>>
>> #1 Allocate 1GB memory from huge page or some external mem.
>> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>>     In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
>>     don't know where this memory is coming from or backed by what.
>> #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>>   
>> Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
>> in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
>> can we allow #3?
>>
>>
>> static int
>> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>                  uint64_t len)
>> {
>>          struct user_mem_map *map, *new_map = NULL;
>>          struct user_mem_maps *user_mem_maps;
>>          int ret = 0;
>>
>>          user_mem_maps = &vfio_cfg->mem_maps;
>>          rte_spinlock_recursive_lock(&user_mem_maps->lock);
>>
>>          /* find our mapping */
>>          map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>>          if (!map) {
>>                  RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>>                  rte_errno = EINVAL;
>>                  ret = -1;
>>                  goto out;
>>          }
>>          if (map->addr != vaddr || map->iova != iova || map->len != len) {
>>                  /* we're partially unmapping a previously mapped region, so we
>>                   * need to split entry into two.
>>                   */

Hi,

Apologies, i was on vacation.

Yes, I can see the problem now. Does VFIO even support non-system page 
sizes? Like, if i allocated a 1GB page, would i be able to map *this 
page* for DMA, as opposed to first 4K of this page? I suspect that the 
mapping doesn't support page sizes other than the system page size.
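
For reference, the #1-#3 sequence being discussed maps onto the public container API roughly as sketched below. This is only an illustration: the 1 GB buffer, its backing memory and the IOVA == VA assumption are invented for the example and are not part of the patch.

#include <stdint.h>
#include <rte_vfio.h>

/* Sketch of the scenario: map a 1 GB region as one user DMA entry, then
 * try to unmap a single 4 KB page out of it. VFIO type1 cannot split the
 * original mapping, which the patch turns into an explicit error instead
 * of a silent partial failure.
 */
static int
partial_unmap_example(void *mem)	/* assumed: 1 GB buffer, IOVA == VA */
{
	uint64_t va = (uint64_t)(uintptr_t)mem;
	uint64_t len = 1ULL << 30;	/* 1 GB */
	uint64_t pg = 4096;		/* assumed 4 KB system page size */

	/* #2: one DMA entry covering the whole region */
	if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
				       va, va, len) < 0)
		return -1;

	/* #3: partial unmap of one page; with this patch, type1 now
	 * fails with rte_errno == ENOTSUP rather than appearing to work.
	 */
	if (rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
					 va + pg, va + pg, pg) < 0)
		return -1;

	return 0;
}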
  
Nithin Dabilpuram Oct. 28, 2020, 2:17 p.m. UTC | #13
On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
> On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> > Ping.
> > 
> > On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> > > On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > > > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > > > > returned will not be same as expected. So check for
> > > > > > > > > > > returned unmap size and return error.
> > > > > > > > > > > 
> > > > > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > > > > expansion and contraction does not have this issue.
> > > > > > > > > > 
> > > > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > > > result of which the page concatenation code was added.
> > > > > > > > > > 
> > > > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > > > 
> > > > > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > > > > 
> > > > > > > > <snip>
> > > > > > > > 
> > > > > > > > > > >                        RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > > > >                                        errno, strerror(errno));
> > > > > > > > > > >                        return -1;
> > > > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > > > +                   return -1;
> > > > > > > > > > >                }
> > > > > > > > > > >        }
> > > > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > > > >                /* we're partially unmapping a previously mapped region, so we
> > > > > > > > > > >                 * need to split entry into two.
> > > > > > > > > > >                 */
> > > > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > > > +                   ret = -1;
> > > > > > > > > > > +                   goto out;
> > > > > > > > > > > +           }
> > > > > > > > > > 
> > > > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > > > 
> > > > > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > > > > and when he maps we don't split it as we don't know about his memory.
> > > > > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > > > 
> > > > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > > > 
> > > > > > > In heap mappings we map and unmap it at huge page granularity as we will always
> > > > > > > maintain that.
> > > > > > > 
> > > > > > > But here I think we don't know if user's allocation is huge page or
> > > > > > > collection of system
> > > > > > > pages. Only thing we can do here is map it at system page granularity which
> > > > > > > could waste entries if he, say, really is working with hugepages. Isn't it?
> > > > > > > 
> > > > > > 
> > > > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > > > 
> > > > > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > > > > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > > > > with 1GB huge page or 4K system page.
> > > > > 
> > > > > Am I missing something ?
> > > > 
> > > > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > > > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > > > mapped for DMA as a contiguous 4K-based block?
> > > 
> > > I'm not suggesting anything. My only thought is how to solve below problem.
> > > Say application does the following.
> > > 
> > > #1 Allocate 1GB memory from huge page or some external mem.
> > > #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
> > >     In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
> > >     don't know where this memory is coming from or backed by what.
> > > #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
> > > Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
> > > in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
> > > can we allow #3?
> > > 
> > > 
> > > static int
> > > container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > >                  uint64_t len)
> > > {
> > >          struct user_mem_map *map, *new_map = NULL;
> > >          struct user_mem_maps *user_mem_maps;
> > >          int ret = 0;
> > > 
> > >          user_mem_maps = &vfio_cfg->mem_maps;
> > >          rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > > 
> > >          /* find our mapping */
> > >          map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
> > >          if (!map) {
> > >                  RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
> > >                  rte_errno = EINVAL;
> > >                  ret = -1;
> > >                  goto out;
> > >          }
> > >          if (map->addr != vaddr || map->iova != iova || map->len != len) {
> > >                  /* we're partially unmapping a previously mapped region, so we
> > >                   * need to split entry into two.
> > >                   */
> 
> Hi,
> 
> Apologies, i was on vacation.
> 
> Yes, I can see the problem now. Does VFIO even support non-system page
> sizes? Like, if i allocated a 1GB page, would i be able to map *this page*
> for DMA, as opposed to first 4K of this page? I suspect that the mapping
> doesn't support page sizes other than the system page size.

It does support mapping any multiple of system page size.
See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
./driver-api/vfio.rst doesn't mention any such restrictions even in its
example.

Also my test case is passing so that confirms the behavior.
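
As an illustration of that point (a sketch only, not taken from the patch): a 1 GB hugepage-backed buffer can be handed to type1 as a single entry through the standard VFIO_IOMMU_MAP_DMA ioctl, with any size that is a multiple of the system page size; the container fd and the buffer are assumed to be set up elsewhere.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map a 1 GB region as a single type1 DMA entry; on the kernel side
 * vfio_pin_map_dma() pins and maps the whole range.
 */
static int
map_one_gig(int container_fd, void *buf, uint64_t iova)
{
	struct vfio_iommu_type1_dma_map dma_map;

	memset(&dma_map, 0, sizeof(dma_map));
	dma_map.argsz = sizeof(dma_map);
	dma_map.vaddr = (uint64_t)(uintptr_t)buf;
	dma_map.iova = iova;
	dma_map.size = 1ULL << 30;	/* any multiple of the system page size */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}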


> 
> -- 
> Thanks,
> Anatoly
  
Burakov, Anatoly Oct. 28, 2020, 4:07 p.m. UTC | #14
On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
>> On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
>>> Ping.
>>>
>>> On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
>>>> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
>>>>> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
>>>>>> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>>>>>>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>>>>>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
>>>>>>>> <anatoly.burakov@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>>>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>>>>>>> by kernel. Though kernel gives return as zero, the unmapped size
>>>>>>>>>>>> returned will not be same as expected. So check for
>>>>>>>>>>>> returned unmap size and return error.
>>>>>>>>>>>>
>>>>>>>>>>>> For case of DMA map/unmap triggered by heap allocations,
>>>>>>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>>>>>>> expansion and contraction does not have this issue.
>>>>>>>>>>>
>>>>>>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>>>>>>> with kernel having a very limited number of mappings available [1], as a
>>>>>>>>>>> result of which the page concatenation code was added.
>>>>>>>>>>>
>>>>>>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>>>>>>> be adjusted should the user run out of the DMA entries.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
>>>>>>>>>
>>>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>>>>>                         RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>>>>>>                                         errno, strerror(errno));
>>>>>>>>>>>>                         return -1;
>>>>>>>>>>>> +           } else if (dma_unmap.size != len) {
>>>>>>>>>>>> +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
>>>>>>>>>>>> +                           "remapping cleared instead of %"PRIu64"\n",
>>>>>>>>>>>> +                           (uint64_t)dma_unmap.size, len);
>>>>>>>>>>>> +                   rte_errno = EIO;
>>>>>>>>>>>> +                   return -1;
>>>>>>>>>>>>                 }
>>>>>>>>>>>>         }
>>>>>>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>>>>>>                 /* we're partially unmapping a previously mapped region, so we
>>>>>>>>>>>>                  * need to split entry into two.
>>>>>>>>>>>>                  */
>>>>>>>>>>>> +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>>>>>>> +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>>>>>>> +                   rte_errno = ENOTSUP;
>>>>>>>>>>>> +                   ret = -1;
>>>>>>>>>>>> +                   goto out;
>>>>>>>>>>>> +           }
>>>>>>>>>>>
>>>>>>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>>>>>>> memory anyway? I don't think this is needed.
>>>>>>>>>>
>>>>>>>>>> container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
>>>>>>>>>> and when he maps we don't split it as we don't know about his memory.
>>>>>>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>>>>>>
>>>>>>>>> Should we map it in page granularity then, instead of adding this
>>>>>>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>>>>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>>>>>>
>>>>>>>> In heap mappings we map and unmap it at huge page granularity as we will always
>>>>>>>> maintain that.
>>>>>>>>
>>>>>>>> But here I think we don't know if user's allocation is huge page or
>>>>>>>> collection of system
>>>>>>>> pages. Only thing we can do here is map it at system page granularity which
>>>>>>>> could waste entries if he, say, really is working with hugepages. Isn't it?
>>>>>>>>
>>>>>>>
>>>>>>> Yeah we do. The API mandates the pages granularity, and it will check
>>>>>>> against page size and number of IOVA entries, so yes, we do enforce the fact
>>>>>>> that the IOVA addresses supplied by the user have to be page addresses.
>>>>>>
>>>>>> If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
>>>>>> user is providing or we computing. He can call rte_vfio_container_dma_map()
>>>>>> with 1GB huge page or 4K system page.
>>>>>>
>>>>>> Am I missing something ?
>>>>>
>>>>> Are you suggesting that a DMA mapping for hugepage-backed memory will be
>>>>> made at system page size granularity? E.g. will a 1GB page-backed segment be
>>>>> mapped for DMA as a contiguous 4K-based block?
>>>>
>>>> I'm not suggesting anything. My only thought is how to solve below problem.
>>>> Say application does the following.
>>>>
>>>> #1 Allocate 1GB memory from huge page or some external mem.
>>>> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
>>>>      In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
>>>>      don't know where this memory is coming from or backed by what.
>>>> #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
>>>> Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
>>>> in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
>>>> can we allow #3?
>>>>
>>>>
>>>> static int
>>>> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>                   uint64_t len)
>>>> {
>>>>           struct user_mem_map *map, *new_map = NULL;
>>>>           struct user_mem_maps *user_mem_maps;
>>>>           int ret = 0;
>>>>
>>>>           user_mem_maps = &vfio_cfg->mem_maps;
>>>>           rte_spinlock_recursive_lock(&user_mem_maps->lock);
>>>>
>>>>           /* find our mapping */
>>>>           map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>>>>           if (!map) {
>>>>                   RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>>>>                   rte_errno = EINVAL;
>>>>                   ret = -1;
>>>>                   goto out;
>>>>           }
>>>>           if (map->addr != vaddr || map->iova != iova || map->len != len) {
>>>>                   /* we're partially unmapping a previously mapped region, so we
>>>>                    * need to split entry into two.
>>>>                    */
>>
>> Hi,
>>
>> Apologies, i was on vacation.
>>
>> Yes, I can see the problem now. Does VFIO even support non-system page
>> sizes? Like, if i allocated a 1GB page, would i be able to map *this page*
>> for DMA, as opposed to first 4K of this page? I suspect that the mapping
>> doesn't support page sizes other than the system page size.
> 
> It does support mapping any multiple of system page size.
> See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
> ./driver-api/vfio.rst doesn't mention any such restrictions even in its
> example.
> 
> Also my test case is passing so that confirms the behavior.

Can we perhaps make it so that the API mandates mapping/unmapping the 
same chunks? That would be the easiest solution here.

> 
> 
>>
>> -- 
>> Thanks,
>> Anatoly
  
Nithin Dabilpuram Oct. 28, 2020, 4:31 p.m. UTC | #15
On Wed, Oct 28, 2020 at 04:07:17PM +0000, Burakov, Anatoly wrote:
> On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> > On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
> > > On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
> > > > Ping.
> > > > 
> > > > On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
> > > > > On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
> > > > > > On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
> > > > > > > On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
> > > > > > > > On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
> > > > > > > > > On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly
> > > > > > > > > <anatoly.burakov@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
> > > > > > > > > > > > On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
> > > > > > > > > > > > > Partial unmapping is not supported for VFIO IOMMU type1
> > > > > > > > > > > > > by kernel. Though kernel gives return as zero, the unmapped size
> > > > > > > > > > > > > returned will not be same as expected. So check for
> > > > > > > > > > > > > returned unmap size and return error.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For case of DMA map/unmap triggered by heap allocations,
> > > > > > > > > > > > > maintain granularity of memseg page size so that heap
> > > > > > > > > > > > > expansion and contraction does not have this issue.
> > > > > > > > > > > > 
> > > > > > > > > > > > This is quite unfortunate, because there was a different bug that had to do
> > > > > > > > > > > > with kernel having a very limited number of mappings available [1], as a
> > > > > > > > > > > > result of which the page concatenation code was added.
> > > > > > > > > > > > 
> > > > > > > > > > > > It should therefore be documented that the dma_entry_limit parameter should
> > > > > > > > > > > > be adjusted should the user run out of the DMA entries.
> > > > > > > > > > > > 
> > > > > > > > > > > > [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
> > > > > > > > > > 
> > > > > > > > > > <snip>
> > > > > > > > > > 
> > > > > > > > > > > > >                         RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
> > > > > > > > > > > > >                                         errno, strerror(errno));
> > > > > > > > > > > > >                         return -1;
> > > > > > > > > > > > > +           } else if (dma_unmap.size != len) {
> > > > > > > > > > > > > +                   RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
> > > > > > > > > > > > > +                           "remapping cleared instead of %"PRIu64"\n",
> > > > > > > > > > > > > +                           (uint64_t)dma_unmap.size, len);
> > > > > > > > > > > > > +                   rte_errno = EIO;
> > > > > > > > > > > > > +                   return -1;
> > > > > > > > > > > > >                 }
> > > > > > > > > > > > >         }
> > > > > > > > > > > > > @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > > > > > > > > > >                 /* we're partially unmapping a previously mapped region, so we
> > > > > > > > > > > > >                  * need to split entry into two.
> > > > > > > > > > > > >                  */
> > > > > > > > > > > > > +           if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
> > > > > > > > > > > > > +                   RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
> > > > > > > > > > > > > +                   rte_errno = ENOTSUP;
> > > > > > > > > > > > > +                   ret = -1;
> > > > > > > > > > > > > +                   goto out;
> > > > > > > > > > > > > +           }
> > > > > > > > > > > > 
> > > > > > > > > > > > How would we ever arrive here if we never do more than 1 page worth of
> > > > > > > > > > > > memory anyway? I don't think this is needed.
> > > > > > > > > > > 
> > > > > > > > > > > container_dma_unmap() is called by user via rte_vfio_container_dma_unmap()
> > > > > > > > > > > and when he maps we don't split it as we don't know about his memory.
> > > > > > > > > > > So if he maps multiple pages and tries to unmap partially, then we should fail.
> > > > > > > > > > 
> > > > > > > > > > Should we map it in page granularity then, instead of adding this
> > > > > > > > > > discrepancy between EAL and user mapping? I.e. instead of adding a
> > > > > > > > > > workaround, how about we just do the same thing for user mem mappings?
> > > > > > > > > > 
> > > > > > > > > In heap mappings we map and unmap it at huge page granularity as we will always
> > > > > > > > > maintain that.
> > > > > > > > > 
> > > > > > > > > But here I think we don't know if user's allocation is huge page or
> > > > > > > > > collection of system
> > > > > > > > > pages. Only thing we can do here is map it at system page granularity which
> > > > > > > > > could waste entries if he, say, really is working with hugepages. Isn't it?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yeah we do. The API mandates the pages granularity, and it will check
> > > > > > > > against page size and number of IOVA entries, so yes, we do enforce the fact
> > > > > > > > that the IOVA addresses supplied by the user have to be page addresses.
> > > > > > > 
> > > > > > > If I see rte_vfio_container_dma_map(), there is no mention of Huge page size
> > > > > > > user is providing or we computing. He can call rte_vfio_container_dma_map()
> > > > > > > with 1GB huge page or 4K system page.
> > > > > > > 
> > > > > > > Am I missing something ?
> > > > > > 
> > > > > > Are you suggesting that a DMA mapping for hugepage-backed memory will be
> > > > > > made at system page size granularity? E.g. will a 1GB page-backed segment be
> > > > > > mapped for DMA as a contiguous 4K-based block?
> > > > > 
> > > > > I'm not suggesting anything. My only thought is how to solve below problem.
> > > > > Say application does the following.
> > > > > 
> > > > > #1 Allocate 1GB memory from huge page or some external mem.
> > > > > #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB)
> > > > >      In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB as we
> > > > >      don't know where this memory is coming from or backed by what.
> > > > > #3 After a while call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, mem+4KB, mem+4KB, 4KB)
> > > > > Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as shown below,
> > > > > in VFIO type1 IOMMU, #3 cannot be supported by the current kernel interface. So how
> > > > > can we allow #3?
> > > > > 
> > > > > 
> > > > > static int
> > > > > container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
> > > > >                   uint64_t len)
> > > > > {
> > > > >           struct user_mem_map *map, *new_map = NULL;
> > > > >           struct user_mem_maps *user_mem_maps;
> > > > >           int ret = 0;
> > > > > 
> > > > >           user_mem_maps = &vfio_cfg->mem_maps;
> > > > >           rte_spinlock_recursive_lock(&user_mem_maps->lock);
> > > > > 
> > > > >           /* find our mapping */
> > > > >           map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
> > > > >           if (!map) {
> > > > >                   RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
> > > > >                   rte_errno = EINVAL;
> > > > >                   ret = -1;
> > > > >                   goto out;
> > > > >           }
> > > > >           if (map->addr != vaddr || map->iova != iova || map->len != len) {
> > > > >                   /* we're partially unmapping a previously mapped region, so we
> > > > >                    * need to split entry into two.
> > > > >                    */
> > > 
> > > Hi,
> > > 
> > > Apologies, i was on vacation.
> > > 
> > > Yes, I can see the problem now. Does VFIO even support non-system page
> > > sizes? Like, if i allocated a 1GB page, would i be able to map *this page*
> > > for DMA, as opposed to first 4K of this page? I suspect that the mapping
> > > doesn't support page sizes other than the system page size.
> > 
> > It does support mapping any multiple of system page size.
> > See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
> > ./driver-api/vfio.rst doesn't mention any such restrictions even in its
> > example.
> > 
> > Also my test case is passing so that confirms the behavior.
> 
> Can we perhaps make it so that the API mandates mapping/unmapping the same
> chunks? That would be the easiest solution here.

Ack, I was already doing that for type1 IOMMU with my above patch.
I didn't change the behavior for sPAPR or no-iommu mode.
> 
> > 
> > 
> > > 
> > > -- 
> > > Thanks,
> > > Anatoly
> 
> 
> -- 
> Thanks,
> Anatoly
  

Patch

diff --git a/lib/librte_eal/linux/eal_vfio.c b/lib/librte_eal/linux/eal_vfio.c
index d26e164..ef95259 100644
--- a/lib/librte_eal/linux/eal_vfio.c
+++ b/lib/librte_eal/linux/eal_vfio.c
@@ -69,6 +69,7 @@  static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_TYPE1,
 		.name = "Type 1",
+		.partial_unmap = false,
 		.dma_map_func = &vfio_type1_dma_map,
 		.dma_user_map_func = &vfio_type1_dma_mem_map
 	},
@@ -76,6 +77,7 @@  static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_SPAPR,
 		.name = "sPAPR",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_spapr_dma_map,
 		.dma_user_map_func = &vfio_spapr_dma_mem_map
 	},
@@ -83,6 +85,7 @@  static const struct vfio_iommu_type iommu_types[] = {
 	{
 		.type_id = RTE_VFIO_NOIOMMU,
 		.name = "No-IOMMU",
+		.partial_unmap = true,
 		.dma_map_func = &vfio_noiommu_dma_map,
 		.dma_user_map_func = &vfio_noiommu_dma_mem_map
 	},
@@ -525,12 +528,19 @@  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
 	/* for IOVA as VA mode, no need to care for IOVA addresses */
 	if (rte_eal_iova_mode() == RTE_IOVA_VA && msl->external == 0) {
 		uint64_t vfio_va = (uint64_t)(uintptr_t)addr;
-		if (type == RTE_MEM_EVENT_ALLOC)
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 1);
-		else
-			vfio_dma_mem_map(default_vfio_cfg, vfio_va, vfio_va,
-					len, 0);
+		uint64_t page_sz = msl->page_sz;
+
+		/* Maintain granularity of DMA map/unmap to memseg size */
+		for (; cur_len < len; cur_len += page_sz) {
+			if (type == RTE_MEM_EVENT_ALLOC)
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 1);
+			else
+				vfio_dma_mem_map(default_vfio_cfg, vfio_va,
+						 vfio_va, page_sz, 0);
+			vfio_va += page_sz;
+		}
+
 		return;
 	}
 
@@ -1383,6 +1393,12 @@  vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 			RTE_LOG(ERR, EAL, "  cannot clear DMA remapping, error %i (%s)\n",
 					errno, strerror(errno));
 			return -1;
+		} else if (dma_unmap.size != len) {
+			RTE_LOG(ERR, EAL, "  unexpected size %"PRIu64" of DMA "
+				"remapping cleared instead of %"PRIu64"\n",
+				(uint64_t)dma_unmap.size, len);
+			rte_errno = EIO;
+			return -1;
 		}
 	}
 
@@ -1853,6 +1869,12 @@  container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 		/* we're partially unmapping a previously mapped region, so we
 		 * need to split entry into two.
 		 */
+		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
+			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
+			rte_errno = ENOTSUP;
+			ret = -1;
+			goto out;
+		}
 		if (user_mem_maps->n_maps == VFIO_MAX_USER_MEM_MAPS) {
 			RTE_LOG(ERR, EAL, "Not enough space to store partial mapping\n");
 			rte_errno = ENOMEM;
diff --git a/lib/librte_eal/linux/eal_vfio.h b/lib/librte_eal/linux/eal_vfio.h
index cb2d35f..6ebaca6 100644
--- a/lib/librte_eal/linux/eal_vfio.h
+++ b/lib/librte_eal/linux/eal_vfio.h
@@ -113,6 +113,7 @@  typedef int (*vfio_dma_user_func_t)(int fd, uint64_t vaddr, uint64_t iova,
 struct vfio_iommu_type {
 	int type_id;
 	const char *name;
+	bool partial_unmap;
 	vfio_dma_user_func_t dma_user_map_func;
 	vfio_dma_func_t dma_map_func;
 };