[v2] windows/virt2phys: fix block MDL not updated

Message ID: 20230911130936.1485584-1-ming3.li@intel.com (mailing list archive)
State: Superseded, archived
Delegated to: Thomas Monjalon

Checks

Context                     Check    Description
ci/checkpatch               success  coding style OK
ci/Intel-compilation        warning  apply issues
ci/iol-testing              warning  apply patch failure
ci/loongarch-compilation    warning  apply patch failure

Commit Message

Ric Li Sept. 11, 2023, 1:09 p.m. UTC
  The virt2phys_translate function previously scanned existing blocks,
returning the physical address from the stored MDL info if present.
This method was problematic when a virtual address pointed to a freed
and reallocated memory segment, potentially changing the physical
address mapping. Yet, virt2phys_translate would consistently return
the originally stored physical address, which could be invalid.

This issue surfaced when allocating a memory region larger than 2MB
using rte_malloc. This action would allocate a new memory segment
and use virt2phy to set the iova. The driver would store the MDL
and lock the pages initially. When this region was freed, the memory
segment used as a whole page could be freed, invalidating the virtual
to physical mapping. It then needed to be deleted from the driver's
block MDL info. Before this fix, the driver would only return the
initial physical address, leading to illegal iova for some pages when
allocating a new memory region of the same size.

To address this, a refresh function has been added. If a block with the
same base address is detected in the driver's context, the MDL's
physical address is compared with the real physical address.
If they don't match, the MDL within the block is released and rebuilt
to store the correct mapping.

Also fix the printing of PVOID type.

Signed-off-by: Ric Li <ming3.li@intel.com>
---
 windows/virt2phys/virt2phys.c       |  7 ++--
 windows/virt2phys/virt2phys_logic.c | 54 ++++++++++++++++++++++++++---
 2 files changed, 54 insertions(+), 7 deletions(-)
  

Comments

Dmitry Kozlyuk Sept. 11, 2023, 9:50 p.m. UTC | #1
Hi Ric,

2023-09-11 21:09 (UTC+0800), Ric Li:
> The virt2phys_translate function previously scanned existing blocks,
> returning the physical address from the stored MDL info if present.
> This method was problematic when a virtual address pointed to a freed
> and reallocated memory segment, potentially changing the physical
> address mapping. Yet, virt2phys_translate would consistently return
> the originally stored physical address, which could be invalid.

I missed this case completely :(

Are any of these bugs related?
If so, please mention "Bugzilla ID: xxxx" in the commit message.

https://bugs.dpdk.org/show_bug.cgi?id=1201
https://bugs.dpdk.org/show_bug.cgi?id=1213

> 
> This issue surfaced when allocating a memory region larger than 2MB
> using rte_malloc. This action would allocate a new memory segment
> and use virt2phy to set the iova. The driver would store the MDL

Typo: "iova" -> "IOVA" here and below.

> and lock the pages initially. When this region was freed, the memory
> segment used as a whole page could be freed, invalidating the virtual
> to physical mapping. It then needed to be deleted from the driver's
> block MDL info. Before this fix, the driver would only return the
> initial physical address, leading to illegal iova for some pages when
> allocating a new memory region of the same size.
> 
> To address this, a refresh function has been added. If a block with the
> same base address is detected in the driver's context, the MDL's
> physical address is compared with the real physical address.
> If they don't match, the MDL within the block is released and rebuilt
> to store the correct mapping.

What if the size is different?
Should it be updated for the refreshed block along with the MDL?

[...]
> +static NTSTATUS
> +virt2phys_block_refresh(struct virt2phys_block *block, PVOID base, size_t size)
> +{
> +	NTSTATUS status;
> +	PMDL mdl = block->mdl;
> +
> +	/*
> +	 * Check if we need to refresh MDL in block.
> +	 * The virtual to physical memory mapping may be changed when the
> +	 * virtual memory region is freed by the user process and malloc again,
> +	 * then we need to unlock the physical memory and lock again to
> +	 * refresh the MDL information stored in block.
> +	 */
> +	PHYSICAL_ADDRESS block_phys, real_phys;
> +
> +	block_phys = virt2phys_block_translate(block, base);
> +	real_phys = MmGetPhysicalAddress(base);
> +
> +	if (block_phys.QuadPart == real_phys.QuadPart)
> +		/* No need to refresh block. */
> +		return STATUS_SUCCESS;
> +
> +	virt2phys_unlock_memory(mdl);

After this call the block's MDL is a dangling pointer.
If an error occurs below, the block with a dangling pointer
will remain in the list and will probably cause a crash later.
If a block can't be refreshed, it must be freed (it's invalid anyway).
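
The ordering constraint described above can be sketched in user space. This is
only a model, not driver code: `fake_mdl`, `fake_block`, `fake_lock()` and
`fake_unlock()` are hypothetical stand-ins for the MDL, the block,
virt2phys_lock_memory() and virt2phys_unlock_memory(). It assumes the block's
pointer is cleared as soon as the old mapping is released, so a failed re-lock
leaves NULL rather than a freed pointer, and the failure is reported so that
the caller can drop the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an MDL: it caches one physical address. */
struct fake_mdl {
	uint64_t phys;
};

/* Hypothetical stand-in for struct virt2phys_block. */
struct fake_block {
	struct fake_mdl *mdl;
};

/* Stand-in for virt2phys_lock_memory(); fails on demand via `fail`. */
static int
fake_lock(uint64_t real_phys, bool fail, struct fake_mdl **mdl)
{
	if (fail)
		return -1;
	*mdl = malloc(sizeof(**mdl));
	if (*mdl == NULL)
		return -1;
	(*mdl)->phys = real_phys;
	return 0;
}

/* Stand-in for virt2phys_unlock_memory(). */
static void
fake_unlock(struct fake_mdl *mdl)
{
	free(mdl);
}

/*
 * Refresh the cached mapping if it no longer matches `real_phys`.
 * On failure the block holds NULL, never a freed pointer, and the
 * caller is expected to remove and free the block.
 */
static int
fake_block_refresh(struct fake_block *block, uint64_t real_phys, bool fail)
{
	struct fake_mdl *mdl = NULL;

	assert(block != NULL && block->mdl != NULL);
	if (block->mdl->phys == real_phys)
		return 0; /* Mapping is still valid. */

	fake_unlock(block->mdl);
	block->mdl = NULL; /* Do not keep a dangling pointer. */

	if (fake_lock(real_phys, fail, &mdl) != 0)
		return -1;
	block->mdl = mdl;
	return 0;
}
```

A caller that receives a nonzero result would unlink and free the block
instead of leaving it in the list.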

> +	mdl = NULL;
> +
> +	status = virt2phys_lock_memory(base, size, &mdl);
> +	if (!NT_SUCCESS(status))
> +		return status;
> +
> +	status = virt2phys_check_memory(mdl);
> +	if (!NT_SUCCESS(status)) {
> +		virt2phys_unlock_memory(mdl);
> +		return status;
> +	}
> +	block->mdl = mdl;
> +
> +	TraceInfo("Block refreshed from %llx to %llx", block_phys.QuadPart, real_phys.QuadPart);

Please add process ID, block VA, and block size.
If refreshing fails, there is no way to tell what happened and why.
What do you think about logging like this?

	ID=... VA=... size=... requires physical address refresh
	ID=... VA=... size=... physical address refreshed from ... to ...

> +
> +	return STATUS_SUCCESS;
> +}
> +
>  NTSTATUS
>  virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
>  {
>  	PMDL mdl;
>  	HANDLE process_id;
> -	void *base;
> +	PVOID base;
>  	size_t size;
>  	struct virt2phys_process *process;
>  	struct virt2phys_block *block;
> @@ -371,6 +412,11 @@ virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
>  
>  	/* Don't lock the same memory twice. */
>  	if (block != NULL) {
> +		KeAcquireSpinLock(g_lock, &irql);
> +		status = virt2phys_block_refresh(block, base, size);
> +		KeReleaseSpinLock(g_lock, irql);

Is it safe to do all the external calls holding this spinlock?
I can't confirm from the doc that ZwQueryVirtualMemory(), for example,
does not access pageable data.
And virt2phys_lock_memory() raises exceptions, although it handles them.
Other stuff seems safe.

The rest of the code only takes the lock to access block and process lists,
which are allocated from the non-paged pool.
Now that I think of it, this may be insufficient because the code and the
static variables are not marked as non-paged.

The translation IOCTL performance is not critical,
so maybe it is worth replacing the spinlock with just a global mutex,
WDYT?
  
Ric Li Sept. 12, 2023, 11:13 a.m. UTC | #2
Hi Dmitry,

Thanks for the review, I'll send the next version patch.
Please see my comments below.

> -----Original Message-----
> From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Sent: Tuesday, September 12, 2023 5:51 AM
> To: Li, Ming3 <ming3.li@intel.com>
> Cc: dev@dpdk.org; Tyler Retzlaff <roretzla@linux.microsoft.com>
> Subject: Re: [PATCH v2] windows/virt2phys: fix block MDL not updated
> 
> Hi Ric,
> 
> 2023-09-11 21:09 (UTC+0800), Ric Li:
> > The virt2phys_translate function previously scanned existing blocks,
> > returning the physical address from the stored MDL info if present.
> > This method was problematic when a virtual address pointed to a freed
> > and reallocated memory segment, potentially changing the physical
> > address mapping. Yet, virt2phys_translate would consistently return
> > the originally stored physical address, which could be invalid.
> 
> I missed this case completely :(
> 
> Are any of these bugs related?
> If so, please mention "Bugzilla ID: xxxx" in the commit message.
> 
> https://bugs.dpdk.org/show_bug.cgi?id=1201
> https://bugs.dpdk.org/show_bug.cgi?id=1213
> 

Sure, will do.

I cannot reproduce them in my environment, but judging from the reports,
both mention that some pages were not unlocked after exit, so they may be related.

For example, Bug 1201 only occurs on Windows 2019. Could an OS limitation cause
some memory segment to be freed and then allocated again at the same virtual address?
Maybe someone can use this patch to check whether TraceView logs show any 'refresh' behavior.

> >
> > This issue surfaced when allocating a memory region larger than 2MB
> > using rte_malloc. This action would allocate a new memory segment and
> > use virt2phy to set the iova. The driver would store the MDL
> 
> Typo: "iova" -> "IOVA" here and below.
> 

Noted, will fix in v3.

> > and lock the pages initially. When this region was freed, the memory
> > segment used as a whole page could be freed, invalidating the virtual
> > to physical mapping. It then needed to be deleted from the driver's
> > block MDL info. Before this fix, the driver would only return the
> > initial physical address, leading to illegal iova for some pages when
> > allocating a new memory region of the same size.
> >
> > To address this, a refresh function has been added. If a block with
> > the same base address is detected in the driver's context, the MDL's
> > physical address is compared with the real physical address.
> > If they don't match, the MDL within the block is released and rebuilt
> > to store the correct mapping.
> 
> What if the size is different?
> Should it be updated for the refreshed block along with the MDL?
> 

The size of a single MDL is always 2MB since it describes a hugepage here
(at least from my observation :)). A buffer larger than 2MB consists of
several mem segs (each with its own MDL). The buggy mem segs are mostly those
that occupy a whole hugepage: these segments are freed along with the buffer,
so their MDLs become invalid.

Since the block is just a wrapper for the MDL and a list entry,
the refresh action should be applied to the whole block.

> [...]
> > +static NTSTATUS
> > +virt2phys_block_refresh(struct virt2phys_block *block, PVOID base, size_t size)
> > +{
> > +	NTSTATUS status;
> > +	PMDL mdl = block->mdl;
> > +
> > +	/*
> > +	 * Check if we need to refresh MDL in block.
> > +	 * The virtual to physical memory mapping may be changed when the
> > +	 * virtual memory region is freed by the user process and malloc again,
> > +	 * then we need to unlock the physical memory and lock again to
> > +	 * refresh the MDL information stored in block.
> > +	 */
> > +	PHYSICAL_ADDRESS block_phys, real_phys;
> > +
> > +	block_phys = virt2phys_block_translate(block, base);
> > +	real_phys = MmGetPhysicalAddress(base);
> > +
> > +	if (block_phys.QuadPart == real_phys.QuadPart)
> > +		/* No need to refresh block. */
> > +		return STATUS_SUCCESS;
> > +
> > +	virt2phys_unlock_memory(mdl);
> 
> After this call the block's MDL is a dangling pointer.
> If an error occurs below, the block with a dangling pointer will remain in the list
> and will probably cause a crash later.
> If a block can't be refreshed, it must be freed (it's invalid anyway).
> 

I will change the refresh logic here to only check the PA. If it doesn't match,
the block will be removed from the process's block list (after the check function).
To make block removal easy, the singly linked list will be replaced with
a doubly linked list.
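
A minimal user-space sketch of this check-and-remove step, assuming a circular
doubly linked list shaped like the kernel's LIST_ENTRY. All names here
(`list_node`, `demo_block`, `demo_block_check()`) are hypothetical; the driver
itself would presumably use LIST_ENTRY with RemoveEntryList() and compare
against MmGetPhysicalAddress().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space analogue of LIST_ENTRY (next ~ Flink, prev ~ Blink). */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Hypothetical block: list linkage plus the cached physical address. */
struct demo_block {
	struct list_node node;
	uint64_t cached_phys;
};

static void
list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void
list_insert_tail(struct list_node *head, struct list_node *node)
{
	node->next = head;
	node->prev = head->prev;
	head->prev->next = node;
	head->prev = node;
}

/* O(1) unlink: no predecessor search, unlike a singly linked list. */
static void
list_remove(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node;
	node->prev = node;
}

/*
 * Return true if the cached address is still valid. Otherwise unlink
 * the block and return false; the caller then frees the block and
 * builds a new one for the current mapping.
 */
static bool
demo_block_check(struct demo_block *block, uint64_t real_phys)
{
	assert(block != NULL);
	if (block->cached_phys == real_phys)
		return true;
	list_remove(&block->node);
	return false;
}
```

Only the unlink needs to happen under the list lock in this model, and it
cannot fail.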

> > +	mdl = NULL;
> > +
> > +	status = virt2phys_lock_memory(base, size, &mdl);
> > +	if (!NT_SUCCESS(status))
> > +		return status;
> > +
> > +	status = virt2phys_check_memory(mdl);
> > +	if (!NT_SUCCESS(status)) {
> > +		virt2phys_unlock_memory(mdl);
> > +		return status;
> > +	}
> > +	block->mdl = mdl;
> > +
> > +	TraceInfo("Block refreshed from %llx to %llx", block_phys.QuadPart, real_phys.QuadPart);
> 
> Please add process ID, block VA, and block size.
> If refreshing fails, there is no way to tell what happened and why.
> What do you think about logging like this?
> 
> 	ID=... VA=... size=... requires physical address refresh
> 	ID=... VA=... size=... physical address refreshed from ... to ...
> 
> > +
> > +	return STATUS_SUCCESS;
> > +}
> > +
> >  NTSTATUS
> >  virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
> >  {
> >  	PMDL mdl;
> >  	HANDLE process_id;
> > -	void *base;
> > +	PVOID base;
> >  	size_t size;
> >  	struct virt2phys_process *process;
> >  	struct virt2phys_block *block;
> > @@ -371,6 +412,11 @@ virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
> >
> >  	/* Don't lock the same memory twice. */
> >  	if (block != NULL) {
> > +		KeAcquireSpinLock(g_lock, &irql);
> > +		status = virt2phys_block_refresh(block, base, size);
> > +		KeReleaseSpinLock(g_lock, irql);
> 
> Is it safe to do all the external calls holding this spinlock?
> I can't confirm from the doc that ZwQueryVirtualMemory(), for example, does
> not access pageable data.
> And virt2phys_lock_memory() raises exceptions, although it handles them.
> Other stuff seems safe.
> 
> The rest of the code only takes the lock to access block and process lists, which
> are allocated from the non-paged pool.
> Now that I think of it, this may be insufficient because the code and the static
> variables are not marked as non-paged.
> 
> The translation IOCTL performance is not critical, so maybe it is worth replacing
> the spinlock with just a global mutex, WDYT?

In the upcoming v3 patch, the lock will be used for block removal which won't fail.

I'm relatively new to Windows driver development. From my perspective, the use
of a spinlock seems appropriate in this driver. Maybe a read-write lock can be
more effective here?
  
Dmitry Kozlyuk Sept. 12, 2023, 12:18 p.m. UTC | #3
2023-09-12 11:13 (UTC+0000), Li, Ming3:
> > Are any of these bugs related?
> > If so, please mention "Bugzilla ID: xxxx" in the commit message.
> > 
> > https://bugs.dpdk.org/show_bug.cgi?id=1201
> > https://bugs.dpdk.org/show_bug.cgi?id=1213
> >   
> 
> Sure, will do.
> 
> I cannot reproduce them in my environment, but judging from the reports,
> both mention that some pages were not unlocked after exit, so they may be related.
> 
> For example, Bug 1201 only occurs on Windows 2019. Could an OS limitation cause
> some memory segment to be freed and then allocated again at the same virtual address?
> Maybe someone can use this patch to check whether TraceView logs show any 'refresh' behavior.

I've posted a comment in BZ 1201 (the bugs are from the same user)
inviting to test your patch, let's see.

[...]
> > > To address this, a refresh function has been added. If a block with
> > > the same base address is detected in the driver's context, the MDL's
> > > physical address is compared with the real physical address.
> > > If they don't match, the MDL within the block is released and rebuilt
> > > to store the correct mapping.  
> > 
> > What if the size is different?
> > Should it be updated for the refreshed block along with the MDL?
> >   
> 
> The size of a single MDL is always 2MB since it describes a hugepage here
> (at least from my observation :))

Your observation is correct, DPDK memalloc layer currently works this way.

> A buffer larger than 2MB consists of
> several mem segs (each with its own MDL). The buggy mem segs are mostly those
> that occupy a whole hugepage: these segments are freed along with the buffer,
> so their MDLs become invalid.
> 
> Since the block is just a wrapper for the MDL and a list entry,
> the refresh action should be applied to the whole block.

There is always a single MDL per block, but it can describe multiple pages
(generally, if used beyond DPDK). Suppose there was a block for one page.
Then this page has been deallocated and allocated again but this time
in the middle of a multi-page region.
With your patch this will work, but that one-page block will be just lost
(never found because its MDL base VA does not match the region start VA).
The downside is that the memory remains locked.

The solution could be to check, when inserting a new block,
if there are existing blocks covered by the new one,
and if so, to free those blocks as they correspond to deallocated regions.
I think this can be done with another patch to limit the scope of this one.
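
The insertion-time check could look roughly like the user-space sketch below.
It is an illustration under assumptions, not the driver's code: `demo_region`
and `region_add()` are hypothetical names, a plain singly linked list stands in
for the per-process block list, and address-range overflow is ignored. In the
driver, freeing a covered block would also unlock its MDL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical block descriptor: a locked range [base, base + size). */
struct demo_region {
	struct demo_region *next;
	uintptr_t base;
	size_t size;
};

/* True if region `r` lies entirely inside [base, base + size). */
static int
region_covered(const struct demo_region *r, uintptr_t base, size_t size)
{
	return r->base >= base && r->base + r->size <= base + size;
}

/*
 * Insert a new region at the list head. Existing regions that the new
 * one covers are treated as leftovers of deallocated memory and freed.
 * Returns the number of evicted regions, or -1 if allocation fails.
 */
static int
region_add(struct demo_region **head, uintptr_t base, size_t size)
{
	struct demo_region **link = head;
	struct demo_region *fresh;
	int evicted = 0;

	fresh = malloc(sizeof(*fresh));
	if (fresh == NULL)
		return -1;

	while (*link != NULL) {
		struct demo_region *cur = *link;

		if (region_covered(cur, base, size)) {
			*link = cur->next; /* Unlink the stale region. */
			free(cur);
			evicted++;
		} else {
			link = &cur->next;
		}
	}

	fresh->base = base;
	fresh->size = size;
	fresh->next = *head;
	*head = fresh;
	return evicted;
}
```

For example, adding a 6 MB region starting at address 0 evicts an existing
2 MB region at 0x200000 but keeps one at 0x800000.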

Ideally virt2phys should not be doing this guesswork at all.
DPDK can just tell it when pages are allocated and freed,
but this requires some rework of the userspace part.
Just thinking out loud.

[...]
> > >  	/* Don't lock the same memory twice. */
> > >  	if (block != NULL) {
> > > +		KeAcquireSpinLock(g_lock, &irql);
> > > +		status = virt2phys_block_refresh(block, base, size);
> > > +		KeReleaseSpinLock(g_lock, irql);  
> > 
> > Is it safe to do all the external calls holding this spinlock?
> > I can't confirm from the doc that ZwQueryVirtualMemory(), for example, does
> > not access pageable data.
> > And virt2phys_lock_memory() raises exceptions, although it handles them.
> > Other stuff seems safe.
> > 
> > The rest of the code only takes the lock to access block and process lists, which
> > are allocated from the non-paged pool.
> > Now that I think of it, this may be insufficient because the code and the static
> > variables are not marked as non-paged.
> > 
> > The translation IOCTL performance is not critical, so maybe it is worth replacing
> > the spinlock with just a global mutex, WDYT?  
> 
> In the upcoming v3 patch, the lock will be used for block removal which won't fail.
> 
> I'm relatively new to Windows driver development. From my perspective, the use
> of a spinlock seems appropriate in this driver. Maybe a read-write lock can be
> more effective here?

It is correctness that I am concerned with, not efficiency.
Translating VA to IOVA is not performance-critical,
the spinlock is used just because it seemed sufficient.

Relating the code to the docs [1]:

* The code within a critical region guarded by a spin lock
  must neither be pageable nor make any references to pageable data.

  - Process and block structures are allocated from the non-paged pool - OK.
  - The code is not marked as non-pageable - FAIL, though never fired.

* The code within a critical region guarded by a spin lock can neither
  call any external function that might access pageable data...

  - MDL manipulation and page locking can run at "dispatch" IRQL - OK.
  - ZwQueryVirtualMemory() - unsure

  ... or raise an exception, nor can it generate any exceptions.

  - MmProbeAndLockPages() does generate an exception on failure,
    but it is handled - unsure

* The caller should release the spin lock with KeReleaseSpinLock as
  quickly as possible.

  - Before the patch, there was a fixed number of locked operations - OK.
  - After the patch, there's more work under the lock, although it seems to
    me that all of it can be done at "dispatch" IRQL - unsure.

I've added Tyler from Microsoft, he might know more.

[1]:
https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-keacquirespinlock
  

Patch

diff --git a/windows/virt2phys/virt2phys.c b/windows/virt2phys/virt2phys.c
index f4d5298..b64a13d 100644
--- a/windows/virt2phys/virt2phys.c
+++ b/windows/virt2phys/virt2phys.c
@@ -182,7 +182,7 @@  virt2phys_device_EvtIoInCallerContext(WDFDEVICE device, WDFREQUEST request)
 {
 	WDF_REQUEST_PARAMETERS params;
 	ULONG code;
-	PVOID *virt;
+	PVOID *pvirt, virt;
 	PHYSICAL_ADDRESS *phys;
 	size_t size;
 	NTSTATUS status;
@@ -207,12 +207,13 @@  virt2phys_device_EvtIoInCallerContext(WDFDEVICE device, WDFREQUEST request)
 	}
 
 	status = WdfRequestRetrieveInputBuffer(
-			request, sizeof(*virt), (PVOID *)&virt, &size);
+			request, sizeof(*pvirt), (PVOID *)&pvirt, &size);
 	if (!NT_SUCCESS(status)) {
 		TraceWarning("Retrieving input buffer: %!STATUS!", status);
 		WdfRequestComplete(request, status);
 		return;
 	}
+	virt = *pvirt;
 
 	status = WdfRequestRetrieveOutputBuffer(
 		request, sizeof(*phys), (PVOID *)&phys, &size);
@@ -222,7 +223,7 @@  virt2phys_device_EvtIoInCallerContext(WDFDEVICE device, WDFREQUEST request)
 		return;
 	}
 
-	status = virt2phys_translate(*virt, phys);
+	status = virt2phys_translate(virt, phys);
 	if (NT_SUCCESS(status))
 		WdfRequestSetInformation(request, sizeof(*phys));
 
diff --git a/windows/virt2phys/virt2phys_logic.c b/windows/virt2phys/virt2phys_logic.c
index e3ff293..8529a2b 100644
--- a/windows/virt2phys/virt2phys_logic.c
+++ b/windows/virt2phys/virt2phys_logic.c
@@ -182,7 +182,7 @@  virt2phys_process_cleanup(HANDLE process_id)
 }
 
 static struct virt2phys_block *
-virt2phys_find_block(HANDLE process_id, void *virt,
+virt2phys_find_block(HANDLE process_id, PVOID virt,
 	struct virt2phys_process **process)
 {
 	PLIST_ENTRY node;
@@ -250,7 +250,7 @@  virt2phys_add_block(struct virt2phys_process *process,
 }
 
 static NTSTATUS
-virt2phys_query_memory(void *virt, void **base, size_t *size)
+virt2phys_query_memory(PVOID virt, PVOID *base, size_t *size)
 {
 	MEMORY_BASIC_INFORMATION info;
 	SIZE_T info_size;
@@ -321,7 +321,7 @@  virt2phys_check_memory(PMDL mdl)
 }
 
 static NTSTATUS
-virt2phys_lock_memory(void *virt, size_t size, PMDL *mdl)
+virt2phys_lock_memory(PVOID virt, size_t size, PMDL *mdl)
 {
 	*mdl = IoAllocateMdl(virt, (ULONG)size, FALSE, FALSE, NULL);
 	if (*mdl == NULL)
@@ -346,12 +346,53 @@  virt2phys_unlock_memory(PMDL mdl)
 	IoFreeMdl(mdl);
 }
 
+static NTSTATUS
+virt2phys_block_refresh(struct virt2phys_block *block, PVOID base, size_t size)
+{
+	NTSTATUS status;
+	PMDL mdl = block->mdl;
+
+	/*
+	 * Check if we need to refresh MDL in block.
+	 * The virtual to physical memory mapping may be changed when the
+	 * virtual memory region is freed by the user process and malloc again,
+	 * then we need to unlock the physical memory and lock again to
+	 * refresh the MDL information stored in block.
+	 */
+	PHYSICAL_ADDRESS block_phys, real_phys;
+
+	block_phys = virt2phys_block_translate(block, base);
+	real_phys = MmGetPhysicalAddress(base);
+
+	if (block_phys.QuadPart == real_phys.QuadPart)
+		/* No need to refresh block. */
+		return STATUS_SUCCESS;
+
+	virt2phys_unlock_memory(mdl);
+	mdl = NULL;
+
+	status = virt2phys_lock_memory(base, size, &mdl);
+	if (!NT_SUCCESS(status))
+		return status;
+
+	status = virt2phys_check_memory(mdl);
+	if (!NT_SUCCESS(status)) {
+		virt2phys_unlock_memory(mdl);
+		return status;
+	}
+	block->mdl = mdl;
+
+	TraceInfo("Block refreshed from %llx to %llx", block_phys.QuadPart, real_phys.QuadPart);
+
+	return STATUS_SUCCESS;
+}
+
 NTSTATUS
 virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
 {
 	PMDL mdl;
 	HANDLE process_id;
-	void *base;
+	PVOID base;
 	size_t size;
 	struct virt2phys_process *process;
 	struct virt2phys_block *block;
@@ -371,6 +412,11 @@  virt2phys_translate(PVOID virt, PHYSICAL_ADDRESS *phys)
 
 	/* Don't lock the same memory twice. */
 	if (block != NULL) {
+		KeAcquireSpinLock(g_lock, &irql);
+		status = virt2phys_block_refresh(block, base, size);
+		KeReleaseSpinLock(g_lock, irql);
+		if (!NT_SUCCESS(status))
+			return status;
 		*phys = virt2phys_block_translate(block, virt);
 		return STATUS_SUCCESS;
 	}