[v3] vfio: fix workaround of BAR0 mapping

Message ID: 20180713101145.4795-1-t.yoshimura8869@gmail.com (mailing list archive)
State: Superseded, archived
Delegated to: Thomas Monjalon
Series: [v3] vfio: fix workaround of BAR0 mapping

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Takeshi Yoshimura July 13, 2018, 10:11 a.m. UTC
The workaround for BAR0 mapping gives up and immediately returns an
error if it cannot map around the MSI-X table. However, recent versions
of VFIO allow mapping the MSI-X BAR (*).

This patch does not return immediately but tries the mapping instead.
On old Linux kernels, mmap just fails and returns the same error as the
code before this fix. On recent Linux kernels, mmap succeeds, and this
patch enables running DPDK in specific environments (e.g., ppc64le with
HGST NVMe).

(*): "vfio-pci: Allow mapping MSIX BAR",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/id=a32295c612c57990d17fb0f41e7134394b2f35f6

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura <t.yoshimura8869@gmail.com>
---

Thanks, Anatoly.

I updated the patch so that it affects the behavior of older Linux
kernels and other environments as little as possible. This patch adds
another chance to mmap BAR0.

I noticed that the check at line 350 already accounts for the page
size, so this patch leaves that check unchanged.

Regards,
Takeshi

drivers/bus/pci/linux/pci_vfio.c | 35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)
  

Comments

Anatoly Burakov July 13, 2018, 11 a.m. UTC | #1
Hi Takeshi,

Please correct me if I'm wrong, but I'm not sure the old behavior is kept.

Let's say we're running an old kernel, which doesn't allow mapping MSI-X
BARs. Suppose the MSI-X table starts at the beginning of the BAR
(floor-aligned to page size) and ends at or beyond the end of the BAR
(ceiling-aligned to page size). In that situation, the old code just
skipped the BAR and returned 0.
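
The corner case described above can be made concrete with a small sketch. It is not the DPDK code: `struct region`, `can_map_around()` and its parameters are hypothetical names chosen for illustration, and clamping the table end to the BAR size is an added assumption. The function floor-aligns the table start, ceiling-aligns the table end, and reports whether anything is left to map on either side of the table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the offset/size pairs kept per mapping. */
struct region {
	uint64_t offset;
	uint64_t size;
};

/*
 * Returns 1 and fills out[0..1] with the areas before and after the
 * page-aligned MSI-X table, or returns 0 when the aligned table covers
 * the whole BAR. page_size must be a power of two.
 */
static int
can_map_around(uint64_t bar_offset, uint64_t bar_size,
	       uint64_t table_offset, uint64_t table_size,
	       uint64_t page_size, struct region out[2])
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	uint64_t page_mask = ~(page_size - 1);
	/* Floor-align the start and ceiling-align the end of the table. */
	uint64_t table_start = table_offset & page_mask;
	uint64_t table_end =
		(table_offset + table_size + page_size - 1) & page_mask;

	/* Assumption of this sketch: never look past the end of the BAR. */
	if (table_end > bar_size)
		table_end = bar_size;

	/* The aligned table spans the BAR, so nothing is left to map. */
	if (table_start == 0 && table_end >= bar_size)
		return 0;

	out[0].offset = bar_offset;
	out[0].size = table_start;
	out[1].offset = bar_offset + table_end;
	out[1].size = bar_size - table_end;
	return 1;
}
```

With assumed numbers, a 0x4000-byte BAR holding a 0x1000-byte table at offset 0x2000 can be split when pages are 4 KiB (`can_map_around(0, 0x4000, 0x2000, 0x1000, 0x1000, r)` returns 1), but with 64 KiB pages the aligned table covers the whole BAR and the function returns 0.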

We then exited the function, and there's a check of the return value right
after pci_vfio_mmap_bar() that stops us from continuing if we fail to map
something. In the old code, we would continue and finish the
rest of our mappings. With your new code, you attempt to map the
BAR, it fails, and you return -1 on older kernels.

I believe what we really need here is the following:

1) If this is a BAR containing the MSI-X table, first try mapping the
entire BAR. If it succeeds, great - that would be your new kernel behavior.
2) If step 1) failed, check whether we can map around the MSI-X table.
If we can, try to map around it like the current code does. If we cannot
map around it (i.e. if the page-aligned MSI-X table occupies the entire BAR),
then we simply return 0 and skip the BAR.

That, I would think, would keep the old behavior and enable the new one.

Does that make sense?
  
Anatoly Burakov July 13, 2018, 11:08 a.m. UTC | #2
I envision this to look something like this:

bool again = false;
do {
	if (again) {
		/* set up mmap-around */
		if (cannot map around)
			return 0;
	}
	/* try mapping */
	if (map_failed && !again && msix_table->bar_index == bar_index) {
		/* first attempt on the MSI-X BAR failed: retry once */
		again = true;
		continue;
	}
	if (map_failed)
		return -1;
	return 0;
} while (again);
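
The same control flow can be written without the loop. The sketch below is only an illustration under stated assumptions: `map_bar_with_fallback()`, the `try_map_fn` callback, and the three simulated kernels are hypothetical names, and the real mmap call is replaced by a callback so that the return codes can be checked in isolation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true when the requested mapping succeeds. */
typedef bool (*try_map_fn)(bool whole_bar, void *ctx);

/*
 * Returns 0 when the BAR is mapped or deliberately skipped, -1 on failure.
 * has_msix: this BAR holds the MSI-X table.
 * can_map_around: the page-aligned table leaves room on either side.
 */
static int
map_bar_with_fallback(bool has_msix, bool can_map_around,
		      try_map_fn try_map, void *ctx)
{
	/* Step 1: try the whole BAR first (succeeds on newer kernels). */
	if (try_map(true, ctx))
		return 0;
	/* A BAR without the MSI-X table has no fallback. */
	if (!has_msix)
		return -1;
	/* Step 2: the table fills the BAR, so skip it as the old code did. */
	if (!can_map_around)
		return 0;
	/* Step 2: retry exactly once, mapping only around the table. */
	return try_map(false, ctx) ? 0 : -1;
}

/* Simulated kernels; ctx points at a bool meaning "BAR holds MSI-X". */
static bool
new_kernel(bool whole_bar, void *ctx)
{
	(void)whole_bar;
	(void)ctx;
	return true;	/* every mapping is allowed */
}

static bool
old_kernel(bool whole_bar, void *ctx)
{
	/* Refuses only a whole-BAR mapping of the MSI-X BAR. */
	return !(whole_bar && *(bool *)ctx);
}

static bool
never_maps(bool whole_bar, void *ctx)
{
	(void)whole_bar;
	(void)ctx;
	return false;	/* models an unrelated mmap failure */
}
```

Passing `old_kernel` with `can_map_around` set to false returns 0 (the BAR is skipped, matching the old behavior), while `never_maps` returns -1 whenever a mapping was actually attempted and failed.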
  
Takeshi Yoshimura July 17, 2018, 8:21 a.m. UTC | #3
That makes sense. The return code was not the same as the old one in some paths.

I wrote code based on your idea. It works at least on my ppc64 and
x86 machines, but I am concerned that the error messages from
pci_map_resource() may confuse users on old Linux kernels. I saw a message
like this (even though I could mmap):
EAL: pci_map_resource(): cannot mmap(15, 0x728ee3a30000, 0x4000, 0x0):
Invalid argument (0xffffffffffffffff)

Anyway, I will send it in the next email. Please check whether there are
any other problems in the code.

Thanks,
Takeshi
  

Patch

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..eb9b8031d 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -348,24 +348,25 @@  pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
 		table_start &= PAGE_MASK;
 
 		if (table_start == 0 && table_end >= bar->size) {
-			/* Cannot map this BAR */
-			RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index);
-			bar->size = 0;
-			bar->addr = 0;
-			return 0;
+			/* Cannot map around this BAR, but try */
+			RTE_LOG(DEBUG, EAL,
+				"Trying to map BAR%d that contains the MSI-X\n",
+				bar_index);
+			memreg[0].offset = bar->offset;
+			memreg[0].size = bar->size;
+		} else {
+			memreg[0].offset = bar->offset;
+			memreg[0].size = table_start;
+			memreg[1].offset = bar->offset + table_end;
+			memreg[1].size = bar->size - table_end;
+
+			RTE_LOG(DEBUG, EAL,
+				"Trying to map BAR%d that contains the MSI-X "
+				"table. Trying offsets: "
+				"0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
+				memreg[0].offset, memreg[0].size,
+				memreg[1].offset, memreg[1].size);
 		}
-
-		memreg[0].offset = bar->offset;
-		memreg[0].size = table_start;
-		memreg[1].offset = bar->offset + table_end;
-		memreg[1].size = bar->size - table_end;
-
-		RTE_LOG(DEBUG, EAL,
-			"Trying to map BAR%d that contains the MSI-X "
-			"table. Trying offsets: "
-			"0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
-			memreg[0].offset, memreg[0].size,
-			memreg[1].offset, memreg[1].size);
 	} else {
 		memreg[0].offset = bar->offset;
 		memreg[0].size = bar->size;