[dpdk-dev,v2] igb_uio: prevent reset for a list of devices

Message ID	20171106184815.9953-1-ferruh.yigit@intel.com (mailing list archive)
State	Accepted, archived
Headers	From: Ferruh Yigit <ferruh.yigit@intel.com> To: Thomas Monjalon <thomas@monjalon.net> Cc: dev@dpdk.org, Ferruh Yigit <ferruh.yigit@intel.com>, stable@dpdk.org, Jianfeng Tan <jianfeng.tan@intel.com>, Jingjing Wu <jingjing.wu@intel.com>, Shijith Thotton <shijith.thotton@caviumnetworks.com>, Gregory Etelson <gregory@weka.io>, Harish Patil <harish.patil@cavium.com>, George Prekas <george.prekas@epfl.ch>, Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>, Rasesh Mody <rasesh.mody@cavium.com>, Lee Roberts <lee.roberts@hpe.com>, Stephen Hemminger <stephen@networkplumber.org> Date: Mon, 6 Nov 2017 18:48:15 +0000 Message-Id: <20171106184815.9953-1-ferruh.yigit@intel.com> In-Reply-To: <20171103223822.28852-1-ferruh.yigit@intel.com> References: <20171103223822.28852-1-ferruh.yigit@intel.com> Subject: [dpdk-dev] [PATCH v2] igb_uio: prevent reset for a list of devices Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>

Checks

Context	Check	Description
ci/checkpatch	success	coding style OK
ci/Intel-compilation	success	Compilation OK

Commit Message

Ferruh Yigit Nov. 6, 2017, 6:48 p.m. UTC

Some devices have problems with the device reset that happens when a DPDK
application exits [1].

Create a static list of devices and exclude them from device reset.

[1]
http://dpdk.org/ml/archives/dev/2017-November/080927.html

Fixes: b58eedfc7dd5 ("igb_uio: issue FLR during open and release of device file")
Cc: stable@dpdk.org

Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
---
Cc: Jianfeng Tan <jianfeng.tan@intel.com>
Cc: Jingjing Wu <jingjing.wu@intel.com>
Cc: Shijith Thotton <shijith.thotton@caviumnetworks.com>
Cc: Gregory Etelson <gregory@weka.io>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: George Prekas <george.prekas@epfl.ch>
Cc: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>
Cc: Rasesh Mody <rasesh.mody@cavium.com>
Cc: Lee Roberts <lee.roberts@hpe.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>

This is an alternative approach to
http://dpdk.org/dev/patchwork/patch/31144/

v2:
* more concise function, no change in functionality
---
 lib/librte_eal/linuxapp/igb_uio/compat.h  | 19 ++++++++++++++++++-
 lib/librte_eal/linuxapp/igb_uio/igb_uio.c |  8 +++++++-
 2 files changed, 25 insertions(+), 2 deletions(-)
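
The diff itself is not reproduced on this page, so the following is only a minimal sketch of the idea in the commit message (a static list of devices that skip the reset), not the code that was applied. The struct, the function name and the vendor/device IDs are all hypothetical placeholders, and the code is plain userspace C rather than kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One PCI vendor/device pair that should skip the reset. */
struct reset_exclude_id {
	uint16_t vendor;
	uint16_t device;
};

/* Hypothetical placeholder IDs, NOT the devices listed in the real patch. */
static const struct reset_exclude_id reset_exclude_list[] = {
	{ 0xabcd, 0x0001 },
	{ 0xabcd, 0x0002 },
};

/* Return true when the device is on the static list and must not be reset. */
static bool reset_is_excluded(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(reset_exclude_list) / sizeof(reset_exclude_list[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (reset_exclude_list[i].vendor == vendor &&
		    reset_exclude_list[i].device == device)
			return true;
	}
	return false;
}
```

In a driver, a check of this kind would presumably guard the pci_reset_function() call in the open and release paths, so that listed devices are left untouched while all others are still reset.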

Comments

Thomas Monjalon Nov. 6, 2017, 11:55 p.m. UTC | #1

06/11/2017 19:48, Ferruh Yigit:
> Some devices are having problem on device reset that happens during DPDK
> application exit [1].
> 
> Create a static list of devices and exclude them from device reset.
> 
> [1]
> http://dpdk.org/ml/archives/dev/2017-November/080927.html
> 
> Fixes: b58eedfc7dd5 ("igb_uio: issue FLR during open and release of device file")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>

Applied, thanks

An option may be required to disable this exception
which may be a security hole.

Chas Williams Nov. 7, 2017, 11:50 a.m. UTC | #2

We still have an issue with this and PCI pass-through.  If a guest is
restarted while using PCI pass-through and igb_uio issues a
pci_reset_function(), this causes the host to crash.

Thomas Monjalon Nov. 7, 2017, 1:02 p.m. UTC | #3

07/11/2017 12:50, Chas Williams:
> We still have an issue with this and PCI pass-through.  If a guest is
> restarted while using PCI pass-through and igb_uio issues a
> pci_reset_function(), this causes the host to crash.

Please, could you better explain the exact scenario and the cause of the crash?
Thanks

Stephen Hemminger Nov. 7, 2017, 1:13 p.m. UTC | #4

On Nov 7, 2017 20:50, "Chas Williams" <3chas3@gmail.com> wrote:

> We still have an issue with this and PCI pass-through.  If a guest is
> restarted while using PCI pass-through and igb_uio issues a
> pci_reset_function(), this causes the host to crash.

Which host? Anything a guest can do to crash the host is a high-severity bug in
the host.

Chas Williams Nov. 7, 2017, 6:12 p.m. UTC | #5

Environment: Dell PowerEdge R730, Intel Corporation 82599ES 10-Gigabit
SFI/SFP+ Network Connection shared via PCI pass-through
Host: Debian 8
Guest: Custom Debian 8 with DPDK application based on 17.11

When we shut down the guest, the kernel panics with:

[  279.021818] Do you have a strange power saving mode enabled?
[  279.021819] Dazed and confused, but trying to continue
[  279.021847] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[  279.021849] {1}[Hardware Error]: event severity: fatal
[  279.021850] {1}[Hardware Error]:  Error 0, type: fatal
[  279.021851] {1}[Hardware Error]:   section_type: PCIe error
[  279.021852] {1}[Hardware Error]:   port_type: 0, PCIe end point
[  279.021853] {1}[Hardware Error]:   version: 1.16
[  279.021854] {1}[Hardware Error]:   command: 0x0507, status: 0x4010
[  279.021855] {1}[Hardware Error]:   device_id: 0000:03:00.0
[  279.021855] {1}[Hardware Error]:   slot: 0
[  279.021856] {1}[Hardware Error]:   secondary_bus: 0x00
[  279.021857] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
[  279.021858] {1}[Hardware Error]:   class_code: 000002
[  279.021859] Kernel panic - not syncing: Fatal hardware error!
[  279.021977] sched: Unexpected reschedule of offline CPU#1!
[  279.021984] ------------[ cut here ]------------
[  279.021992] WARNING: CPU: 43 PID: 2807 at /build/linux-fHlJSJ/linux-4.12.6/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x34/0x40
[  279.021993] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c crc32c_generic nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache tun intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 ttm drm_kms_helper drm joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_algo_bit ipmi_si ipmi_devintf iTCO_wdt intel_cstate iTCO_vendor_support evdev intel_uncore mxm_wmi lpc_ich ipmi_msghandler mfd_core ioatdma intel_rapl_perf dcdbas pcspkr shpchp mei_me button wmi mei acpi_power_meter tpm_crb autofs4 ext4 crc16 jbd2 fscrypto mbcache sr_mod cdrom sg hid_generic usbhid hid sd_mod
[  279.022044]  crc32c_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci ehci_pci libahci ehci_hcd ixgbe libata megaraid_sas usbcore dca i40e usb_common ptp pps_core scsi_mod mdio
[  279.022060] CPU: 43 PID: 2807 Comm: revalidator85 Not tainted 4.12.0-1-amd64 #1 Debian 4.12.6-1
[  279.022061] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.3.4 11/08/2016
[  279.022062] task: ffff91d0473f7100 task.stack: ffffafef8f4a4000
[  279.022066] RIP: 0010:native_smp_send_reschedule+0x34/0x40
[  279.022067] RSP: 0018:ffffafef8f4a7c98 EFLAGS: 00010082
[  279.022069] RAX: 000000000000002e RBX: ffff91d059d24080 RCX: 0000000000000001
[  279.022070] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000046
[  279.022071] RBP: ffff91d04691d100 R08: 0000000000000000 R09: 000000000000002e
[  279.022072] R10: ffffafef8f4a7c90 R11: 00000000001cbb78 R12: ffff91d85d21ae80
[  279.022073] R13: ffff91d059d24000 R14: 0000000000000002 R15: 0000000000000008
[  279.022075] FS:  00007f726affd700(0000) GS:ffff91d85d740000(0000) knlGS:0000000000000000
[  279.022076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  279.022077] CR2: 00007fd422a52c48 CR3: 000000042d90f000 CR4: 00000000003426e0
[  279.022078] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  279.022079] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  279.022080] Call Trace:
[  279.022086]  ? check_preempt_wakeup+0x181/0x220
[  279.022091]  ? check_preempt_curr+0x74/0x80
[  279.022094]  ? ttwu_do_wakeup+0x19/0x140
[  279.022098]  ? try_to_wake_up+0x1b8/0x470
[  279.022101]  ? wake_up_q+0x3f/0x70
[  279.022106]  ? futex_wake+0x15a/0x170
[  279.022108]  ? do_futex+0x2df/0xa90
[  279.022111]  ? SyS_futex+0x7a/0x170
[  279.022113]  ? SyS_read+0x76/0xc0
[  279.022118]  ? system_call_fast_compare_end+0xc/0x97
[  279.022119] Code: a3 05 51 fb cc 00 73 15 48 8b 05 28 74 a3 00 be fd 00 00 00 48 8b 80 a0 00 00 00 ff e0 89 fe 48 c7 c7 88 5c de b6 e8 e2 c9 13 00 <0f> ff c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 8b 05 5d 00
[  279.022151] ---[ end trace eddc980dc8648163 ]---
[  279.454274] Kernel Offset: 0x35400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

The test engineer says this doesn't happen if we use SRIOV (which makes
sense since the device isn't directly shared between the guest and the
host).  If I remove the pci_reset_function() from igb_uio's .release, then
all is well.
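
For readers decoding the "command: 0x0507, status: 0x4010" line of the hardware error record, here is a small sketch that tests individual bits of the standard PCI Command and Status registers. The bit positions follow the PCI specification; the macro and function names are made up for this illustration and are not taken from igb_uio or from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* PCI Command register bits (positions per the PCI specification). */
#define CMD_IO_SPACE      0x0001 /* bit 0: I/O space decoding enabled */
#define CMD_MEMORY_SPACE  0x0002 /* bit 1: memory space decoding enabled */
#define CMD_BUS_MASTER    0x0004 /* bit 2: device may initiate DMA */
#define CMD_SERR_ENABLE   0x0100 /* bit 8: SERR# reporting enabled */
#define CMD_INTX_DISABLE  0x0400 /* bit 10: legacy INTx interrupts disabled */

/* PCI Status register bits. */
#define STS_CAP_LIST      0x0010 /* bit 4: capabilities list present */
#define STS_SIG_SYS_ERROR 0x4000 /* bit 14: device signaled a system error */

/* Illustrative helpers: each one tests a single register bit. */
static bool cmd_bus_master_enabled(uint16_t command)
{
	return (command & CMD_BUS_MASTER) != 0;
}

static bool cmd_intx_disabled(uint16_t command)
{
	return (command & CMD_INTX_DISABLE) != 0;
}

static bool cmd_serr_enabled(uint16_t command)
{
	return (command & CMD_SERR_ENABLE) != 0;
}

static bool status_signaled_system_error(uint16_t status)
{
	return (status & STS_SIG_SYS_ERROR) != 0;
}
```

Applied to the values in the log, this reads as: bus mastering and SERR# reporting were enabled, INTx was disabled, and the device had signaled a system error, which fits the fatal PCIe error reported through APEI. It does not by itself explain why the reset triggered the error.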


Chas Williams Nov. 7, 2017, 6:14 p.m. UTC | #6

Regardless of whether the issue is actually in the host kernel, I cannot fix
all the hypervisors, so I must attempt to be well behaved as a guest.

Ferruh Yigit Nov. 7, 2017, 6:49 p.m. UTC | #7

On 11/7/2017 10:12 AM, Chas Williams wrote:
> Environment: Dell PowerEdge R730, Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> Network Connection shared via PCI pass-through
> Host: Debian 8
> Guest: Custom Debian 8 with DPDK application based on 17.11
> 
> When we shutdown the guest, the kernel panics with:
> 
> The test engineer says this doesn't happen if we use SRIOV (which makes sense
> since the device isn't directly shared between the guest and the host).  If I
> remove the pci_reset_function() from igb_uio's .release, then all is well.

This was tougher than expected, with so many unexpected behaviors. Why does
resetting a pass-through device in the guest cause a crash in the host?

Finally, I will send a patch to remove the reset. Hopefully there will be no
more surprises for the release.

Two improvements will still remain for igb_uio for better security:
disabling the device interrupt on exit and clearing bus master on exit.
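
As a rough illustration of those two remaining steps, the sketch below shows what they amount to at the level of the PCI Command register: clearing the Bus Master Enable bit and setting the INTx Disable bit. This is a userspace model that works on a plain 16-bit value, with names invented for the example. In a kernel driver the equivalent would more likely go through helpers such as pci_clear_master() and pci_intx(), and MSI/MSI-X interrupts are handled separately from the INTx bit shown here.

```c
#include <stdint.h>

/* PCI Command register bits (positions per the PCI specification). */
#define CMD_BUS_MASTER   0x0004 /* bit 2: device may initiate DMA */
#define CMD_INTX_DISABLE 0x0400 /* bit 10: legacy INTx interrupts disabled */

/*
 * Illustrative helper: return the Command register value a device would
 * have after being quiesced, i.e. no more DMA and no more INTx interrupts.
 * All other bits are left unchanged.
 */
static uint16_t cmd_quiesce(uint16_t command)
{
	command &= (uint16_t)~CMD_BUS_MASTER;  /* stop DMA */
	command |= CMD_INTX_DISABLE;           /* mask legacy interrupts */
	return command;
}
```

With the command value from the log above, cmd_quiesce(0x0507) yields 0x0503: bus mastering is cleared and INTx stays disabled. Unlike a function-level reset, this leaves the rest of the device state as it was.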

Chas Williams Nov. 7, 2017, 8:47 p.m. UTC | #8

I will confess I haven't looked into the issue too hard since I have a
workaround.  My first guess is that something is going on with the IOMMU:
quiescing a PCI pass-through device/function from the guest seems iffy,
since I don't think the IOMMU is "visible" to the guest.

Most devices have some sort of reset to put the device into a known state
for setup/configuration (or enable/disable for the DMA engines).  If this
is done at .dev_close(), shouldn't that be as sufficient as resetting the
function?

Ferruh Yigit Nov. 7, 2017, 10:26 p.m. UTC | #9

On 11/7/2017 12:47 PM, Chas Williams wrote:
> I will confess I haven't looked into the issue too hard since I have a
> workaround.  My first guess is that there is something going on with the IOMMU
> and quiescing a PCI pass-through device/function from the guest (since I don't
> think the IOMMU is "visible" to the guest) seems iffy.
> 
> Most devices have some sort of reset to put the device into a known state for
> setup/configuration (or enable/disable for the DMA engines).  If this is done at
> .dev_close(), shouldn't that be as sufficient as resetting the function?

This is for cases where the DPDK app terminates unexpectedly; the proper exit
path already does cleanup.

Chas Williams Nov. 8, 2017, noon UTC | #10

On Tue, Nov 7, 2017 at 5:26 PM, Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> On 11/7/2017 12:47 PM, Chas Williams wrote:
> > I will confess I haven't looked into the issue too hard since I have a
> > workaround.  My first guess is that there is something going on with the
> IOMMU
> > and quiescing a PCI pass-through device/function from the guest (since I
> don't
> > think the IOMMU is "visible" to the guest) seems iffy.
> >
> > Most devices have some sort of reset to put the device into a known
> state for
> > setup/configuration (or enable/disable for the DMA engines).  If this is
> done at
> > .dev_close(), shouldn't that be as sufficient as resetting the function?
>
> This is for the cases DPDK app terminated unexpectedly, proper exit path
> already
> does cleanup.
>

Call a usermode helper from igb_uio that does an open/close on the device
about to be released?



Gregory Etelson Nov. 9, 2017, 5:20 p.m. UTC | #11

Hello,

Some AWS R3.8XLARGE instances fail to bind Intel 10G VFs
with igb_uio [c05cb4f939082].
The system dmesg log shows this backtrace:

igb_uio: probe of 0000:00:05.0 failed with error -16
IRQ handler type mismatch for IRQ 0
current handler: timer
Pid: 3619, comm: bash Not tainted 2.6.32-642.15.1.el6.x86_64 #1
Call Trace:
 [<ffffffff810f49e2>] ? __setup_irq+0x382/0x3c0
 [<ffffffffa03202a0>] ? uio_interrupt+0x0/0x48 [uio]
 [<ffffffff810f51e3>] ? request_threaded_irq+0x133/0x230
 [<ffffffffa0320193>] ? __uio_register_device+0x553/0x610 [uio]
 [<ffffffffa032698f>] ? igbuio_pci_probe+0x290/0x47a [igb_uio]
 [<ffffffff8129d00a>] ? kobject_get+0x1a/0x30
 [<ffffffff812c04f7>] ? local_pci_probe+0x17/0x20
 [<ffffffff812c16e1>] ? pci_device_probe+0x101/0x120
 [<ffffffff81382152>] ? driver_sysfs_add+0x62/0x90
 [<ffffffff813823fa>] ? driver_probe_device+0xaa/0x3a0
 [<ffffffff8138153a>] ? driver_bind+0xca/0x110
 [<ffffffff813805dc>] ? drv_attr_store+0x2c/0x30
 [<ffffffff812171c5>] ? sysfs_write_file+0xe5/0x170
 [<ffffffff81199e48>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8119b336>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a981>] ? sys_write+0x51/0xb0

The VFs can be returned to the kernel ixgbevf driver with no faults.

The instances can bind VFs with igb_uio [b58eedfc7dd57].

I have not yet found why some R3.8XLARGE instances can bind IXGBE VFs with
igb_uio while others fail.

lspci -vvv -s 0000:00:05.0
00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
        Physical Slot: 5
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64
        Region 0: Memory at f3010000 (64-bit, prefetchable) [size=16K]
        Region 3: Memory at f3014000 (64-bit, prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Kernel driver in use: ixgbevf
        Kernel modules: ixgbevf

Regards,
Gregory
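
As a reading aid for the "Control:" line in the lspci output above: lspci prints one flag per bit of the 16-bit PCI command register, lowest bit first. The following is a minimal sketch in plain userspace C, not code from igb_uio. The function name format_pci_command and the name table are illustrative assumptions; only the bit order (bit 0 = I/O up to bit 10 = DisINTx) follows the standard command register layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag names in the order lspci prints them; bit i of the PCI command
 * register corresponds to cmd_flag_names[i]. (Illustrative table.) */
const char *const cmd_flag_names[] = {
    "I/O", "Mem", "BusMaster", "SpecCycle", "MemWINV", "VGASnoop",
    "ParErr", "Stepping", "SERR", "FastB2B", "DisINTx",
};

/* Render a command register value in the style of lspci's "Control:" line.
 * Writes at most len bytes into out (always NUL-terminated) and returns out. */
char *format_pci_command(uint16_t cmd, char *out, size_t len)
{
    size_t n = sizeof(cmd_flag_names) / sizeof(cmd_flag_names[0]);
    size_t used = 0;

    if (len == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, len - used, "%s%s%c",
                         i ? " " : "", cmd_flag_names[i],
                         ((cmd >> i) & 1) ? '+' : '-');
        if (w < 0 || (size_t)w >= len - used)
            break; /* buffer too small: stop with a truncated string */
        used += (size_t)w;
    }
    return out;
}
```

With this mapping, the value 0x0406 renders as the Control line shown above (Mem+, BusMaster+, DisINTx+). The "command: 0x0507" value from the host's hardware error report quoted in this thread additionally has I/O+ and SERR+ set.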

Ferruh Yigit Nov. 10, 2017, 1:40 a.m. UTC | #12

On 11/8/2017 4:00 AM, Chas Williams wrote:
> 
> 
> On Tue, Nov 7, 2017 at 5:26 PM, Ferruh Yigit <ferruh.yigit@intel.com
> <mailto:ferruh.yigit@intel.com>> wrote:
> 
>     On 11/7/2017 12:47 PM, Chas Williams wrote:
>     > I will confess I haven't looked into the issue too hard since I have a
>     > workaround.  My first guess is that there is something going on with the IOMMU
>     > and quiescing a PCI pass-through device/function from the guest (since I don't
>     > think the IOMMU is "visible" to the guest) seems iffy.
>     >
>     > Most devices have some sort of reset to put the device into a known state for
>     > setup/configuration (or enable/disable for the DMA engines).  If this is done at
>     > .dev_close(), shouldn't that be as sufficient as resetting the function?
> 
>     This is for the cases DPDK app terminated unexpectedly, proper exit path already
>     does cleanup.
> 
> 
> Call a usermode helper from igb_uio that does an open/close on the device about
> to be released?

Can generic userspace code know how to clean up the various devices?
I guess a driver is required for this work, and the DPDK application that has
the drivers has already exited at that stage.
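
For contrast, the device-independent part of the cleanup mentioned earlier in the thread (disabling the device interrupt and clearing bus mastering on exit) needs no driver knowledge, because both are bits in the standard PCI command register. The sketch below is a userspace model of that bit manipulation only, under stated assumptions: the macro and function names are hypothetical, the bit positions mirror the command register layout (bit 2 = bus master, bit 10 = INTx disable), and it touches no hardware. In a kernel module the equivalent operations would normally go through helpers such as pci_clear_master() and pci_intx() rather than raw register writes. The INTx bit covers only the legacy interrupt line, so MSI/MSI-X would need separate handling.

```c
#include <stdint.h>

/* Hypothetical local names; the values follow the PCI command register layout. */
#define SKETCH_CMD_BUS_MASTER   (1u << 2)   /* device may initiate DMA */
#define SKETCH_CMD_INTX_DISABLE (1u << 10)  /* legacy INTx assertion disabled */

/* Return the command register value a quiesced device would have:
 * bus mastering off (no more DMA) and legacy INTx masked.
 * All other bits are left untouched. */
uint16_t quiesce_pci_command(uint16_t cmd)
{
    cmd &= (uint16_t)~SKETCH_CMD_BUS_MASTER;
    cmd |= SKETCH_CMD_INTX_DISABLE;
    return cmd;
}
```

Applied to the 0x0507 value from the host error report, this yields 0x0503. Anything beyond these two bits (stopping queues, resetting DMA engines) is device specific, which is the gap the question above points at.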


Ferruh Yigit Nov. 10, 2017, 1:42 a.m. UTC | #13

On 11/9/2017 9:20 AM, Gregory Etelson wrote:
> Hello,
> 
> There are some AWS R3.8XLARGE instances 
> that fail to bind Intel 10G VFs with igb_uio [c05cb4f939082]. 

Hi Gregory,

Will you dig into this issue further? Please keep us updated.


Ferruh Yigit Nov. 10, 2017, 2:11 a.m. UTC | #14

On 11/9/2017 5:42 PM, Ferruh Yigit wrote:
> On 11/9/2017 9:20 AM, Gregory Etelson wrote:
>> Hello,
>>
>> There are some AWS R3.8XLARGE instances 
>> that fail to bind Intel 10G VFs with igb_uio [c05cb4f939082]. 
> 
> Hi Gregory,
> 
> Will you dig this issue more? Please keep us updated.
> 
>> System dmeg log show this backtrace:
>>
>> igb_uio: probe of 0000:00:05.0 failed with error -16
>> IRQ handler type mismatch for IRQ 0
>> current handler: timer
>> Pid: 3619, comm: bash Not tainted 2.6.32-642.15.1.el6.x86_64 #1
>> Call Trace:
>>  [<ffffffff810f49e2>] ? __setup_irq+0x382/0x3c0
>>  [<ffffffffa03202a0>] ? uio_interrupt+0x0/0x48 [uio]
>>  [<ffffffff810f51e3>] ? request_threaded_irq+0x133/0x230
>>  [<ffffffffa0320193>] ? __uio_register_device+0x553/0x610 [uio]
>>  [<ffffffffa032698f>] ? igbuio_pci_probe+0x290/0x47a [igb_uio]

Here igb_uio probe() calls request_irq(); this behavior has changed in the latest code.

Can you please double-check that you are using the latest code?

Thanks,
ferruh


Gregory Etelson Nov. 10, 2017, 6:36 a.m. UTC | #15

It looks like the igb_uio bind failed on servers running CentOS-6.x.
Servers with CentOS-7.3, Ubuntu-14, Ubuntu-16 and AWS-1703 (Amazon Linux)
had no bind issues.

Regards,
Gregory


Stephen Hemminger Nov. 13, 2017, 11:46 p.m. UTC | #16

On Wed, 8 Nov 2017 07:00:23 -0500
Chas Williams <3chas3@gmail.com> wrote:

> On Tue, Nov 7, 2017 at 5:26 PM, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> On 11/7/2017 12:47 PM, Chas Williams wrote:
> > I will confess I haven't looked into the issue too hard since I have a
> > workaround.  My first guess is that there is something going on with the IOMMU
> > and quiescing a PCI pass-through device/function from the guest (since I don't
> > think the IOMMU is "visible" to the guest) seems iffy.
> >
> > Most devices have some sort of reset to put the device into a known state for
> > setup/configuration (or enable/disable for the DMA engines).  If this is done at
> > .dev_close(), shouldn't that be as sufficient as resetting the function?
> 
> This is for the cases DPDK app terminated unexpectedly, proper exit path already
> does cleanup.
> 
> Call a usermode helper from igb_uio that does an open/close on the device about to be released?

Usermode helpers are hated by upstream kernel developers. There are many problems,
such as which namespace the helper runs in and security.

Ferruh Yigit Nov. 15, 2017, 3:44 p.m. UTC | #17

On 11/9/2017 10:36 PM, Gregory Etelson wrote:
> It looks like igb_uio bind failed on servers running CentOS-6.x

Hi Gregory,

The backtrace below seems to come from old code; can you please confirm that you are
using the latest igb_uio?

And what is the kernel version on those boxes?

Thanks,
ferruh

> Servers with CentOS-7.3 Ubuntu-14, Ubuntu-16 and AWS-1703 (Amazon Linux)
> had no bind issues
> 
> Regards,
> Gregory
> 

Gregory Etelson Nov. 15, 2017, 4:30 p.m. UTC | #18

Hello Ferruh,

I re-checked igb_uio from the DPDK master branch at commit
0384f21dffc9081d1ae30f0a6e49926bfc4be85d.
OS: CentOS release 6.7 (Final), kernel 2.6.32-573.26.1.el6.x86_64

igb_uio: Use MSIX interrupt by default
ixgbevf: eth1: ixgbevf_remove: Remove complete
IRQ handler type mismatch for IRQ 0
current handler: timer
Pid: 7995, comm: python Not tainted 2.6.32-573.26.1.el6.x86_64 #1
Call Trace:
 [<ffffffff810eee02>] ? __setup_irq+0x382/0x3c0
 [<ffffffffa00212a0>] ? uio_interrupt+0x0/0x48 [uio]
 [<ffffffff810ef603>] ? request_threaded_irq+0x133/0x230
 [<ffffffffa0021193>] ? __uio_register_device+0x553/0x610 [uio]
 [<ffffffffa003797f>] ? igbuio_pci_probe+0x290/0x47a [igb_uio]
 [<ffffffff81292c7a>] ? kobject_get+0x1a/0x30
 [<ffffffff812b5e37>] ? local_pci_probe+0x17/0x20
 [<ffffffff812b7021>] ? pci_device_probe+0x101/0x120
 [<ffffffff813748e2>] ? driver_sysfs_add+0x62/0x90
 [<ffffffff81374b8a>] ? driver_probe_device+0xaa/0x3a0
 [<ffffffff81374f2b>] ? __driver_attach+0xab/0xb0
 [<ffffffff81374e80>] ? __driver_attach+0x0/0xb0
 [<ffffffff81373d74>] ? bus_for_each_dev+0x64/0x90
 [<ffffffff8137481e>] ? driver_attach+0x1e/0x20
 [<ffffffff812b73c7>] ? pci_add_dynid+0xc7/0xf0
 [<ffffffff812b74c2>] ? store_new_id+0xd2/0x110
 [<ffffffff81372d6c>] ? drv_attr_store+0x2c/0x30
 [<ffffffff8120eb15>] ? sysfs_write_file+0xe5/0x170
 [<ffffffff81192208>] ? vfs_write+0xb8/0x1a0
 [<ffffffff811936f6>] ? fget_light_pos+0x16/0x50
 [<ffffffff81192d41>] ? sys_write+0x51/0xb0
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
igb_uio: probe of 0000:00:04.0 failed with error -16
IRQ handler type mismatch for IRQ 0

Regards,
Gregory

On Wed, Nov 15, 2017 at 5:44 PM, Ferruh Yigit <ferruh.yigit@intel.com>
wrote:

> On 11/9/2017 10:36 PM, Gregory Etelson wrote:
> > It looks like the igb_uio bind fails on servers running CentOS-6.x.
>
> Hi Gregory,
>
> The backtrace below seems to come from old code; can you please confirm
> that you are using the latest igb_uio?
>
> And what is the kernel version on those boxes?
>
> Thanks,
> ferruh
>
> > Servers with CentOS-7.3, Ubuntu-14, Ubuntu-16 and AWS-1703 (Amazon Linux)
> > had no bind issues.
> >
> > Regards,
> > Gregory
> >
> > On Fri, Nov 10, 2017 at 3:42 AM, Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> >
> >     On 11/9/2017 9:20 AM, Gregory Etelson wrote:
> >     > Hello,
> >     >
> >     > There are some AWS R3.8XLARGE instances
> >     > that fail to bind Intel 10G VFs with igb_uio [c05cb4f939082].
> >
> >     Hi Gregory,
> >
> >     Will you dig into this issue further? Please keep us updated.
> >
> >     > The system dmesg log shows this backtrace:
> >     >
> >     > igb_uio: probe of 0000:00:05.0 failed with error -16
> >     > IRQ handler type mismatch for IRQ 0
> >     > current handler: timer
> >     > Pid: 3619, comm: bash Not tainted 2.6.32-642.15.1.el6.x86_64 #1
> >     > Call Trace:
> >     >  [<ffffffff810f49e2>] ? __setup_irq+0x382/0x3c0
> >     >  [<ffffffffa03202a0>] ? uio_interrupt+0x0/0x48 [uio]
> >     >  [<ffffffff810f51e3>] ? request_threaded_irq+0x133/0x230
> >     >  [<ffffffffa0320193>] ? __uio_register_device+0x553/0x610 [uio]
> >     >  [<ffffffffa032698f>] ? igbuio_pci_probe+0x290/0x47a [igb_uio]
> >     >  [<ffffffff8129d00a>] ? kobject_get+0x1a/0x30
> >     >  [<ffffffff812c04f7>] ? local_pci_probe+0x17/0x20
> >     >  [<ffffffff812c16e1>] ? pci_device_probe+0x101/0x120
> >     >  [<ffffffff81382152>] ? driver_sysfs_add+0x62/0x90
> >     >  [<ffffffff813823fa>] ? driver_probe_device+0xaa/0x3a0
> >     >  [<ffffffff8138153a>] ? driver_bind+0xca/0x110
> >     >  [<ffffffff813805dc>] ? drv_attr_store+0x2c/0x30
> >     >  [<ffffffff812171c5>] ? sysfs_write_file+0xe5/0x170
> >     >  [<ffffffff81199e48>] ? vfs_write+0xb8/0x1a0
> >     >  [<ffffffff8119b336>] ? fget_light_pos+0x16/0x50
> >     >  [<ffffffff8119a981>] ? sys_write+0x51/0xb0
> >     >
> >     > The VFs can be returned to the kernel ixgbevf driver with no faults.
> >     >
> >     > The instances can bind VFs with igb_uio [b58eedfc7dd57].
> >     >
> >     > I have not yet found why some R3.8XLARGE instances can bind IXGBE VFs
> >     > with igb_uio while others fail.
> >     >
> >     > lspci -vvv -s 0000:00:05.0
> >     > 00:05.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> >     >         Physical Slot: 5
> >     >         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >     >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >     >         Latency: 64
> >     >         Region 0: Memory at f3010000 (64-bit, prefetchable) [size=16K]
> >     >         Region 3: Memory at f3014000 (64-bit, prefetchable) [size=16K]
> >     >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> >     >                 Vector table: BAR=3 offset=00000000
> >     >                 PBA: BAR=3 offset=00002000
> >     >         Kernel driver in use: ixgbevf
> >     >         Kernel modules: ixgbevf
> >     >
> >     > Regards,
> >     > Gregory
> >     >
> >
> >
>
>

Patch

diff --git a/lib/librte_eal/linuxapp/igb_uio/compat.h b/lib/librte_eal/linuxapp/igb_uio/compat.h
index 30508f35c..5d7223124 100644
--- a/lib/librte_eal/linuxapp/igb_uio/compat.h
+++ b/lib/librte_eal/linuxapp/igb_uio/compat.h
@@ -133,4 +133,21 @@  static bool pci_check_and_mask_intx(struct pci_dev *pdev)
 #define HAVE_PCI_MSI_MASK_IRQ 1
 #endif
 
-
+#define BROADCOM_PCI_VENDOR_ID 0x14E4
+static const struct pci_device_id no_reset_pci_tbl[] = {
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x168a) }, /* 57800 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x164f) }, /* 57711 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x168e) }, /* 57810 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x163d) }, /* 57811 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x168d) }, /* 57840_OBS */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16a1) }, /* 57840_4_10 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16a2) }, /* 57840_2_20 */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16ae) }, /* 57810_MF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x163e) }, /* 57811_MF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16a4) }, /* 57840_MF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16a9) }, /* 57800_VF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16af) }, /* 57810_VF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x163f) }, /* 57811_VF */
+	{ PCI_DEVICE(BROADCOM_PCI_VENDOR_ID, 0x16ad) }, /* 57840_VF */
+	{ 0 },
+};
diff --git a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
index fd320d87d..037e02267 100644
--- a/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
+++ b/lib/librte_eal/linuxapp/igb_uio/igb_uio.c
@@ -348,6 +348,11 @@  igbuio_pci_open(struct uio_info *info, struct inode *inode)
 	return 0;
 }
 
+static bool is_device_excluded_from_reset(struct pci_dev *pdev)
+{
+	return !!pci_match_id(no_reset_pci_tbl, pdev);
+}
+
 static int
 igbuio_pci_release(struct uio_info *info, struct inode *inode)
 {
@@ -360,7 +365,8 @@  igbuio_pci_release(struct uio_info *info, struct inode *inode)
 	/* stop the device from further DMA */
 	pci_clear_master(dev);
 
-	pci_reset_function(dev);
+	if (!is_device_excluded_from_reset(dev))
+		pci_reset_function(dev);
 
 	return 0;
 }
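
As a rough illustration of the check this patch adds to the release path, the user-space sketch below models the table lookup. It is not the kernel code: `struct dev_id`, `no_reset_tbl`, `would_reset_on_release()` and `no_reset_self_check()` are hypothetical stand-ins, only a subset of the patch's table is reproduced, and the comparison is limited to vendor and device IDs (the kernel's `pci_match_id()` also considers subsystem IDs and class, which `PCI_DEVICE()` leaves as wildcards).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's struct pci_device_id. */
struct dev_id {
	uint16_t vendor;
	uint16_t device;
};

#define BROADCOM_PCI_VENDOR_ID 0x14E4

/* Subset of the patch's no-reset table; an all-zero entry terminates it. */
static const struct dev_id no_reset_tbl[] = {
	{ BROADCOM_PCI_VENDOR_ID, 0x168a }, /* 57800 */
	{ BROADCOM_PCI_VENDOR_ID, 0x168e }, /* 57810 */
	{ BROADCOM_PCI_VENDOR_ID, 0x16af }, /* 57810_VF */
	{ 0, 0 },
};

/* Simplified model of pci_match_id(): vendor/device comparison only. */
static bool is_excluded_from_reset(uint16_t vendor, uint16_t device)
{
	const struct dev_id *id;

	for (id = no_reset_tbl; id->vendor != 0; id++) {
		if (id->vendor == vendor && id->device == device)
			return true;
	}
	return false;
}

/* Models the release path: true when a function reset would be issued. */
static bool would_reset_on_release(uint16_t vendor, uint16_t device)
{
	return !is_excluded_from_reset(vendor, device);
}

/* Usage example: listed devices skip the reset, all others still get it. */
void no_reset_self_check(void)
{
	assert(!would_reset_on_release(BROADCOM_PCI_VENDOR_ID, 0x168a));
	/* Intel vendor ID with a device ID that is not in the table. */
	assert(would_reset_on_release(0x8086, 0x10ed));
}
```

Keeping the exclusions in a sentinel-terminated table means further devices can be added without touching the release logic.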