[v12,4/7] bus/pci: implement sigbus handler ops

Message ID 1538483726-96411-5-git-send-email-jia.guo@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v12,1/7] bus: add hot-unplug handler |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Guo, Jia Oct. 2, 2018, 12:35 p.m. UTC
  This patch implements the ops for the PCI bus sigbus handler. It finds the
PCI device that is being hot-unplugged and calls the relevant ops of the
hot-unplug handler to handle the hot-unplug failure of the device.

Signed-off-by: Jeff Guo <jia.guo@intel.com>
Acked-by: Shaopeng He <shaopeng.he@intel.com>
---
v12->v11:
no change.
---
 drivers/bus/pci/pci_common.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
  

Comments

Anatoly Burakov Oct. 2, 2018, 2:39 p.m. UTC | #1
On 02-Oct-18 1:35 PM, Jeff Guo wrote:
> This patch implements the ops for the PCI bus sigbus handler. It finds the
> PCI device that is being hot-unplugged and calls the relevant ops of the
> hot-unplug handler to handle the hot-unplug failure of the device.
> 
> Signed-off-by: Jeff Guo <jia.guo@intel.com>
> Acked-by: Shaopeng He <shaopeng.he@intel.com>
> ---
> v12->v11:
> no change.
> ---
>   drivers/bus/pci/pci_common.c | 53 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 53 insertions(+)
> 
> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> index d286234..f313fe9 100644
> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -405,6 +405,36 @@ pci_find_device(const struct rte_device *start, rte_dev_cmp_t cmp,
>   	return NULL;
>   }
>   
> +/**
> + * find the device which encounter the failure, by iterate over all device on
> + * PCI bus to check if the memory failure address is located in the range
> + * of the BARs of the device.
> + */
> +static struct rte_pci_device *
> +pci_find_device_by_addr(const void *failure_addr)
> +{
> +	struct rte_pci_device *pdev = NULL;
> +	int i;
> +
> +	FOREACH_DEVICE_ON_PCIBUS(pdev) {
> +		for (i = 0; i != RTE_DIM(pdev->mem_resource); i++) {
> +			if ((uint64_t)(uintptr_t)failure_addr >=
> +			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr &&
> +			    (uint64_t)(uintptr_t)failure_addr <
> +			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr +
> +			    pdev->mem_resource[i].len) {

You must *really* dislike local variables :) Suggested rewriting:

const void *start, *end;
size_t len;

start = pdev->mem_resource[i].addr;
len = pdev->mem_resource[i].len;
end = RTE_PTR_ADD(start, len);

if (failure_addr >= start && failure_addr < end) {
	...
}

I think this is way clearer.

> +				RTE_LOG(INFO, EAL, "Failure address "
> +					"%16.16"PRIx64" belongs to "
> +					"device %s!\n",
> +					(uint64_t)(uintptr_t)failure_addr,
> +					pdev->device.name);

I feel like this should have DEBUG level, rather than INFO. 
Alternatively, if you really feel like this should be at level INFO, the 
message should be reworded because the word "failure" might give the 
wrong impression :)

(but really, i think this is info useful for debugging purposes but not 
interesting generally, so it should be under DEBUG IMO)
  
Guo, Jia Oct. 4, 2018, 3:58 a.m. UTC | #2
On 10/2/2018 10:39 PM, Burakov, Anatoly wrote:
> On 02-Oct-18 1:35 PM, Jeff Guo wrote:
>> This patch implements the ops for the PCI bus sigbus handler. It 
>> finds the
>> PCI device that is being hot-unplugged and calls the relevant ops of the
>> hot-unplug handler to handle the hot-unplug failure of the device.
>>
>> Signed-off-by: Jeff Guo <jia.guo@intel.com>
>> Acked-by: Shaopeng He <shaopeng.he@intel.com>
>> ---
>> v12->v11:
>> no change.
>> ---
>>   drivers/bus/pci/pci_common.c | 53 
>> ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 53 insertions(+)
>>
>> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
>> index d286234..f313fe9 100644
>> --- a/drivers/bus/pci/pci_common.c
>> +++ b/drivers/bus/pci/pci_common.c
>> @@ -405,6 +405,36 @@ pci_find_device(const struct rte_device *start, 
>> rte_dev_cmp_t cmp,
>>       return NULL;
>>   }
>>   +/**
>> + * find the device which encounter the failure, by iterate over all 
>> device on
>> + * PCI bus to check if the memory failure address is located in the 
>> range
>> + * of the BARs of the device.
>> + */
>> +static struct rte_pci_device *
>> +pci_find_device_by_addr(const void *failure_addr)
>> +{
>> +    struct rte_pci_device *pdev = NULL;
>> +    int i;
>> +
>> +    FOREACH_DEVICE_ON_PCIBUS(pdev) {
>> +        for (i = 0; i != RTE_DIM(pdev->mem_resource); i++) {
>> +            if ((uint64_t)(uintptr_t)failure_addr >=
>> + (uint64_t)(uintptr_t)pdev->mem_resource[i].addr &&
>> +                (uint64_t)(uintptr_t)failure_addr <
>> + (uint64_t)(uintptr_t)pdev->mem_resource[i].addr +
>> +                pdev->mem_resource[i].len) {
>
> You must *really* dislike local variables :) Suggested rewriting:
>
> const void *start, *end;
> size_t len;
>
> start = pdev->mem_resource[i].addr;
> len = pdev->mem_resource[i].len;
> end = RTE_PTR_ADD(start, len);
>
> if (failure_addr >= start && failure_addr < end) {
>     ...
> }
>
> I think this is way clearer.
>

good point, local variable might be good and helpful. Thanks.


>> +                RTE_LOG(INFO, EAL, "Failure address "
>> +                    "%16.16"PRIx64" belongs to "
>> +                    "device %s!\n",
>> +                    (uint64_t)(uintptr_t)failure_addr,
>> +                    pdev->device.name);
>
> I feel like this should have DEBUG level, rather than INFO. 
> Alternatively, if you really feel like this should be at level INFO, 
> the message should be reworded because the word "failure" might give 
> the wrong impression :)
>
> (but really, i think this is info useful for debugging purposes but 
> not interesting generally, so it should be under DEBUG IMO)
>

ok, i accept it.
  

Patch

diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index d286234..f313fe9 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -405,6 +405,36 @@  pci_find_device(const struct rte_device *start, rte_dev_cmp_t cmp,
 	return NULL;
 }
 
+/**
+ * find the device which encounter the failure, by iterate over all device on
+ * PCI bus to check if the memory failure address is located in the range
+ * of the BARs of the device.
+ */
+static struct rte_pci_device *
+pci_find_device_by_addr(const void *failure_addr)
+{
+	struct rte_pci_device *pdev = NULL;
+	int i;
+
+	FOREACH_DEVICE_ON_PCIBUS(pdev) {
+		for (i = 0; i != RTE_DIM(pdev->mem_resource); i++) {
+			if ((uint64_t)(uintptr_t)failure_addr >=
+			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr &&
+			    (uint64_t)(uintptr_t)failure_addr <
+			    (uint64_t)(uintptr_t)pdev->mem_resource[i].addr +
+			    pdev->mem_resource[i].len) {
+				RTE_LOG(INFO, EAL, "Failure address "
+					"%16.16"PRIx64" belongs to "
+					"device %s!\n",
+					(uint64_t)(uintptr_t)failure_addr,
+					pdev->device.name);
+				return pdev;
+			}
+		}
+	}
+	return NULL;
+}
+
 static int
 pci_hot_unplug_handler(struct rte_device *dev)
 {
@@ -433,6 +463,28 @@  pci_hot_unplug_handler(struct rte_device *dev)
 }
 
 static int
+pci_sigbus_handler(const void *failure_addr)
+{
+	struct rte_pci_device *pdev = NULL;
+	int ret = 0;
+
+	pdev = pci_find_device_by_addr(failure_addr);
+	if (!pdev) {
+		/* It is a generic sigbus error, no bus would handle it. */
+		ret = 1;
+	} else {
+		/* The sigbus error is caused of hot-unplug. */
+		ret = pci_hot_unplug_handler(&pdev->device);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "Failed to handle hot-unplug for "
+				"device %s", pdev->name);
+			ret = -1;
+		}
+	}
+	return ret;
+}
+
+static int
 pci_plug(struct rte_device *dev)
 {
 	return pci_probe_all_drivers(RTE_DEV_TO_PCI(dev));
@@ -463,6 +515,7 @@  struct rte_pci_bus rte_pci_bus = {
 		.parse = pci_parse,
 		.get_iommu_class = rte_pci_get_iommu_class,
 		.hot_unplug_handler = pci_hot_unplug_handler,
+		.sigbus_handler = pci_sigbus_handler,
 	},
 	.device_list = TAILQ_HEAD_INITIALIZER(rte_pci_bus.device_list),
 	.driver_list = TAILQ_HEAD_INITIALIZER(rte_pci_bus.driver_list),