[dpdk-dev,v2,4/4] net/failsafe: fix removed device handling

Message ID 20171213160916.e3rmxmhfhqz72wco@bidouze.vm.6wind.com (mailing list archive)
State Not Applicable, archived
Headers

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation fail apply patch file failure

Commit Message

Gaëtan Rivet Dec. 13, 2017, 4:09 p.m. UTC
  On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> Hi Gaetan
> Thanks for the review.
> Some comments..
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Wednesday, December 13, 2017 5:17 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > 
> > Hi Matan,
> > 
> > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > There is time between the physical removal of the device until
> > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > applications still don't know about the removal and may call
> > > sub-device control operation which should return an error.
> > >
> > > In previous code this error is reported to the application contrary to
> > > fail-safe principle that the app should not be aware of device removal.
> > >
> > > Add an removal check in each relevant control command error flow and
> > > prevent an error report to application when the sub-device is removed.
> > >
> > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > Cc: stable@dpdk.org
> > >
> > 
> > This patch is not a fix.
> > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > meaningless and would break compilation if backported in stable branch.
> > 
> 
> It is a fix because the bug is finally solved by this patch.
> I agree that it cannot be backported itself, but maybe all the series should be backported.
> Other idea:
> Add new patch which documents the bug and backport it.
> Remove it in this patch and remove cc stable from it.
> What do you think?
> 

I think you could write a crude version that would not rely on the
ethdev evolution (checking sdev->remove only), which would be incomplete
but still better than nothing.
And why not in this patch document the issue.
Without any dependency outside failsafe, this could be backported.

Then complete the fix with the API evolution if the new devops is
accepted.

> > Please remove those tags.
> > 
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----
> > ------
> > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > 
> > < ... >
> > 
> > > +/*
> > > + * Check if sub device was removed.
> > > + */
> > > +static inline int
> > > +fs_is_removed(struct sub_device *sdev) {
> > > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev))
> > != 0)
> > > +		return 1;
> > > +	return 0;
> > > +}
> > 
> > Have you considered adding this check within the subdev iterator itself?
> > I think it would prevent you from having to add it to each return value
> > checks.
> > 
> > It is still MT-unsafe anyway.
> >
> 
> This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal.
> Adding the check in subdev iterator doesn't make sense for this issue.
> 
> Matan. 

If you add this check in the iterator itself, you would skip removed
devices before attempting operating upon them, right?

Then it should probably help with your issue, unless you tested it and
verified that it didnt?

Something like this:

---8<---


--->8---

Only issue being that it is completely racy, but as this MT-unsafe property is
inescapable we might as well ignore it and go for KISS.

If that's enough, I would prefer instead of having this additional check
added to all rte_eth operations.
  

Comments

Thomas Monjalon Dec. 13, 2017, 5:09 p.m. UTC | #1
13/12/2017 17:09, Gaëtan Rivet:
> On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > Cc: stable@dpdk.org
> > > >
> > > 
> > > This patch is not a fix.
> > > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > > meaningless and would break compilation if backported in stable branch.
> > > 
> > 
> > It is a fix because the bug is finally solved by this patch.
> > I agree that it cannot be backported itself, but maybe all the series should be backported.
> > Other idea:
> > Add new patch which documents the bug and backport it.
> > Remove it in this patch and remove cc stable from it.
> > What do you think?
> > 
> 
> I think you could write a crude version that would not rely on the
> ethdev evolution (checking sdev->remove only), which would be incomplete
> but still better than nothing.
> And why not in this patch document the issue.
> Without any dependency outside failsafe, this could be backported.
> 
> Then complete the fix with the API evolution if the new devops is
> accepted.

I think it is not worth the effort.
It is a limitation in earlier releases and will be properly fixed
with the new op.
Please just remove the Cc:stable@dpdk.org.
  
Gaëtan Rivet Dec. 13, 2017, 9:55 p.m. UTC | #2
Hi again Matan,

On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote:
> On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > Hi Gaetan
> > Thanks for the review.
> > Some comments..
> > 
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Wednesday, December 13, 2017 5:17 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > > 
> > > Hi Matan,
> > > 
> > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > > There is time between the physical removal of the device until
> > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > applications still don't know about the removal and may call
> > > > sub-device control operation which should return an error.
> > > >
> > > > In previous code this error is reported to the application contrary to
> > > > fail-safe principle that the app should not be aware of device removal.
> > > >
> > > > Add an removal check in each relevant control command error flow and
> > > > prevent an error report to application when the sub-device is removed.
> > > >
> > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > Cc: stable@dpdk.org
> > > >
> > > 
> > > This patch is not a fix.
> > > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > > meaningless and would break compilation if backported in stable branch.
> > > 
> > 
> > It is a fix because the bug is finally solved by this patch.
> > I agree that it cannot be backported itself, but maybe all the series should be backported.
> > Other idea:
> > Add new patch which documents the bug and backport it.
> > Remove it in this patch and remove cc stable from it.
> > What do you think?
> > 
> 
> I think you could write a crude version that would not rely on the
> ethdev evolution (checking sdev->remove only), which would be incomplete
> but still better than nothing.
> And why not in this patch document the issue.
> Without any dependency outside failsafe, this could be backported.
> 
> Then complete the fix with the API evolution if the new devops is
> accepted.
> 
> > > Please remove those tags.
> > > 
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > > >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----
> > > ------
> > > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > > 
> > > < ... >
> > > 
> > > > +/*
> > > > + * Check if sub device was removed.
> > > > + */
> > > > +static inline int
> > > > +fs_is_removed(struct sub_device *sdev) {
> > > > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev))
> > > != 0)
> > > > +		return 1;
> > > > +	return 0;
> > > > +}
> > > 
> > > Have you considered adding this check within the subdev iterator itself?
> > > I think it would prevent you from having to add it to each return value
> > > checks.
> > > 
> > > It is still MT-unsafe anyway.
> > >
> > 
> > This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal.
> > Adding the check in subdev iterator doesn't make sense for this issue.
> > 
> > Matan. 
> 
> If you add this check in the iterator itself, you would skip removed
> devices before attempting operating upon them, right?
> 
> Then it should probably help with your issue, unless you tested it and
> verified that it didnt?
> 
> Something like this:
> 
> ---8<---
> 
> diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
> index d81cc3ca6..62ddc0689 100644
> --- a/drivers/net/failsafe/failsafe_private.h
> +++ b/drivers/net/failsafe/failsafe_private.h
> @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
>         subs = PRIV(dev)->subs;
>         tail = PRIV(dev)->subs_tail;
>         while (sid < tail) {
> +               if (min_state > DEV_PROBED &&
> +                   fs_is_removed(&sub[sid]))
> +                       goto next;
>                 if (subs[sid].state >= min_state)
>                         break;
> +next:
>                 sid++;
>         }
>         *sid_out = sid;
> 
> --->8---
> 
> Only issue being that it is completely racy, but as this MT-unsafe property is
> inescapable we might as well ignore it and go for KISS.
> 
> If that's enough, I would prefer instead of having this additional check
> added to all rte_eth operations.
> 

Ok, actually you were right here to do it this way. The "is_removed"
check needs to happen after the operation attempt to effectively
mitigate the possible race. Checking before attempting the call will be
much less effective.

That being said, would it be cleaner to have eth_dev ops return -ENODEV
directly, and check against it within fail-safe?
  
Matan Azrad Dec. 14, 2017, 10:40 a.m. UTC | #3
Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Wednesday, December 13, 2017 11:56 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> Hi again Matan,
> 
> On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote:
> > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > > Thanks for the review.
> > > Some comments..
> > >
> > > > -----Original Message-----
> > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > Sent: Wednesday, December 13, 2017 5:17 PM
> > > > To: Matan Azrad <matan@mellanox.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas
> Monjalon
> > > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device
> > > > handling
> > > >
> > > > Hi Matan,
> > > >
> > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > > > There is time between the physical removal of the device until
> > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > > applications still don't know about the removal and may call
> > > > > sub-device control operation which should return an error.
> > > > >
> > > > > In previous code this error is reported to the application
> > > > > contrary to fail-safe principle that the app should not be aware of
> device removal.
> > > > >
> > > > > Add an removal check in each relevant control command error flow
> > > > > and prevent an error report to application when the sub-device is
> removed.
> > > > >
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > >
> > > > This patch is not a fix.
> > > > It relies on an eth_dev API evolution. Without this evolution,
> > > > this patch is meaningless and would break compilation if backported in
> stable branch.
> > > >
> > >
> > > It is a fix because the bug is finally solved by this patch.
> > > I agree that it cannot be backported itself, but maybe all the series should
> be backported.
> > > Other idea:
> > > Add new patch which documents the bug and backport it.
> > > Remove it in this patch and remove cc stable from it.
> > > What do you think?
> > >
> >
> > I think you could write a crude version that would not rely on the
> > ethdev evolution (checking sdev->remove only), which would be
> > incomplete but still better than nothing.
> > And why not in this patch document the issue.
> > Without any dependency outside failsafe, this could be backported.
> >
> > Then complete the fix with the API evolution if the new devops is
> > accepted.
> >
> > > > Please remove those tags.
> > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > ---
> > > > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > > > >  drivers/net/failsafe/failsafe_ops.c     | 34
> ++++++++++++++++++++++-----
> > > > ------
> > > > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > > > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > > >
> > > > < ... >
> > > >
> > > > > +/*
> > > > > + * Check if sub device was removed.
> > > > > + */
> > > > > +static inline int
> > > > > +fs_is_removed(struct sub_device *sdev) {
> > > > > +	if (sdev->remove == 1 ||
> rte_eth_dev_is_removed(PORT_ID(sdev))
> > > > != 0)
> > > > > +		return 1;
> > > > > +	return 0;
> > > > > +}
> > > >
> > > > Have you considered adding this check within the subdev iterator itself?
> > > > I think it would prevent you from having to add it to each return
> > > > value checks.
> > > >
> > > > It is still MT-unsafe anyway.
> > > >
> > >
> > > This fix doesn't come to solve the MT issue, It comes to solve the error
> report to application because of removal.
> > > Adding the check in subdev iterator doesn't make sense for this issue.
> > >
> > > Matan.
> >
> > If you add this check in the iterator itself, you would skip removed
> > devices before attempting operating upon them, right?
> >
> > Then it should probably help with your issue, unless you tested it and
> > verified that it didnt?
> >
> > Something like this:
> >
> > ---8<---
> >
> > diff --git a/drivers/net/failsafe/failsafe_private.h
> > b/drivers/net/failsafe/failsafe_private.h
> > index d81cc3ca6..62ddc0689 100644
> > --- a/drivers/net/failsafe/failsafe_private.h
> > +++ b/drivers/net/failsafe/failsafe_private.h
> > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> >         subs = PRIV(dev)->subs;
> >         tail = PRIV(dev)->subs_tail;
> >         while (sid < tail) {
> > +               if (min_state > DEV_PROBED &&
> > +                   fs_is_removed(&sub[sid]))
> > +                       goto next;
> >                 if (subs[sid].state >= min_state)
> >                         break;
> > +next:
> >                 sid++;
> >         }
> >         *sid_out = sid;
> >
> > --->8---
> >
> > Only issue being that it is completely racy, but as this MT-unsafe
> > property is inescapable we might as well ignore it and go for KISS.
> >
> > If that's enough, I would prefer instead of having this additional
> > check added to all rte_eth operations.
> >
> 
> Ok, actually you were right here to do it this way. The "is_removed"
> check needs to happen after the operation attempt to effectively mitigate
> the possible race. Checking before attempting the call will be much less
> effective.
> 
> That being said, would it be cleaner to have eth_dev ops return -ENODEV
> directly, and check against it within fail-safe?
> 

I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()).

Thanks.
 

> --
> Gaëtan Rivet
> 6WIND
  
Matan Azrad Dec. 14, 2017, 10:40 a.m. UTC | #4
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, December 13, 2017 7:09 PM
> To: Gaëtan Rivet <gaetan.rivet@6wind.com>; Matan Azrad
> <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> 13/12/2017 17:09, Gaëtan Rivet:
> > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > >
> > > > This patch is not a fix.
> > > > It relies on an eth_dev API evolution. Without this evolution,
> > > > this patch is meaningless and would break compilation if backported in
> stable branch.
> > > >
> > >
> > > It is a fix because the bug is finally solved by this patch.
> > > I agree that it cannot be backported itself, but maybe all the series should
> be backported.
> > > Other idea:
> > > Add new patch which documents the bug and backport it.
> > > Remove it in this patch and remove cc stable from it.
> > > What do you think?
> > >
> >
> > I think you could write a crude version that would not rely on the
> > ethdev evolution (checking sdev->remove only), which would be
> > incomplete but still better than nothing.
> > And why not in this patch document the issue.
> > Without any dependency outside failsafe, this could be backported.
> >
> > Then complete the fix with the API evolution if the new devops is
> > accepted.
> 
> I think it is not worth the effort.
> It is a limitation in earlier releases and will be properly fixed with the new op.
> Please just remove the Cc:stable@dpdk.org.

OK.
  
Gaëtan Rivet Dec. 14, 2017, 10:48 a.m. UTC | #5
On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> Hi Gaetan
> 

<snip>

> > >
> > > If you add this check in the iterator itself, you would skip removed
> > > devices before attempting operating upon them, right?
> > >
> > > Then it should probably help with your issue, unless you tested it and
> > > verified that it didnt?
> > >
> > > Something like this:
> > >
> > > ---8<---
> > >
> > > diff --git a/drivers/net/failsafe/failsafe_private.h
> > > b/drivers/net/failsafe/failsafe_private.h
> > > index d81cc3ca6..62ddc0689 100644
> > > --- a/drivers/net/failsafe/failsafe_private.h
> > > +++ b/drivers/net/failsafe/failsafe_private.h
> > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> > >         subs = PRIV(dev)->subs;
> > >         tail = PRIV(dev)->subs_tail;
> > >         while (sid < tail) {
> > > +               if (min_state > DEV_PROBED &&
> > > +                   fs_is_removed(&sub[sid]))
> > > +                       goto next;
> > >                 if (subs[sid].state >= min_state)
> > >                         break;
> > > +next:
> > >                 sid++;
> > >         }
> > >         *sid_out = sid;
> > >
> > > --->8---
> > >
> > > Only issue being that it is completely racy, but as this MT-unsafe
> > > property is inescapable we might as well ignore it and go for KISS.
> > >
> > > If that's enough, I would prefer instead of having this additional
> > > check added to all rte_eth operations.
> > >
> > 
> > Ok, actually you were right here to do it this way. The "is_removed"
> > check needs to happen after the operation attempt to effectively mitigate
> > the possible race. Checking before attempting the call will be much less
> > effective.
> > 
> > That being said, would it be cleaner to have eth_dev ops return -ENODEV
> > directly, and check against it within fail-safe?
> > 
> 
> I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()).
> 

Sure, I wasn't discussing the interface proposed by rte_eth_dev_is_removed().

What I meant was to ask whether checking rte_eth_dev_is_removed() would
be more interesting in the ethdev layer, making the eth_dev_ops return
-ENODEV regardless of the previous error if this check is supported by
the driver and signal that the port is removed.

I think this information could be interesting to other systems, not just
fail-safe.
  
Matan Azrad Dec. 14, 2017, 1:07 p.m. UTC | #6
Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Thursday, December 14, 2017 12:49 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> 
> <snip>
> 
> > > >
> > > > If you add this check in the iterator itself, you would skip
> > > > removed devices before attempting operating upon them, right?
> > > >
> > > > Then it should probably help with your issue, unless you tested it
> > > > and verified that it didnt?
> > > >
> > > > Something like this:
> > > >
> > > > ---8<---
> > > >
> > > > diff --git a/drivers/net/failsafe/failsafe_private.h
> > > > b/drivers/net/failsafe/failsafe_private.h
> > > > index d81cc3ca6..62ddc0689 100644
> > > > --- a/drivers/net/failsafe/failsafe_private.h
> > > > +++ b/drivers/net/failsafe/failsafe_private.h
> > > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> > > >         subs = PRIV(dev)->subs;
> > > >         tail = PRIV(dev)->subs_tail;
> > > >         while (sid < tail) {
> > > > +               if (min_state > DEV_PROBED &&
> > > > +                   fs_is_removed(&sub[sid]))
> > > > +                       goto next;
> > > >                 if (subs[sid].state >= min_state)
> > > >                         break;
> > > > +next:
> > > >                 sid++;
> > > >         }
> > > >         *sid_out = sid;
> > > >
> > > > --->8---
> > > >
> > > > Only issue being that it is completely racy, but as this MT-unsafe
> > > > property is inescapable we might as well ignore it and go for KISS.
> > > >
> > > > If that's enough, I would prefer instead of having this additional
> > > > check added to all rte_eth operations.
> > > >
> > >
> > > Ok, actually you were right here to do it this way. The "is_removed"
> > > check needs to happen after the operation attempt to effectively
> > > mitigate the possible race. Checking before attempting the call will
> > > be much less effective.
> > >
> > > That being said, would it be cleaner to have eth_dev ops return
> > > -ENODEV directly, and check against it within fail-safe?
> > >
> >
> > I think that according to "is_removed" semantic we must return a Boolean
> value (Each value different from '0' means that the device is removed) like
> other functions in c library (for example isspace()).
> >
> 
> Sure, I wasn't discussing the interface proposed by
> rte_eth_dev_is_removed().
> 
> What I meant was to ask whether checking rte_eth_dev_is_removed()
> would be more interesting in the ethdev layer, making the eth_dev_ops
> return -ENODEV regardless of the previous error if this check is supported by
> the driver and signal that the port is removed.
> 
> I think this information could be interesting to other systems, not just fail-
> safe.
> 

Ok. Got you now.
Interesting approach - plan:
	1. update fs_link_update to use rte_eth* functions.
	2. maybe -EIO is preferred because -ENODEV is used for no port error?
	3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others).
	4. Change fs checks in error flows to check rte_eth* return values.
	5. Remove CC stable from commit massage.

What do you think?

> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Dec. 14, 2017, 1:27 p.m. UTC | #7
On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote:
> Hi Gaetan
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Thursday, December 14, 2017 12:49 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > 
> > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > >
> > 

<snip>

> > > > Ok, actually you were right here to do it this way. The "is_removed"
> > > > check needs to happen after the operation attempt to effectively
> > > > mitigate the possible race. Checking before attempting the call will
> > > > be much less effective.
> > > >
> > > > That being said, would it be cleaner to have eth_dev ops return
> > > > -ENODEV directly, and check against it within fail-safe?
> > > >
> > >
> > > I think that according to "is_removed" semantic we must return a Boolean
> > value (Each value different from '0' means that the device is removed) like
> > other functions in c library (for example isspace()).
> > >
> > 
> > Sure, I wasn't discussing the interface proposed by
> > rte_eth_dev_is_removed().
> > 
> > What I meant was to ask whether checking rte_eth_dev_is_removed()
> > would be more interesting in the ethdev layer, making the eth_dev_ops
> > return -ENODEV regardless of the previous error if this check is supported by
> > the driver and signal that the port is removed.
> > 
> > I think this information could be interesting to other systems, not just fail-
> > safe.
> > 
> 
> Ok. Got you now.
> Interesting approach - plan:
> 	1. update fs_link_update to use rte_eth* functions.

I'm surprised it doesn't already.
Either the rte_eth* function was introduced after the failsafe, or be
wary of potential issues. I don't see a problem right now though.

> 	2. maybe -EIO is preferred because -ENODEV is used for no port error?

Good point, didn't think about it.
Prepare yourself maybe to some arguments about the most relevant error
code. -EIO seems fine to me, but maybe use a wrapper for all this.

Something like:

---8<---

static int
eth_error(pid, int original_ret)
{
    int ret;

    if (original_ret == 0)
        return original_ret;
    ret = rte_eth_is_removed(pid);
    if (ret == 0 || ret == -ENOTSUP)
        return original_ret;
    return -EIO;
}

int
rte_eth_ops_xyz(pid)
{
        int ret;
        ret = eth_dev(pid).ops_xyz();
        return eth_error(pid, ret);
}

--->8---

This way you would be able to change it easily and the logic would be
insulated.

> 	3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others).
> 	4. Change fs checks in error flows to check rte_eth* return values.
> 	5. Remove CC stable from commit massage.
> 
> What do you think?
> 

Agreed otherwise.

Thanks,

> > --
> > Gaëtan Rivet
> > 6WIND
  
Matan Azrad Dec. 14, 2017, 2:43 p.m. UTC | #8
Hi

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Thursday, December 14, 2017 3:27 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Thursday, December 14, 2017 12:49 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device
> > > handling
> > >
> > > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > > > Hi Gaetan
> > > >
> > >
> 
> <snip>
> 
> > > > > Ok, actually you were right here to do it this way. The "is_removed"
> > > > > check needs to happen after the operation attempt to effectively
> > > > > mitigate the possible race. Checking before attempting the call
> > > > > will be much less effective.
> > > > >
> > > > > That being said, would it be cleaner to have eth_dev ops return
> > > > > -ENODEV directly, and check against it within fail-safe?
> > > > >
> > > >
> > > > I think that according to "is_removed" semantic we must return a
> > > > Boolean
> > > value (Each value different from '0' means that the device is
> > > removed) like other functions in c library (for example isspace()).
> > > >
> > >
> > > Sure, I wasn't discussing the interface proposed by
> > > rte_eth_dev_is_removed().
> > >
> > > What I meant was to ask whether checking rte_eth_dev_is_removed()
> > > would be more interesting in the ethdev layer, making the
> > > eth_dev_ops return -ENODEV regardless of the previous error if this
> > > check is supported by the driver and signal that the port is removed.
> > >
> > > I think this information could be interesting to other systems, not
> > > just fail- safe.
> > >
> >
> > Ok. Got you now.
> > Interesting approach - plan:
> > 	1. update fs_link_update to use rte_eth* functions.
> 
> I'm surprised it doesn't already.
> Either the rte_eth* function was introduced after the failsafe, or be wary of
> potential issues. I don't see a problem right now though.
> 
> > 	2. maybe -EIO is preferred because -ENODEV is used for no port
> error?
> 
> Good point, didn't think about it.
> Prepare yourself maybe to some arguments about the most relevant error
> code. -EIO seems fine to me, but maybe use a wrapper for all this.
> 
> Something like:
> 
> ---8<---
> 
> static int
> eth_error(pid, int original_ret)
> {
>     int ret;
> 
>     if (original_ret == 0)
>         return original_ret;
>     ret = rte_eth_is_removed(pid);
>     if (ret == 0 || ret == -ENOTSUP)
>         return original_ret;
>     return -EIO;
> }
> 
> int
> rte_eth_ops_xyz(pid)
> {
>         int ret;
>         ret = eth_dev(pid).ops_xyz();
>         return eth_error(pid, ret);
> }
> 
> --->8---
> 
> This way you would be able to change it easily and the logic would be
> insulated.
> 

Nice.

> > 	3. update all relevant rte_eth* to use "is_removed" in error flows(1
> patch for flow APIs and 1 for the others).
> > 	4. Change fs checks in error flows to check rte_eth* return values.
> > 	5. Remove CC stable from commit massage.
> >
> > What do you think?
> >
> 
> Agreed otherwise.
> 

Will create V3, thanks!

> Thanks,
> 
> > > --
> > > Gaëtan Rivet
> > > 6WIND
> 
> --
> Gaëtan Rivet
> 6WIND
  

Patch

diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3ca6..62ddc0689 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -316,8 +316,12 @@  fs_find_next(struct rte_eth_dev *dev,
        subs = PRIV(dev)->subs;
        tail = PRIV(dev)->subs_tail;
        while (sid < tail) {
+               if (min_state > DEV_PROBED &&
+                   fs_is_removed(&sub[sid]))
+                       goto next;
                if (subs[sid].state >= min_state)
                        break;
+next:
                sid++;
        }
        *sid_out = sid;