[2/2] kni: fix rtnl deadlocks and race conditions v4

Message ID 20210225143239.14220-2-eladv6@gmail.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series [1/2] kni: fix kernel deadlock when using mlx devices |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail apply issues

Commit Message

Elad Nachman Feb. 25, 2021, 2:32 p.m. UTC
  This part of the series includes my fixes for the issues reported
by Ferruh and Igor (and Igor comments for v3 of the patch)
on top of part 1 of the patch series:

A. KNI sync lock is being locked while rtnl is held.
If two threads are calling kni_net_process_request() ,
then the first one will take the sync lock, release rtnl lock then sleep.
The second thread will try to lock sync lock while holding rtnl.
The first thread will wake, and try to lock rtnl, resulting in a deadlock.
The remedy is to release rtnl before locking the KNI sync lock.
Since in between nothing is accessing Linux network-wise,
no rtnl locking is needed.

B. There is a race condition in __dev_close_many() processing the
close_list while the application terminates.
It looks like if two vEth devices are terminating,
and one releases the rtnl lock, the other takes it,
updating the close_list in an unstable state,
causing the close_list to become a circular linked list,
hence list_for_each_entry() will endlessly loop inside
__dev_close_many() .
Since the description for the original patch indicate the
original motivation was bringing the device up,
I have changed kni_net_process_request() to hold the rtnl mutex
in case of bringing the device down since this is the path called
from __dev_close_many() , causing the corruption of the close_list. 
In order to prevent deadlock in Mellanox device in this case, the
code has been modified not to wait for user-space while holding 
the rtnl lock.
Instead, after the request has been sent, all locks are relinquished
and the function exits immediately with return code of zero (success).

To summarize:
request != interface down : unlock rtnl, send request to user-space,
wait for response, send the response error code to caller in user-space.

request == interface down: send request to user-space, return immediately
with error code of 0 (success) to user-space.

Signed-off-by: Elad Nachman <eladv6@gmail.com>


---
v4:
* for if down case, send asynchronously with rtnl locked and without
  wait, returning immediately to avoid both kernel race conditions
  and deadlock in user-space
v3:
* Include original patch and new patch as a series of patch, added a
  comment to the new patch
v2:
* rebuild the patch as increment from patch 64106
* fix comment and blank lines
---
 kernel/linux/kni/kni_net.c      | 41 +++++++++++++++++++++++++++------
 lib/librte_kni/rte_kni.c        |  7 ++++--
 lib/librte_kni/rte_kni_common.h |  1 +
 3 files changed, 40 insertions(+), 9 deletions(-)
  

Comments

Igor Ryzhov Feb. 25, 2021, 9:01 p.m. UTC | #1
Hi Elad,

Thanks for the patch, but this is still NACK from me.

The only real advantage of KNI over other exceptional-path techniques
like virtio-user is the ability to configure DPDK-managed interfaces
directly
from the kernel using well-known utils like iproute2. A very important part
of this is getting responses from the DPDK app and knowing the actual
result of command execution.
If you're making async requests to the application and you don't know
the result, then what's the point of using KNI at all?

Igor

On Thu, Feb 25, 2021 at 5:32 PM Elad Nachman <eladv6@gmail.com> wrote:

> This part of the series includes my fixes for the issues reported
> by Ferruh and Igor (and Igor comments for v3 of the patch)
> on top of part 1 of the patch series:
>
> A. KNI sync lock is being locked while rtnl is held.
> If two threads are calling kni_net_process_request() ,
> then the first one will take the sync lock, release rtnl lock then sleep.
> The second thread will try to lock sync lock while holding rtnl.
> The first thread will wake, and try to lock rtnl, resulting in a deadlock.
> The remedy is to release rtnl before locking the KNI sync lock.
> Since in between nothing is accessing Linux network-wise,
> no rtnl locking is needed.
>
> B. There is a race condition in __dev_close_many() processing the
> close_list while the application terminates.
> It looks like if two vEth devices are terminating,
> and one releases the rtnl lock, the other takes it,
> updating the close_list in an unstable state,
> causing the close_list to become a circular linked list,
> hence list_for_each_entry() will endlessly loop inside
> __dev_close_many() .
> Since the description for the original patch indicate the
> original motivation was bringing the device up,
> I have changed kni_net_process_request() to hold the rtnl mutex
> in case of bringing the device down since this is the path called
> from __dev_close_many() , causing the corruption of the close_list.
> In order to prevent deadlock in Mellanox device in this case, the
> code has been modified not to wait for user-space while holding
> the rtnl lock.
> Instead, after the request has been sent, all locks are relinquished
> and the function exits immediately with return code of zero (success).
>
> To summarize:
> request != interface down : unlock rtnl, send request to user-space,
> wait for response, send the response error code to caller in user-space.
>
> request == interface down: send request to user-space, return immediately
> with error code of 0 (success) to user-space.
>
> Signed-off-by: Elad Nachman <eladv6@gmail.com>
>
>
> ---
> v4:
> * for if down case, send asynchronously with rtnl locked and without
>   wait, returning immediately to avoid both kernel race conditions
>   and deadlock in user-space
> v3:
> * Include original patch and new patch as a series of patch, added a
>   comment to the new patch
> v2:
> * rebuild the patch as increment from patch 64106
> * fix comment and blank lines
> ---
>  kernel/linux/kni/kni_net.c      | 41 +++++++++++++++++++++++++++------
>  lib/librte_kni/rte_kni.c        |  7 ++++--
>  lib/librte_kni/rte_kni_common.h |  1 +
>  3 files changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
> index f0b6e9a8d..ba991802b 100644
> --- a/kernel/linux/kni/kni_net.c
> +++ b/kernel/linux/kni/kni_net.c
> @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev,
> struct rte_kni_request *req)
>         void *resp_va;
>         uint32_t num;
>         int ret_val;
> +       int req_is_dev_stop = 0;
> +
> +       /* For configuring the interface to down,
> +        * rtnl must be held all the way to prevent race condition
> +        * inside __dev_close_many() between two netdev instances of KNI
> +        */
> +       if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
> +                       req->if_up == 0)
> +               req_is_dev_stop = 1;
>
>         ASSERT_RTNL();
>
> +       /* Since we need to wait and RTNL mutex is held
> +        * drop the mutex and hold reference to keep device
> +        */
> +       if (!req_is_dev_stop) {
> +               dev_hold(dev);
> +               rtnl_unlock();
> +       }
> +
>         mutex_lock(&kni->sync_lock);
>
> -       /* Construct data */
> +       /* Construct data, for dev stop send asynchronously
> +        * so instruct user-space not to sent response as no
> +        * one will be waiting for it.
> +        */
> +       if (req_is_dev_stop)
> +               req->skip_post_resp_to_q = 1;
>         memcpy(kni->sync_kva, req, sizeof(struct rte_kni_request));
>         num = kni_fifo_put(kni->req_q, &kni->sync_va, 1);
>         if (num < 1) {
> @@ -124,16 +146,16 @@ kni_net_process_request(struct net_device *dev,
> struct rte_kni_request *req)
>                 goto fail;
>         }
>
> -       /* Since we need to wait and RTNL mutex is held
> -        * drop the mutex and hold refernce to keep device
> +       /* No result available since request is handled
> +        * asynchronously. set response to success.
>          */
> -       dev_hold(dev);
> -       rtnl_unlock();
> +       if (req_is_dev_stop) {
> +               req->result = 0;
> +               goto async;
> +       }
>
>         ret_val = wait_event_interruptible_timeout(kni->wq,
>                         kni_fifo_count(kni->resp_q), 3 * HZ);
> -       rtnl_lock();
> -       dev_put(dev);
>
>         if (signal_pending(current) || ret_val <= 0) {
>                 ret = -ETIME;
> @@ -148,10 +170,15 @@ kni_net_process_request(struct net_device *dev,
> struct rte_kni_request *req)
>         }
>
>         memcpy(req, kni->sync_kva, sizeof(struct rte_kni_request));
> +async:
>         ret = 0;
>
>  fail:
>         mutex_unlock(&kni->sync_lock);
> +       if (!req_is_dev_stop) {
> +               rtnl_lock();
> +               dev_put(dev);
> +       }
>         return ret;
>  }
>
> diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
> index 837d0217d..6d777266d 100644
> --- a/lib/librte_kni/rte_kni.c
> +++ b/lib/librte_kni/rte_kni.c
> @@ -591,8 +591,11 @@ rte_kni_handle_request(struct rte_kni *kni)
>                 break;
>         }
>
> -       /* Construct response mbuf and put it back to resp_q */
> -       ret = kni_fifo_put(kni->resp_q, (void **)&req, 1);
> +       /* if needed, construct response mbuf and put it back to resp_q */
> +       if (!req->skip_post_resp_to_q)
> +               ret = kni_fifo_put(kni->resp_q, (void **)&req, 1);
> +       else
> +               ret = 1;
>         if (ret != 1) {
>                 RTE_LOG(ERR, KNI, "Fail to put the muf back to resp_q\n");
>                 return -1; /* It is an error of can't putting the mbuf
> back */
> diff --git a/lib/librte_kni/rte_kni_common.h
> b/lib/librte_kni/rte_kni_common.h
> index ffb318273..3b5d06850 100644
> --- a/lib/librte_kni/rte_kni_common.h
> +++ b/lib/librte_kni/rte_kni_common.h
> @@ -48,6 +48,7 @@ struct rte_kni_request {
>                 uint8_t promiscusity;/**< 1: promisc mode enable, 0:
> disable */
>                 uint8_t allmulti;    /**< 1: all-multicast mode enable, 0:
> disable */
>         };
> +       int32_t skip_post_resp_to_q; /**< 1: skip queue response 0:
> disable */
>         int32_t result;               /**< Result for processing request */
>  } __attribute__((__packed__));
>
> --
> 2.17.1
>
>
  
Stephen Hemminger Feb. 26, 2021, 3:48 p.m. UTC | #2
On Fri, 26 Feb 2021 00:01:01 +0300
Igor Ryzhov <iryzhov@nfware.com> wrote:

> Hi Elad,
> 
> Thanks for the patch, but this is still NACK from me.
> 
> The only real advantage of KNI over other exceptional-path techniques
> like virtio-user is the ability to configure DPDK-managed interfaces
> directly
> from the kernel using well-known utils like iproute2. A very important part
> of this is getting responses from the DPDK app and knowing the actual
> result of command execution.
> If you're making async requests to the application and you don't know
> the result, then what's the point of using KNI at all?
> 
> Igor

Do you have a better proposal that keeps the request result but does not
call userspace with lock held.

PS: I also have strong dislike of KNI, as designed it would have been rejected
by Linux kernel developers.  A better solution would be userspace version of
something like devlink devices. But doing control operations by proxy is
a locking nightmare.
  
Elad Nachman Feb. 26, 2021, 5:43 p.m. UTC | #3
The way the kernel handles its locks and lists for the dev close many
path, there is no way you can go around this with rtnl unlocked :
"

There is a race condition in __dev_close_many() processing the
close_list while the application terminates.
It looks like if two vEth devices are terminating,
and one releases the rtnl lock, the other takes it,
updating the close_list in an unstable state,
causing the close_list to become a circular linked list,
hence list_for_each_entry() will endlessly loop inside
__dev_close_many() .

"
And I don't expect David Miller will bend the kernel networking for DPDK or KNI.

But - Stephen - if you can personally convince David to accept a
kernel patch which will separate the close_list locking mechanism to a
separate (RCU?) lock, then I can introduce first a patch to the kernel
which will add a lock for the close_list, this way rtnl can be
unlocked for the if down case.

After that kernel patch, your original patch + relocation of the sync
mutex locking will do the job .

Otherwise, rtnl has to be kept locked all of the way for the if down
case in order to prevent corruption causing a circular linked list out
of the close_list, causing a hang in the kernel.

Currently, the rtnl lock is the only thing keeping the close_list from
corruption.

If you doubt rtnl cannot be unlocked for dev close path, you can
consult David for his opinion, as I think it is critical to understand
what the kernel can or cannot do, or expects to be done before we can
unlock its locks as we wish inside rte_kni.ko .

Otherwise, if we are still in disagreement on how to patch this set of
problems, I think the responsible way around it is to completely
remove kni from the main dpdk tree and move it to dpdk-kmods
repository.

I know BSD style open-source does not carry legal responsibility from
the developers, but I think when a bunch of developers know a piece of
code is highly buggy, they should not leave it for countless new users
to bounce their head desperately against, if they cannot agree on a
correct way to solve the bunch of problems, of which I think we all
agree exist (we just do not agree on the proper solution or patch)...

That's my two cents,

Elad.

On Fri, Feb 26, 2021 at 5:49 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Fri, 26 Feb 2021 00:01:01 +0300
> Igor Ryzhov <iryzhov@nfware.com> wrote:
>
> > Hi Elad,
> >
> > Thanks for the patch, but this is still NACK from me.
> >
> > The only real advantage of KNI over other exceptional-path techniques
> > like virtio-user is the ability to configure DPDK-managed interfaces
> > directly
> > from the kernel using well-known utils like iproute2. A very important part
> > of this is getting responses from the DPDK app and knowing the actual
> > result of command execution.
> > If you're making async requests to the application and you don't know
> > the result, then what's the point of using KNI at all?
> >
> > Igor
>
> Do you have a better proposal that keeps the request result but does not
> call userspace with lock held.
>
> PS: I also have strong dislike of KNI, as designed it would have been rejected
> by Linux kernel developers.  A better solution would be userspace version of
> something like devlink devices. But doing control operations by proxy is
> a locking nightmare.
  
Igor Ryzhov March 1, 2021, 8:10 a.m. UTC | #4
Stephen,

No, I don't have a better proposal, but I think it is not correct to change
the behavior of KNI (making link down without a real response).
Even though we know that communicating with userspace under rtnl_lock is a
bad idea, it works as it is for many years already.

Elad,

I agree with you that KNI should be removed from the main tree if it is not
possible to fix this __dev_close_many issue.
There were discussions about this multiple times already, but no one is
working on this AFAIK.
Last time the discussion was a month ago:
https://www.mail-archive.com/dev@dpdk.org/msg196033.html

Igor

On Fri, Feb 26, 2021 at 8:43 PM Elad Nachman <eladv6@gmail.com> wrote:

> The way the kernel handles its locks and lists for the dev close many
> path, there is no way you can go around this with rtnl unlocked :
> "
>
> There is a race condition in __dev_close_many() processing the
> close_list while the application terminates.
> It looks like if two vEth devices are terminating,
> and one releases the rtnl lock, the other takes it,
> updating the close_list in an unstable state,
> causing the close_list to become a circular linked list,
> hence list_for_each_entry() will endlessly loop inside
> __dev_close_many() .
>
> "
> And I don't expect David Miller will bend the kernel networking for DPDK
> or KNI.
>
> But - Stephen - if you can personally convince David to accept a
> kernel patch which will separate the close_list locking mechanism to a
> separate (RCU?) lock, then I can introduce first a patch to the kernel
> which will add a lock for the close_list, this way rtnl can be
> unlocked for the if down case.
>
> After that kernel patch, your original patch + relocation of the sync
> mutex locking will do the job .
>
> Otherwise, rtnl has to be kept locked all of the way for the if down
> case in order to prevent corruption causing a circular linked list out
> of the close_list, causing a hang in the kernel.
>
> Currently, the rtnl lock is the only thing keeping the close_list from
> corruption.
>
> If you doubt rtnl cannot be unlocked for dev close path, you can
> consult David for his opinion, as I think it is critical to understand
> what the kernel can or cannot do, or expects to be done before we can
> unlock its locks as we wish inside rte_kni.ko .
>
> Otherwise, if we are still in disagreement on how to patch this set of
> problems, I think the responsible way around it is to completely
> remove kni from the main dpdk tree and move it to dpdk-kmods
> repository.
>
> I know BSD style open-source does not carry legal responsibility from
> the developers, but I think when a bunch of developers know a piece of
> code is highly buggy, they should not leave it for countless new users
> to bounce their head desperately against, if they cannot agree on a
> correct way to solve the bunch of problems, of which I think we all
> agree exist (we just do not agree on the proper solution or patch)...
>
> That's my two cents,
>
> Elad.
>
> On Fri, Feb 26, 2021 at 5:49 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On Fri, 26 Feb 2021 00:01:01 +0300
> > Igor Ryzhov <iryzhov@nfware.com> wrote:
> >
> > > Hi Elad,
> > >
> > > Thanks for the patch, but this is still NACK from me.
> > >
> > > The only real advantage of KNI over other exceptional-path techniques
> > > like virtio-user is the ability to configure DPDK-managed interfaces
> > > directly
> > > from the kernel using well-known utils like iproute2. A very important
> part
> > > of this is getting responses from the DPDK app and knowing the actual
> > > result of command execution.
> > > If you're making async requests to the application and you don't know
> > > the result, then what's the point of using KNI at all?
> > >
> > > Igor
> >
> > Do you have a better proposal that keeps the request result but does not
> > call userspace with lock held.
> >
> > PS: I also have strong dislike of KNI, as designed it would have been
> rejected
> > by Linux kernel developers.  A better solution would be userspace
> version of
> > something like devlink devices. But doing control operations by proxy is
> > a locking nightmare.
>
  
Stephen Hemminger March 1, 2021, 4:38 p.m. UTC | #5
On Mon, 1 Mar 2021 11:10:01 +0300
Igor Ryzhov <iryzhov@nfware.com> wrote:

> Stephen,
> 
> No, I don't have a better proposal, but I think it is not correct to change
> the behavior of KNI (making link down without a real response).
> Even though we know that communicating with userspace under rtnl_lock is a
> bad idea, it works as it is for many years already.
> 
> Elad,
> 
> I agree with you that KNI should be removed from the main tree if it is not
> possible to fix this __dev_close_many issue.
> There were discussions about this multiple times already, but no one is
> working on this AFAIK.
> Last time the discussion was a month ago:
> https://www.mail-archive.com/dev@dpdk.org/msg196033.html
> 
> Igor

The better proposal would be to make DPDK virtio smarter.
There already is virtio devices that must handle this (VDPA) etc.
And when you can control link through virtio, then put a big warning
in KNI that says "Don't use this"
  
Dan Gora March 1, 2021, 8:27 p.m. UTC | #6
Hi All,

Sorry to butt in on this, but I fixed this same issue about 3 years
ago in my application, but I was never able to get the changes
integrated and eventually just gave up trying.

The rule with KNI is:
1) The app should have a separate control thread per rte_kni which
just spins calling rte_kni_handle_request().  This ensures that other
threads calling rte_kni_XXX functions will always get a response.

2) In order to deal with lockups and timeouts when closing the device, I sent
patches which separated the closing process into two steps:
rte_kni_release() which would unregister the underlying netdev, then
rte_kni_free() which would free the KNI portions of the KNI device.
When rte_kni_release() is called the kernel netdev is unregistered and
a response is sent back to the application, the control thread calling
rte_kni_handle_request() is still running, so the application will
still get a response back from the kernel and not lock up, the
application then kills the control thread so that
rte_kni_handle_request() is not called again, then the application
calls rte_kni_free() which frees all of the FIFOs and closes the
device.

If anyone is interested the patches are probably still floating around
patchwork.  If not you can check them out here:

https://github.com/danielgora/dpdk.git

thanks-
dan

On Mon, Mar 1, 2021 at 5:10 AM Igor Ryzhov <iryzhov@nfware.com> wrote:
>
> Stephen,
>
> No, I don't have a better proposal, but I think it is not correct to change
> the behavior of KNI (making link down without a real response).
> Even though we know that communicating with userspace under rtnl_lock is a
> bad idea, it works as it is for many years already.
>
> Elad,
>
> I agree with you that KNI should be removed from the main tree if it is not
> possible to fix this __dev_close_many issue.
> There were discussions about this multiple times already, but no one is
> working on this AFAIK.
> Last time the discussion was a month ago:
> https://www.mail-archive.com/dev@dpdk.org/msg196033.html
>
> Igor
>
> On Fri, Feb 26, 2021 at 8:43 PM Elad Nachman <eladv6@gmail.com> wrote:
>
> > The way the kernel handles its locks and lists for the dev close many
> > path, there is no way you can go around this with rtnl unlocked :
> > "
> >
> > There is a race condition in __dev_close_many() processing the
> > close_list while the application terminates.
> > It looks like if two vEth devices are terminating,
> > and one releases the rtnl lock, the other takes it,
> > updating the close_list in an unstable state,
> > causing the close_list to become a circular linked list,
> > hence list_for_each_entry() will endlessly loop inside
> > __dev_close_many() .
> >
> > "
> > And I don't expect David Miller will bend the kernel networking for DPDK
> > or KNI.
> >
> > But - Stephen - if you can personally convince David to accept a
> > kernel patch which will separate the close_list locking mechanism to a
> > separate (RCU?) lock, then I can introduce first a patch to the kernel
> > which will add a lock for the close_list, this way rtnl can be
> > unlocked for the if down case.
> >
> > After that kernel patch, your original patch + relocation of the sync
> > mutex locking will do the job .
> >lockups
> > Otherwise, rtnl has to be kept locked all of the way for the if down
> > case in order to prevent corruption causing a circular linked list out
> > of the close_list, causing a hang in the kernel.
> >lockups
> > Currently, the rtnl lock is the only thing keeping the close_list from
> > corruption.
> >
> > If you doubt rtnl cannot be unlocked for dev close path, you can
> > consult David for his opinion, as I think it is critical to understand
> > what the kernel can or cannot do, or expects to be done before we can
> > unlock its locks as we wish inside rte_kni.ko .
> >
> > Otherwise, if we are still in disagreement on how to patch this set of
> > problems, I think the responsible way around it is to completely
> > remove kni from the main dpdk tree and move it to dpdk-kmods
> > repository.
> >
> > I know BSD style open-source does not carry legal responsibility from
> > the developers, but I think when a bunch of developers know a piece of
> > code is highly buggy, they should not leave it for countless new users
> > to bounce their head desperately against, if they cannot agree on a
> > correct way to solve the bunch of problems, of which I think we all
> > agree exist (we just do not agree on the proper solution or patch)...
> >
> > That's my two cents,
> >
> > Elad.
> >
> > On Fri, Feb 26, 2021 at 5:49 PM Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> > >
> > > On Fri, 26 Feb 2021 00:01:01 +0300
> > > Igor Ryzhov <iryzhov@nfware.com> wrote:
> > >
> > > > Hi Elad,
> > > >
> > > > Thanks for the patch, but this is still NACK from me.
> > > >
> > > > The only real advantage of KNI over other exceptional-path techniques
> > > > like virtio-user is the ability to configure DPDK-managed interfaces
> > > > directly
> > > > from the kernel using well-known utils like iproute2. A very important
> > part
> > > > of this is getting responses from the DPDK app and knowing the actual
> > > > result of command execution.
> > > > If you're making async requests to the application and you don't know
> > > > the result, then what's the point of using KNI at all?
> > > >
> > > > Igor
> > >
> > > Do you have a better proposal that keeps the request result but does not
> > > call userspace with lock held.
> > >
> > > PS: I also have strong dislike of KNI, as designed it would have been
> > rejected
> > > by Linux kernel developers.  A better solution would be userspace
> > version of
> > > something like devlink devices. But doing control operations by proxy is
> > > a locking nightmare.
> >
  
Dan Gora March 1, 2021, 9:26 p.m. UTC | #7
This is from my git commit fixing this:

kni: separate releasing netdev from freeing KNI interface

    Currently the rte_kni kernel driver suffers from a problem where
    when the interface is released, it generates a callback to the DPDK
    application to change the interface state to Down.  However, after the
    DPDK application handles the callback and generates a response back to
    the kernel, the rte_kni driver cannot wake the thread which is asleep
    waiting for the response, because it is holding the kni_link_lock
    semaphore and it has already removed the 'struct kni_dev' from the
    list of interfaces to poll for responses.

    This means that if the KNI interface is in the Up state when
    rte_kni_release() is called, it will always sleep for three seconds
    until kni_net_release gives up waiting for a response from the DPDK
    application.

    To fix this, we must separate the step to release the kernel network
    interface from the steps to remove the KNI interface from the list
    of interfaces to poll.

    When the kernel network interface is removed with unregister_netdev(),
    if the interface is up, it will generate a callback to mark the
    interface down, which calls kni_net_release().  kni_net_release() will
    block waiting for the DPDK application to call rte_kni_handle_request()
    to handle the callback, but it also needs the thread in the KNI driver
    (either the per-dev thread for multi-thread or the per-driver thread)
    to call kni_net_poll_resp() in order to wake the thread sleeping in
    kni_net_release (actually kni_net_process_request()).

    So now, KNI interfaces should be removed as such:

    1) The user calls rte_kni_release().  This only unregisters the
    netdev in the kernel, but touches nothing else.  This allows all the
    threads to run which are necessary to handle the callback into the
    DPDK application to mark the interface down.

    2) The user stops the thread running rte_kni_handle_request().
    After rte_kni_release() has been called, there will be no more
    callbacks for that interface so it is not necessary.  It cannot be
    running at the same time that rte_kni_free() frees all of the FIFOs
    and DPDK memory for that KNI interface.

    3) The user calls rte_kni_free().  This performs the RTE_KNI_IOCTL_FREE
    ioctl which calls kni_ioctl_free().  This function removes the struct
    kni_dev from the list of interfaces to poll (and kills the per-dev
    kthread, if configured for multi-thread), then frees the memory in
    the FIFOs.

    Signed-off-by: Dan Gora <dg@adax.com>

I'm not sure that this is exactly the problem that you're seeing, but
it sounds like it to me.

thanks
dan

On Mon, Mar 1, 2021 at 5:27 PM Dan Gora <dg@adax.com> wrote:
>
> Hi All,
>
> Sorry to butt in on this, but I fixed this same issue about 3 years
> ago in my application, but I was never able to get the changes
> integrated and eventually just gave up trying.
>
> The rule with KNI is:
> 1) The app should have a separate control thread per rte_kni which
> just spins calling rte_kni_handle_request().  This ensures that other
> threads calling rte_kni_XXX functions will always get a response.
>
> 2) In order to deal with lockups and timeouts when closing the device, I sent
> patches which separated the closing process into two steps:
> rte_kni_release() which would unregister the underlying netdev, then
> rte_kni_free() which would free the KNI portions of the KNI device.
> When rte_kni_release() is called the kernel netdev is unregistered and
> a response is sent back to the application, the control thread calling
> rte_kni_handle_request() is still running, so the application will
> still get a response back from the kernel and not lock up, the
> application then kills the control thread so that
> rte_kni_handle_request() is not called again, then the application
> calls rte_kni_free() which frees all of the FIFOs and closes the
> device.
>
> If anyone is interested the patches are probably still floating around
> patchwork.  If not you can check them out here:
>
> https://github.com/danielgora/dpdk.git
>
> thanks-
> dan
>
> On Mon, Mar 1, 2021 at 5:10 AM Igor Ryzhov <iryzhov@nfware.com> wrote:
> >
> > Stephen,
> >
> > No, I don't have a better proposal, but I think it is not correct to change
> > the behavior of KNI (making link down without a real response).
> > Even though we know that communicating with userspace under rtnl_lock is a
> > bad idea, it works as it is for many years already.
> >
> > Elad,
> >
> > I agree with you that KNI should be removed from the main tree if it is not
> > possible to fix this __dev_close_many issue.
> > There were discussions about this multiple times already, but no one is
> > working on this AFAIK.
> > Last time the discussion was a month ago:
> > https://www.mail-archive.com/dev@dpdk.org/msg196033.html
> >
> > Igor
> >
> > On Fri, Feb 26, 2021 at 8:43 PM Elad Nachman <eladv6@gmail.com> wrote:
> >
> > > The way the kernel handles its locks and lists for the dev close many
> > > path, there is no way you can go around this with rtnl unlocked :
> > > "
> > >
> > > There is a race condition in __dev_close_many() processing the
> > > close_list while the application terminates.
> > > It looks like if two vEth devices are terminating,
> > > and one releases the rtnl lock, the other takes it,
> > > updating the close_list in an unstable state,
> > > causing the close_list to become a circular linked list,
> > > hence list_for_each_entry() will endlessly loop inside
> > > __dev_close_many() .
> > >
> > > "
> > > And I don't expect David Miller will bend the kernel networking for DPDK
> > > or KNI.
> > >
> > > But - Stephen - if you can personally convince David to accept a
> > > kernel patch which will separate the close_list locking mechanism to a
> > > separate (RCU?) lock, then I can introduce first a patch to the kernel
> > > which will add a lock for the close_list, this way rtnl can be
> > > unlocked for the if down case.
> > >
> > > After that kernel patch, your original patch + relocation of the sync
> > > mutex locking will do the job .
> > >lockups
> > > Otherwise, rtnl has to be kept locked all of the way for the if down
> > > case in order to prevent corruption causing a circular linked list out
> > > of the close_list, causing a hang in the kernel.
> > >lockups
> > > Currently, the rtnl lock is the only thing keeping the close_list from
> > > corruption.
> > >
> > > If you doubt rtnl cannot be unlocked for dev close path, you can
> > > consult David for his opinion, as I think it is critical to understand
> > > what the kernel can or cannot do, or expects to be done before we can
> > > unlock its locks as we wish inside rte_kni.ko .
> > >
> > > Otherwise, if we are still in disagreement on how to patch this set of
> > > problems, I think the responsible way around it is to completely
> > > remove kni from the main dpdk tree and move it to dpdk-kmods
> > > repository.
> > >
> > > I know BSD style open-source does not carry legal responsibility from
> > > the developers, but I think when a bunch of developers know a piece of
> > > code is highly buggy, they should not leave it for countless new users
> > > to bounce their head desperately against, if they cannot agree on a
> > > correct way to solve the bunch of problems, of which I think we all
> > > agree exist (we just do not agree on the proper solution or patch)...
> > >
> > > That's my two cents,
> > >
> > > Elad.
> > >
> > > On Fri, Feb 26, 2021 at 5:49 PM Stephen Hemminger
> > > <stephen@networkplumber.org> wrote:
> > > >
> > > > On Fri, 26 Feb 2021 00:01:01 +0300
> > > > Igor Ryzhov <iryzhov@nfware.com> wrote:
> > > >
> > > > > Hi Elad,
> > > > >
> > > > > Thanks for the patch, but this is still NACK from me.
> > > > >
> > > > > The only real advantage of KNI over other exceptional-path techniques
> > > > > like virtio-user is the ability to configure DPDK-managed interfaces
> > > > > directly
> > > > > from the kernel using well-known utils like iproute2. A very important
> > > part
> > > > > of this is getting responses from the DPDK app and knowing the actual
> > > > > result of command execution.
> > > > > If you're making async requests to the application and you don't know
> > > > > the result, then what's the point of using KNI at all?
> > > > >
> > > > > Igor
> > > >
> > > > Do you have a better proposal that keeps the request result but does not
> > > > call userspace with lock held.
> > > >
> > > > PS: I also have strong dislike of KNI, as designed it would have been
> > > rejected
> > > > by Linux kernel developers.  A better solution would be userspace
> > > version of
> > > > something like devlink devices. But doing control operations by proxy is
> > > > a locking nightmare.
> > >
  
Elad Nachman March 2, 2021, 4:44 p.m. UTC | #8
Hi Dan,

Thanks for the information but you are addressing a different problem. The
problem discussed here is making ifconfig up or down while the DPDK
application is running.

Elad.

בתאריך יום ב׳, 1 במרץ 2021, 23:26, מאת Dan Gora ‏<dg@adax.com>:

> This is from my git commit fixing this:
>
> kni: separate releasing netdev from freeing KNI interface
>
>     Currently the rte_kni kernel driver suffers from a problem where
>     when the interface is released, it generates a callback to the DPDK
>     application to change the interface state to Down.  However, after the
>     DPDK application handles the callback and generates a response back to
>     the kernel, the rte_kni driver cannot wake the thread which is asleep
>     waiting for the response, because it is holding the kni_link_lock
>     semaphore and it has already removed the 'struct kni_dev' from the
>     list of interfaces to poll for responses.
>
>     This means that if the KNI interface is in the Up state when
>     rte_kni_release() is called, it will always sleep for three seconds
>     until kni_net_release gives up waiting for a response from the DPDK
>     application.
>
>     To fix this, we must separate the step to release the kernel network
>     interface from the steps to remove the KNI interface from the list
>     of interfaces to poll.
>
>     When the kernel network interface is removed with unregister_netdev(),
>     if the interface is up, it will generate a callback to mark the
>     interface down, which calls kni_net_release().  kni_net_release() will
>     block waiting for the DPDK application to call rte_kni_handle_request()
>     to handle the callback, but it also needs the thread in the KNI driver
>     (either the per-dev thread for multi-thread or the per-driver thread)
>     to call kni_net_poll_resp() in order to wake the thread sleeping in
>     kni_net_release (actually kni_net_process_request()).
>
>     So now, KNI interfaces should be removed as such:
>
>     1) The user calls rte_kni_release().  This only unregisters the
>     netdev in the kernel, but touches nothing else.  This allows all the
>     threads to run which are necessary to handle the callback into the
>     DPDK application to mark the interface down.
>
>     2) The user stops the thread running rte_kni_handle_request().
>     After rte_kni_release() has been called, there will be no more
>     callbacks for that interface so it is not necessary.  It cannot be
>     running at the same time that rte_kni_free() frees all of the FIFOs
>     and DPDK memory for that KNI interface.
>
>     3) The user calls rte_kni_free().  This performs the RTE_KNI_IOCTL_FREE
>     ioctl which calls kni_ioctl_free().  This function removes the struct
>     kni_dev from the list of interfaces to poll (and kills the per-dev
>     kthread, if configured for multi-thread), then frees the memory in
>     the FIFOs.
>
>     Signed-off-by: Dan Gora <dg@adax.com>
>
> I'm not sure that this is exactly the problem that you're seeing, but
> it sounds like it to me.
>
> thanks
> dan
>
> On Mon, Mar 1, 2021 at 5:27 PM Dan Gora <dg@adax.com> wrote:
> >
> > Hi All,
> >
> > Sorry to butt in on this, but I fixed this same issue about 3 years
> > ago in my application, but I was never able to get the changes
> > integrated and eventually just gave up trying.
> >
> > The rule with KNI is:
> > 1) The app should have a separate control thread per rte_kni which
> > just spins calling rte_kni_handle_request().  This ensures that other
> > threads calling rte_kni_XXX functions will always get a response.
> >
> > 2) In order to deal with lockups and timeouts when closing the device, I
> sent
> > patches which separated the closing process into two steps:
> > rte_kni_release() which would unregister the underlying netdev, then
> > rte_kni_free() which would free the KNI portions of the KNI device.
> > When rte_kni_release() is called the kernel netdev is unregistered and
> > a response is sent back to the application, the control thread calling
> > rte_kni_handle_request() is still running, so the application will
> > still get a response back from the kernel and not lock up, the
> > application then kills the control thread so that
> > rte_kni_handle_request() is not called again, then the application
> > calls rte_kni_free() which frees all of the FIFOs and closes the
> > device.
> >
> > If anyone is interested the patches are probably still floating around
> > patchwork.  If not you can check them out here:
> >
> > https://github.com/danielgora/dpdk.git
> >
> > thanks-
> > dan
> >
> > On Mon, Mar 1, 2021 at 5:10 AM Igor Ryzhov <iryzhov@nfware.com> wrote:
> > >
> > > Stephen,
> > >
> > > No, I don't have a better proposal, but I think it is not correct to
> change
> > > the behavior of KNI (making link down without a real response).
> > > Even though we know that communicating with userspace under rtnl_lock
> is a
> > > bad idea, it works as it is for many years already.
> > >
> > > Elad,
> > >
> > > I agree with you that KNI should be removed from the main tree if it
> is not
> > > possible to fix this __dev_close_many issue.
> > > There were discussions about this multiple times already, but no one is
> > > working on this AFAIK.
> > > Last time the discussion was a month ago:
> > > https://www.mail-archive.com/dev@dpdk.org/msg196033.html
> > >
> > > Igor
> > >
> > > On Fri, Feb 26, 2021 at 8:43 PM Elad Nachman <eladv6@gmail.com> wrote:
> > >
> > > > The way the kernel handles its locks and lists for the dev close many
> > > > path, there is no way you can go around this with rtnl unlocked :
> > > > "
> > > >
> > > > There is a race condition in __dev_close_many() processing the
> > > > close_list while the application terminates.
> > > > It looks like if two vEth devices are terminating,
> > > > and one releases the rtnl lock, the other takes it,
> > > > updating the close_list in an unstable state,
> > > > causing the close_list to become a circular linked list,
> > > > hence list_for_each_entry() will endlessly loop inside
> > > > __dev_close_many() .
> > > >
> > > > "
> > > > And I don't expect David Miller will bend the kernel networking for
> DPDK
> > > > or KNI.
> > > >
> > > > But - Stephen - if you can personally convince David to accept a
> > > > kernel patch which will separate the close_list locking mechanism to
> a
> > > > separate (RCU?) lock, then I can introduce first a patch to the
> kernel
> > > > which will add a lock for the close_list, this way rtnl can be
> > > > unlocked for the if down case.
> > > >
> > > > After that kernel patch, your original patch + relocation of the sync
> > > > mutex locking will do the job .
> > > >lockups
> > > > Otherwise, rtnl has to be kept locked all of the way for the if down
> > > > case in order to prevent corruption causing a circular linked list
> out
> > > > of the close_list, causing a hang in the kernel.
> > > >lockups
> > > > Currently, the rtnl lock is the only thing keeping the close_list
> from
> > > > corruption.
> > > >
> > > > If you doubt rtnl cannot be unlocked for dev close path, you can
> > > > consult David for his opinion, as I think it is critical to
> understand
> > > > what the kernel can or cannot do, or expects to be done before we can
> > > > unlock its locks as we wish inside rte_kni.ko .
> > > >
> > > > Otherwise, if we are still in disagreement on how to patch this set
> of
> > > > problems, I think the responsible way around it is to completely
> > > > remove kni from the main dpdk tree and move it to dpdk-kmods
> > > > repository.
> > > >
> > > > I know BSD style open-source does not carry legal responsibility from
> > > > the developers, but I think when a bunch of developers know a piece
> of
> > > > code is highly buggy, they should not leave it for countless new
> users
> > > > to bounce their head desperately against, if they cannot agree on a
> > > > correct way to solve the bunch of problems, of which I think we all
> > > > agree exist (we just do not agree on the proper solution or patch)...
> > > >
> > > > That's my two cents,
> > > >
> > > > Elad.
> > > >
> > > > On Fri, Feb 26, 2021 at 5:49 PM Stephen Hemminger
> > > > <stephen@networkplumber.org> wrote:
> > > > >
> > > > > On Fri, 26 Feb 2021 00:01:01 +0300
> > > > > Igor Ryzhov <iryzhov@nfware.com> wrote:
> > > > >
> > > > > > Hi Elad,
> > > > > >
> > > > > > Thanks for the patch, but this is still NACK from me.
> > > > > >
> > > > > > The only real advantage of KNI over other exceptional-path
> techniques
> > > > > > like virtio-user is the ability to configure DPDK-managed
> interfaces
> > > > > > directly
> > > > > > from the kernel using well-known utils like iproute2. A very
> important
> > > > part
> > > > > > of this is getting responses from the DPDK app and knowing the
> actual
> > > > > > result of command execution.
> > > > > > If you're making async requests to the application and you don't
> know
> > > > > > the result, then what's the point of using KNI at all?
> > > > > >
> > > > > > Igor
> > > > >
> > > > > Do you have a better proposal that keeps the request result but
> does not
> > > > > call userspace with lock held.
> > > > >
> > > > > PS: I also have strong dislike of KNI, as designed it would have
> been
> > > > rejected
> > > > > by Linux kernel developers.  A better solution would be userspace
> > > > version of
> > > > > something like devlink devices. But doing control operations by
> proxy is
> > > > > a locking nightmare.
> > > >
>
>
>
> --
> Dan Gora
> Software Engineer
>
> Adax, Inc.
> Rua Dona Maria Alves, 1070 Casa 5
> Centro
> Ubatuba, SP
> CEP 11680-000
> Brasil
>
> Tel: +55 (12) 3833-1021  (Brazil and outside of US)
>     : +1 (510) 859-4801  (Inside of US)
>     : dan_gora (Skype)
>
> email: dg@adax.com
>
  
Ferruh Yigit March 15, 2021, 4:58 p.m. UTC | #9
On 3/1/2021 4:38 PM, Stephen Hemminger wrote:
> On Mon, 1 Mar 2021 11:10:01 +0300
> Igor Ryzhov <iryzhov@nfware.com> wrote:
> 
>> Stephen,
>>
>> No, I don't have a better proposal, but I think it is not correct to change
>> the behavior of KNI (making link down without a real response).
>> Even though we know that communicating with userspace under rtnl_lock is a
>> bad idea, it works as it is for many years already.
>>
>> Elad,
>>
>> I agree with you that KNI should be removed from the main tree if it is not
>> possible to fix this __dev_close_many issue.
>> There were discussions about this multiple times already, but no one is
>> working on this AFAIK.
>> Last time the discussion was a month ago:
>> https://www.mail-archive.com/dev@dpdk.org/msg196033.html
>>
>> Igor
> 
> The better proposal would be to make DPDK virtio smarter.
> There already is virtio devices that must handle this (VDPA) etc.
> And when you can control link through virtio, then put a big warning
> in KNI that says "Don't use this"
> 

Hi Igor, Elad,

I think it is reasonable to do the ifdown as async to solve the problem,
still we can make sync default, and async with kernel parameter, to cover both case.

I will put more details on the patches.
  
Ferruh Yigit March 15, 2021, 5:17 p.m. UTC | #10
On 2/25/2021 2:32 PM, Elad Nachman wrote:
> This part of the series includes my fixes for the issues reported
> by Ferruh and Igor (and Igor comments for v3 of the patch)
> on top of part 1 of the patch series:
> 
> A. KNI sync lock is being locked while rtnl is held.
> If two threads are calling kni_net_process_request() ,
> then the first one will take the sync lock, release rtnl lock then sleep.
> The second thread will try to lock sync lock while holding rtnl.
> The first thread will wake, and try to lock rtnl, resulting in a deadlock.
> The remedy is to release rtnl before locking the KNI sync lock.
> Since in between nothing is accessing Linux network-wise,
> no rtnl locking is needed.
> 
> B. There is a race condition in __dev_close_many() processing the
> close_list while the application terminates.
> It looks like if two vEth devices are terminating,
> and one releases the rtnl lock, the other takes it,
> updating the close_list in an unstable state,
> causing the close_list to become a circular linked list,
> hence list_for_each_entry() will endlessly loop inside
> __dev_close_many() .
> Since the description for the original patch indicate the
> original motivation was bringing the device up,
> I have changed kni_net_process_request() to hold the rtnl mutex
> in case of bringing the device down since this is the path called
> from __dev_close_many() , causing the corruption of the close_list.
> In order to prevent deadlock in Mellanox device in this case, the
> code has been modified not to wait for user-space while holding
> the rtnl lock.
> Instead, after the request has been sent, all locks are relinquished
> and the function exits immediately with return code of zero (success).
> 
> To summarize:
> request != interface down : unlock rtnl, send request to user-space,
> wait for response, send the response error code to caller in user-space.
> 
> request == interface down: send request to user-space, return immediately
> with error code of 0 (success) to user-space.
> 
> Signed-off-by: Elad Nachman <eladv6@gmail.com>
> 
> 
> ---
> v4:
> * for if down case, send asynchronously with rtnl locked and without
>    wait, returning immediately to avoid both kernel race conditions
>    and deadlock in user-space
> v3:
> * Include original patch and new patch as a series of patch, added a
>    comment to the new patch
> v2:
> * rebuild the patch as increment from patch 64106
> * fix comment and blank lines
> ---
>   kernel/linux/kni/kni_net.c      | 41 +++++++++++++++++++++++++++------
>   lib/librte_kni/rte_kni.c        |  7 ++++--
>   lib/librte_kni/rte_kni_common.h |  1 +
>   3 files changed, 40 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
> index f0b6e9a8d..ba991802b 100644
> --- a/kernel/linux/kni/kni_net.c
> +++ b/kernel/linux/kni/kni_net.c
> @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
>   	void *resp_va;
>   	uint32_t num;
>   	int ret_val;
> +	int req_is_dev_stop = 0;
> +
> +	/* For configuring the interface to down,
> +	 * rtnl must be held all the way to prevent race condition
> +	 * inside __dev_close_many() between two netdev instances of KNI
> +	 */
> +	if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
> +			req->if_up == 0)
> +		req_is_dev_stop = 1;

Having this request type checks in the 'kni_net_process_request()' function 
looks like hack.
Since adding a new field into the "struct rte_kni_request", that can be a more 
generic 'asnyc' field, and the requested function, like 'kni_net_release()' can 
set it to support async requests.

And can you please separate the function to add a more generic async request 
support on its patch, which should do:
- add new 'asnyc' field to "struct rte_kni_request"
- in 'kni_net_process_request()', if 'req->async' set, do not wait for response
- in library, 'lib/librte_kni/rte_kni.c', in 'rte_kni_handle_request()' 
function, if the request is async don't put the response
(These are already done in this patch)

Overall it can be three patch set:
1) Function parameter change
2) Add generic async request support (with documentation update)
3) rtnl unlock and make 'kni_net_release()' request async (actual fix)
(We can discuss more if to make 'kni_net_release()' async with a kernel 
parameter or not)

What do you think, does it make sense?
  
Elad Nachman March 16, 2021, 6:35 p.m. UTC | #11
Hi,

Owing to my current development schedule and obligations, I see no
opportunity to make this set of changes in the near future.

Sorry,

Elad.

בתאריך יום ב׳, 15 במרץ 2021, 19:17, מאת Ferruh Yigit ‏<
ferruh.yigit@intel.com>:

> On 2/25/2021 2:32 PM, Elad Nachman wrote:
> > This part of the series includes my fixes for the issues reported
> > by Ferruh and Igor (and Igor comments for v3 of the patch)
> > on top of part 1 of the patch series:
> >
> > A. KNI sync lock is being locked while rtnl is held.
> > If two threads are calling kni_net_process_request() ,
> > then the first one will take the sync lock, release rtnl lock then sleep.
> > The second thread will try to lock sync lock while holding rtnl.
> > The first thread will wake, and try to lock rtnl, resulting in a
> deadlock.
> > The remedy is to release rtnl before locking the KNI sync lock.
> > Since in between nothing is accessing Linux network-wise,
> > no rtnl locking is needed.
> >
> > B. There is a race condition in __dev_close_many() processing the
> > close_list while the application terminates.
> > It looks like if two vEth devices are terminating,
> > and one releases the rtnl lock, the other takes it,
> > updating the close_list in an unstable state,
> > causing the close_list to become a circular linked list,
> > hence list_for_each_entry() will endlessly loop inside
> > __dev_close_many() .
> > Since the description for the original patch indicate the
> > original motivation was bringing the device up,
> > I have changed kni_net_process_request() to hold the rtnl mutex
> > in case of bringing the device down since this is the path called
> > from __dev_close_many() , causing the corruption of the close_list.
> > In order to prevent deadlock in Mellanox device in this case, the
> > code has been modified not to wait for user-space while holding
> > the rtnl lock.
> > Instead, after the request has been sent, all locks are relinquished
> > and the function exits immediately with return code of zero (success).
> >
> > To summarize:
> > request != interface down : unlock rtnl, send request to user-space,
> > wait for response, send the response error code to caller in user-space.
> >
> > request == interface down: send request to user-space, return immediately
> > with error code of 0 (success) to user-space.
> >
> > Signed-off-by: Elad Nachman <eladv6@gmail.com>
> >
> >
> > ---
> > v4:
> > * for if down case, send asynchronously with rtnl locked and without
> >    wait, returning immediately to avoid both kernel race conditions
> >    and deadlock in user-space
> > v3:
> > * Include original patch and new patch as a series of patch, added a
> >    comment to the new patch
> > v2:
> > * rebuild the patch as increment from patch 64106
> > * fix comment and blank lines
> > ---
> >   kernel/linux/kni/kni_net.c      | 41 +++++++++++++++++++++++++++------
> >   lib/librte_kni/rte_kni.c        |  7 ++++--
> >   lib/librte_kni/rte_kni_common.h |  1 +
> >   3 files changed, 40 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
> > index f0b6e9a8d..ba991802b 100644
> > --- a/kernel/linux/kni/kni_net.c
> > +++ b/kernel/linux/kni/kni_net.c
> > @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev,
> struct rte_kni_request *req)
> >       void *resp_va;
> >       uint32_t num;
> >       int ret_val;
> > +     int req_is_dev_stop = 0;
> > +
> > +     /* For configuring the interface to down,
> > +      * rtnl must be held all the way to prevent race condition
> > +      * inside __dev_close_many() between two netdev instances of KNI
> > +      */
> > +     if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
> > +                     req->if_up == 0)
> > +             req_is_dev_stop = 1;
>
> Having this request type checks in the 'kni_net_process_request()'
> function
> looks like hack.
> Since adding a new field into the "struct rte_kni_request", that can be a
> more
> generic 'asnyc' field, and the requested function, like
> 'kni_net_release()' can
> set it to support async requests.
>
> And can you please separate the function to add a more generic async
> request
> support on its patch, which should do:
> - add new 'asnyc' field to "struct rte_kni_request"
> - in 'kni_net_process_request()', if 'req->async' set, do not wait for
> response
> - in library, 'lib/librte_kni/rte_kni.c', in 'rte_kni_handle_request()'
> function, if the request is async don't put the response
> (These are already done in this patch)
>
> Overall it can be three patch set:
> 1) Function parameter change
> 2) Add generic async request support (with documentation update)
> 3) rtnl unlock and make 'kni_net_release()' request async (actual fix)
> (We can discuss more if to make 'kni_net_release()' async with a kernel
> parameter or not)
>
> What do you think, does it make sense?
>
>
  
Ferruh Yigit March 16, 2021, 6:42 p.m. UTC | #12
On 3/16/2021 6:35 PM, Elad Nachman wrote:
> Hi,
> 
> Owing to my current development schedule and obligations, I see no opportunity 
> to make this set of changes in the near future.
> 

I can do on top of your work if you don't mind?

> Sorry,
> 
> Elad.
> 
> בתאריך יום ב׳, 15 במרץ 2021, 19:17, מאת Ferruh Yigit ‏<ferruh.yigit@intel.com 
> <mailto:ferruh.yigit@intel.com>>:
> 
>     On 2/25/2021 2:32 PM, Elad Nachman wrote:
>      > This part of the series includes my fixes for the issues reported
>      > by Ferruh and Igor (and Igor comments for v3 of the patch)
>      > on top of part 1 of the patch series:
>      >
>      > A. KNI sync lock is being locked while rtnl is held.
>      > If two threads are calling kni_net_process_request() ,
>      > then the first one will take the sync lock, release rtnl lock then sleep.
>      > The second thread will try to lock sync lock while holding rtnl.
>      > The first thread will wake, and try to lock rtnl, resulting in a deadlock.
>      > The remedy is to release rtnl before locking the KNI sync lock.
>      > Since in between nothing is accessing Linux network-wise,
>      > no rtnl locking is needed.
>      >
>      > B. There is a race condition in __dev_close_many() processing the
>      > close_list while the application terminates.
>      > It looks like if two vEth devices are terminating,
>      > and one releases the rtnl lock, the other takes it,
>      > updating the close_list in an unstable state,
>      > causing the close_list to become a circular linked list,
>      > hence list_for_each_entry() will endlessly loop inside
>      > __dev_close_many() .
>      > Since the description for the original patch indicate the
>      > original motivation was bringing the device up,
>      > I have changed kni_net_process_request() to hold the rtnl mutex
>      > in case of bringing the device down since this is the path called
>      > from __dev_close_many() , causing the corruption of the close_list.
>      > In order to prevent deadlock in Mellanox device in this case, the
>      > code has been modified not to wait for user-space while holding
>      > the rtnl lock.
>      > Instead, after the request has been sent, all locks are relinquished
>      > and the function exits immediately with return code of zero (success).
>      >
>      > To summarize:
>      > request != interface down : unlock rtnl, send request to user-space,
>      > wait for response, send the response error code to caller in user-space.
>      >
>      > request == interface down: send request to user-space, return immediately
>      > with error code of 0 (success) to user-space.
>      >
>      > Signed-off-by: Elad Nachman <eladv6@gmail.com <mailto:eladv6@gmail.com>>
>      >
>      >
>      > ---
>      > v4:
>      > * for if down case, send asynchronously with rtnl locked and without
>      >    wait, returning immediately to avoid both kernel race conditions
>      >    and deadlock in user-space
>      > v3:
>      > * Include original patch and new patch as a series of patch, added a
>      >    comment to the new patch
>      > v2:
>      > * rebuild the patch as increment from patch 64106
>      > * fix comment and blank lines
>      > ---
>      >   kernel/linux/kni/kni_net.c      | 41 +++++++++++++++++++++++++++------
>      >   lib/librte_kni/rte_kni.c        |  7 ++++--
>      >   lib/librte_kni/rte_kni_common.h |  1 +
>      >   3 files changed, 40 insertions(+), 9 deletions(-)
>      >
>      > diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
>      > index f0b6e9a8d..ba991802b 100644
>      > --- a/kernel/linux/kni/kni_net.c
>      > +++ b/kernel/linux/kni/kni_net.c
>      > @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev,
>     struct rte_kni_request *req)
>      >       void *resp_va;
>      >       uint32_t num;
>      >       int ret_val;
>      > +     int req_is_dev_stop = 0;
>      > +
>      > +     /* For configuring the interface to down,
>      > +      * rtnl must be held all the way to prevent race condition
>      > +      * inside __dev_close_many() between two netdev instances of KNI
>      > +      */
>      > +     if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
>      > +                     req->if_up == 0)
>      > +             req_is_dev_stop = 1;
> 
>     Having this request type checks in the 'kni_net_process_request()' function
>     looks like hack.
>     Since adding a new field into the "struct rte_kni_request", that can be a more
>     generic 'asnyc' field, and the requested function, like 'kni_net_release()' can
>     set it to support async requests.
> 
>     And can you please separate the function to add a more generic async request
>     support on its patch, which should do:
>     - add new 'asnyc' field to "struct rte_kni_request"
>     - in 'kni_net_process_request()', if 'req->async' set, do not wait for response
>     - in library, 'lib/librte_kni/rte_kni.c', in 'rte_kni_handle_request()'
>     function, if the request is async don't put the response
>     (These are already done in this patch)
> 
>     Overall it can be three patch set:
>     1) Function parameter change
>     2) Add generic async request support (with documentation update)
>     3) rtnl unlock and make 'kni_net_release()' request async (actual fix)
>     (We can discuss more if to make 'kni_net_release()' async with a kernel
>     parameter or not)
> 
>     What do you think, does it make sense?
>
  

Patch

diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index f0b6e9a8d..ba991802b 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -110,12 +110,34 @@  kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 	void *resp_va;
 	uint32_t num;
 	int ret_val;
+	int req_is_dev_stop = 0;
+
+	/* For configuring the interface to down,
+	 * rtnl must be held all the way to prevent race condition
+	 * inside __dev_close_many() between two netdev instances of KNI
+	 */
+	if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
+			req->if_up == 0)
+		req_is_dev_stop = 1;
 
 	ASSERT_RTNL();
 
+	/* Since we need to wait and RTNL mutex is held
+	 * drop the mutex and hold reference to keep device
+	 */
+	if (!req_is_dev_stop) {
+		dev_hold(dev);
+		rtnl_unlock();
+	}
+
 	mutex_lock(&kni->sync_lock);
 
-	/* Construct data */
+	/* Construct data, for dev stop send asynchronously
+	 * so instruct user-space not to sent response as no
+	 * one will be waiting for it.
+	 */
+	if (req_is_dev_stop)
+		req->skip_post_resp_to_q = 1;
 	memcpy(kni->sync_kva, req, sizeof(struct rte_kni_request));
 	num = kni_fifo_put(kni->req_q, &kni->sync_va, 1);
 	if (num < 1) {
@@ -124,16 +146,16 @@  kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 		goto fail;
 	}
 
-	/* Since we need to wait and RTNL mutex is held
-	 * drop the mutex and hold refernce to keep device
+	/* No result available since request is handled
+	 * asynchronously. set response to success.
 	 */
-	dev_hold(dev);
-	rtnl_unlock();
+	if (req_is_dev_stop) {
+		req->result = 0;
+		goto async;
+	}
 
 	ret_val = wait_event_interruptible_timeout(kni->wq,
 			kni_fifo_count(kni->resp_q), 3 * HZ);
-	rtnl_lock();
-	dev_put(dev);
 
 	if (signal_pending(current) || ret_val <= 0) {
 		ret = -ETIME;
@@ -148,10 +170,15 @@  kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 	}
 
 	memcpy(req, kni->sync_kva, sizeof(struct rte_kni_request));
+async:
 	ret = 0;
 
 fail:
 	mutex_unlock(&kni->sync_lock);
+	if (!req_is_dev_stop) {
+		rtnl_lock();
+		dev_put(dev);
+	}
 	return ret;
 }
 
diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index 837d0217d..6d777266d 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -591,8 +591,11 @@  rte_kni_handle_request(struct rte_kni *kni)
 		break;
 	}
 
-	/* Construct response mbuf and put it back to resp_q */
-	ret = kni_fifo_put(kni->resp_q, (void **)&req, 1);
+	/* if needed, construct response mbuf and put it back to resp_q */
+	if (!req->skip_post_resp_to_q)
+		ret = kni_fifo_put(kni->resp_q, (void **)&req, 1);
+	else
+		ret = 1;
 	if (ret != 1) {
 		RTE_LOG(ERR, KNI, "Fail to put the muf back to resp_q\n");
 		return -1; /* It is an error of can't putting the mbuf back */
diff --git a/lib/librte_kni/rte_kni_common.h b/lib/librte_kni/rte_kni_common.h
index ffb318273..3b5d06850 100644
--- a/lib/librte_kni/rte_kni_common.h
+++ b/lib/librte_kni/rte_kni_common.h
@@ -48,6 +48,7 @@  struct rte_kni_request {
 		uint8_t promiscusity;/**< 1: promisc mode enable, 0: disable */
 		uint8_t allmulti;    /**< 1: all-multicast mode enable, 0: disable */
 	};
+	int32_t skip_post_resp_to_q; /**< 1: skip queue response 0: disable */
 	int32_t result;               /**< Result for processing request */
 } __attribute__((__packed__));