net/netvsc: subchannel configuration failed due to unexpected NVS response

Message ID CAEHUbGT6exgRiV4_5kjOZgfSwuyP53Lci6qAzvQowM8K=vn3_A@mail.gmail.com (mailing list archive)
State Rejected, archived
Delegated to: Ferruh Yigit
Headers
Series net/netvsc: subchannel configuration failed due to unexpected NVS response |

Checks

Context Check Description
ci/Intel-compilation fail apply issues

Commit Message

Min Tang Feb. 27, 2020, 4:16 p.m. UTC
  Hi Stephen:

I saw the following error messages when using DPDK 18.11.2 in Azure:

hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
hn_dev_configure(): subchannel configuration failed

It was not easy to reproduce it and it only occurred with multiple queues
enabled. In hn_nvs_execute it expects the response to match the request. In
the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
looked at the code and found that the NVS_TYPE_RNDIS message needs
completion response but it does not receive the response message anywhere.
The fix would be receiving and discarding the wrong response message(s).

I put the following patches and it has fixed the problem.
  

Comments

Stephen Hemminger Feb. 27, 2020, 5:47 p.m. UTC | #1
On Thu, 27 Feb 2020 11:16:01 -0500
Min Tang <tommytang@gmail.com> wrote:

> Hi Stephen:
> 
> I saw the following error messages when using DPDK 18.11.2 in Azure:
> 
> hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
> hn_dev_configure(): subchannel configuration failed
> 
> It was not easy to reproduce it and it only occurred with multiple queues
> enabled. In hn_nvs_execute it expects the response to match the request. In
> the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
> got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
> message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
> looked at the code and found that the NVS_TYPE_RNDIS message needs
> completion response but it does not receive the response message anywhere.
> The fix would be receiving and discarding the wrong response message(s).
> 
> I put the following patches and it has fixed the problem.
> 
> --- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
> +++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
> @@ -92,7 +92,7 @@
>   if (hdr->type != type) {
>   PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
>      hdr->type, type);
> - goto retry;
> + return -EINVAL;
>   }
> 
>   if (len < resplen) {

Thanks for the analysis. Not sure if this the right fix.
Looks like the control channel needs additional locking.
Having two outstanding requests at once is not going to work well.
  
Min Tang Feb. 27, 2020, 6:24 p.m. UTC | #2
That quick fix was just to verify my guess. I agree that it needs more
comprehensive fix.

Yes, race condition is another issue here. In addition to that, I think in
the function that sends the NVS_TYPE_RNDIS message, it needs to drain the
response message.
I looked at the netvsc driver in Linux kernel, it receives all the VMBus
messages anachronously in another thread. That's probably something we can
think about in the DPDK driver.


On Thu, Feb 27, 2020 at 12:47 PM Stephen Hemminger <
stephen@networkplumber.org> wrote:

> On Thu, 27 Feb 2020 11:16:01 -0500
> Min Tang <tommytang@gmail.com> wrote:
>
> > Hi Stephen:
> >
> > I saw the following error messages when using DPDK 18.11.2 in Azure:
> >
> > hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
> > hn_dev_configure(): subchannel configuration failed
> >
> > It was not easy to reproduce it and it only occurred with multiple queues
> > enabled. In hn_nvs_execute it expects the response to match the request.
> In
> > the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
> > got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
> > message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
> > looked at the code and found that the NVS_TYPE_RNDIS message needs
> > completion response but it does not receive the response message
> anywhere.
> > The fix would be receiving and discarding the wrong response message(s).
> >
> > I put the following patches and it has fixed the problem.
> >
> > --- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
> > +++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
> > @@ -92,7 +92,7 @@
> >   if (hdr->type != type) {
> >   PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
> >      hdr->type, type);
> > - goto retry;
> > + return -EINVAL;
> >   }
> >
> >   if (len < resplen) {
>
> Thanks for the analysis. Not sure if this the right fix.
> Looks like the control channel needs additional locking.
> Having two outstanding requests at once is not going to work well.
>
  
Stephen Hemminger March 1, 2020, 5:54 p.m. UTC | #3
On Thu, 27 Feb 2020 11:16:01 -0500
Min Tang <tommytang@gmail.com> wrote:

> Hi Stephen:
> 
> I saw the following error messages when using DPDK 18.11.2 in Azure:
> 
> hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
> hn_dev_configure(): subchannel configuration failed
> 
> It was not easy to reproduce it and it only occurred with multiple queues
> enabled. In hn_nvs_execute it expects the response to match the request. In
> the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
> got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
> message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
> looked at the code and found that the NVS_TYPE_RNDIS message needs
> completion response but it does not receive the response message anywhere.
> The fix would be receiving and discarding the wrong response message(s).
> 
> I put the following patches and it has fixed the problem.
> 
> --- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
> +++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
> @@ -92,7 +92,7 @@
>   if (hdr->type != type) {
>   PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
>      hdr->type, type);
> - goto retry;
> + return -EINVAL;
>   }
> 
>   if (len < resplen) {


The situation is that NVS_TYPE_RNDIS is a receive packet that is
arriving while subchannel is being setup. For first channel this
doesn't happen because control operations at that level happen
before packets arrive.

Needs some more research before coming up with a good fix.
Either the processing of responses in nvs_execute needs to use
the same receive processing function as normal data. Which
means adding logic to wait for condition; or the incoming
packets there could be dropped; or the device needs to be
stopped before configuring sub channels.

Dropping is probably the easiest to implement.
  
Min Tang March 2, 2020, 3:40 p.m. UTC | #4
Hi Stephen:

If there is no intention to process the response message of NVS_TYPE_RNDIS,
would it be better to not set the flags to VMBUS_CHANPKT_FLAG_RC so that it
won't receive any response message?

Best Regards,
Min Tang

On Sun, Mar 1, 2020 at 12:54 PM Stephen Hemminger <
stephen@networkplumber.org> wrote:

> On Thu, 27 Feb 2020 11:16:01 -0500
> Min Tang <tommytang@gmail.com> wrote:
>
> > Hi Stephen:
> >
> > I saw the following error messages when using DPDK 18.11.2 in Azure:
> >
> > hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
> > hn_dev_configure(): subchannel configuration failed
> >
> > It was not easy to reproduce it and it only occurred with multiple queues
> > enabled. In hn_nvs_execute it expects the response to match the request.
> In
> > the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
> > got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
> > message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
> > looked at the code and found that the NVS_TYPE_RNDIS message needs
> > completion response but it does not receive the response message
> anywhere.
> > The fix would be receiving and discarding the wrong response message(s).
> >
> > I put the following patches and it has fixed the problem.
> >
> > --- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
> > +++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
> > @@ -92,7 +92,7 @@
> >   if (hdr->type != type) {
> >   PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
> >      hdr->type, type);
> > - goto retry;
> > + return -EINVAL;
> >   }
> >
> >   if (len < resplen) {
>
>
> The situation is that NVS_TYPE_RNDIS is a receive packet that is
> arriving while subchannel is being setup. For first channel this
> doesn't happen because control operations at that level happen
> before packets arrive.
>
> Needs some more research before coming up with a good fix.
> Either the processing of responses in nvs_execute needs to use
> the same receive processing function as normal data. Which
> means adding logic to wait for condition; or the incoming
> packets there could be dropped; or the device needs to be
> stopped before configuring sub channels.
>
> Dropping is probably the easiest to implement.
>
>
>
  
Stephen Hemminger March 2, 2020, 4:07 p.m. UTC | #5
On Mon, 2 Mar 2020 10:40:17 -0500
Min Tang <tommytang@gmail.com> wrote:

> Hi Stephen:
> 
> If there is no intention to process the response message of NVS_TYPE_RNDIS,
> would it be better to not set the flags to VMBUS_CHANPKT_FLAG_RC so that it
> won't receive any response message?
> 
> Best Regards,
> Min Tang
> 
> On Sun, Mar 1, 2020 at 12:54 PM Stephen Hemminger <
> stephen@networkplumber.org> wrote:  
> 
> > On Thu, 27 Feb 2020 11:16:01 -0500
> > Min Tang <tommytang@gmail.com> wrote:
> >  
> > > Hi Stephen:
> > >
> > > I saw the following error messages when using DPDK 18.11.2 in Azure:
> > >
> > > hn_nvs_execute(): unexpected NVS resp 0x6b, expect 0x85
> > > hn_dev_configure(): subchannel configuration failed
> > >
> > > It was not easy to reproduce it and it only occurred with multiple queues
> > > enabled. In hn_nvs_execute it expects the response to match the request.  
> > In  
> > > the failed case, it was expecting NVS_TYPE_SUBCH_REQ (133 or 0x85) but
> > > got NVS_TYPE_RNDIS(107 or 0x6b). Obviously somewhere the NVS_TYPE_RNDIS
> > > message had been sent before the NVS_TYPE_SUBCH_REQ message was sent.  I
> > > looked at the code and found that the NVS_TYPE_RNDIS message needs
> > > completion response but it does not receive the response message  
> > anywhere.  
> > > The fix would be receiving and discarding the wrong response message(s).
> > >
> > > I put the following patches and it has fixed the problem.
> > >
> > > --- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
> > > +++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
> > > @@ -92,7 +92,7 @@
> > >   if (hdr->type != type) {
> > >   PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
> > >      hdr->type, type);
> > > - goto retry;
> > > + return -EINVAL;
> > >   }
> > >
> > >   if (len < resplen) {  
> >
> >
> > The situation is that NVS_TYPE_RNDIS is a receive packet that is
> > arriving while subchannel is being setup. For first channel this
> > doesn't happen because control operations at that level happen
> > before packets arrive.
> >
> > Needs some more research before coming up with a good fix.
> > Either the processing of responses in nvs_execute needs to use
> > the same receive processing function as normal data. Which
> > means adding logic to wait for condition; or the incoming
> > packets there could be dropped; or the device needs to be
> > stopped before configuring sub channels.
> >
> > Dropping is probably the easiest to implement.
> >
> >
> >  

The way transmit works is that NVS_TYPE_RNDIS is sent and the
response indicates a transmit completed.
  

Patch

--- a/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:08:29.755530969 -0500
+++ b/drivers/net/netvsc/hn_nvs.c 2020-02-27 11:07:21.567371798 -0500
@@ -92,7 +92,7 @@ 
  if (hdr->type != type) {
  PMD_DRV_LOG(ERR, "unexpected NVS resp %#x, expect %#x",
     hdr->type, type);
- goto retry;
+ return -EINVAL;
  }

  if (len < resplen) {