[2/2] vdpa/mlx5: retry VAR allocation during vDPA restart

Message ID 20210923081758.178745-2-xuemingl@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Headers
Series [1/2] vdpa/mlx5: workaround FW first completion in start |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/github-robot: build success github build: passed

Commit Message

Xueming Li Sept. 23, 2021, 8:17 a.m. UTC
  VAR is the device memory space for the virtio queue doorbells. QEMU
can mmap it directly to speed up doorbell pushes.

On a busy system, QEMU takes time to release VAR resources during driver
shutdown. If vDPA is restarted quickly, VAR allocation fails with
error 28 (ENOSPC) since VAR is a singleton resource per device.

This patch adds a retry mechanism for VAR allocation.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Reviewed-by: Matan Azrad <matan@nvidia.com>
---
 drivers/vdpa/mlx5/mlx5_vdpa.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
  

Comments

Maxime Coquelin Oct. 13, 2021, 10:06 a.m. UTC | #1
On 9/23/21 10:17, Xueming Li wrote:
> VAR is the device memory space for the virtio queue doorbells. QEMU
> can mmap it directly to speed up doorbell pushes.
> 
> On a busy system, QEMU takes time to release VAR resources during driver
> shutdown. If vDPA is restarted quickly, VAR allocation fails with
> error 28 (ENOSPC) since VAR is a singleton resource per device.
> 
> This patch adds a retry mechanism for VAR allocation.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Reviewed-by: Matan Azrad <matan@nvidia.com>
> ---
>   drivers/vdpa/mlx5/mlx5_vdpa.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/mlx5/mlx5_vdpa.c b/drivers/vdpa/mlx5/mlx5_vdpa.c
> index 6d17d7a6f3..991739e984 100644
> --- a/drivers/vdpa/mlx5/mlx5_vdpa.c
> +++ b/drivers/vdpa/mlx5/mlx5_vdpa.c
> @@ -693,7 +693,14 @@ mlx5_vdpa_dev_probe(struct rte_device *dev)
>   	if (attr.num_lag_ports == 0)
>   		priv->num_lag_ports = 1;
>   	priv->ctx = ctx;
> -	priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
> +	for (retry = 0; retry < 7; retry++) {
> +		priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
> +		if (priv->var != NULL)
> +			break;
> +		DRV_LOG(WARNING, "Failed to allocate VAR, retry %d.\n", retry);
> +		/* Wait Qemu release VAR during vdpa restart, 0.1 sec based. */
> +		usleep(100000U << retry);
> +	}
>   	if (!priv->var) {
>   		DRV_LOG(ERR, "Failed to allocate VAR %u.", errno);
>   		goto error;
> 

That looks fragile, but at least we have a warning we can rely on.
Shouldn't we have a way to wait for Qemu to release the resources at
vdpa driver shutdown time?

Also, as on patch 1, please add a Fixes tag if you want it to be
backported.

Regards,
Maxime
  
Xueming Li Oct. 13, 2021, 10:14 a.m. UTC | #2
On Wed, 2021-10-13 at 12:06 +0200, Maxime Coquelin wrote:
> 
> On 9/23/21 10:17, Xueming Li wrote:
> > VAR is the device memory space for the virtio queue doorbells. QEMU
> > can mmap it directly to speed up doorbell pushes.
> > 
> > On a busy system, QEMU takes time to release VAR resources during driver
> > shutdown. If vDPA is restarted quickly, VAR allocation fails with
> > error 28 (ENOSPC) since VAR is a singleton resource per device.
> > 
> > This patch adds a retry mechanism for VAR allocation.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > Reviewed-by: Matan Azrad <matan@nvidia.com>
> > ---
> >   drivers/vdpa/mlx5/mlx5_vdpa.c | 9 ++++++++-
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vdpa/mlx5/mlx5_vdpa.c b/drivers/vdpa/mlx5/mlx5_vdpa.c
> > index 6d17d7a6f3..991739e984 100644
> > --- a/drivers/vdpa/mlx5/mlx5_vdpa.c
> > +++ b/drivers/vdpa/mlx5/mlx5_vdpa.c
> > @@ -693,7 +693,14 @@ mlx5_vdpa_dev_probe(struct rte_device *dev)
> >   	if (attr.num_lag_ports == 0)
> >   		priv->num_lag_ports = 1;
> >   	priv->ctx = ctx;
> > -	priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
> > +	for (retry = 0; retry < 7; retry++) {
> > +		priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
> > +		if (priv->var != NULL)
> > +			break;
> > +		DRV_LOG(WARNING, "Failed to allocate VAR, retry %d.\n", retry);
> > +		/* Wait Qemu release VAR during vdpa restart, 0.1 sec based. */
> > +		usleep(100000U << retry);
> > +	}
> >   	if (!priv->var) {
> >   		DRV_LOG(ERR, "Failed to allocate VAR %u.", errno);
> >   		goto error;
> > 
> 
> That looks fragile, but at least we have a warning we can rely on.
> Shouldn't we have a way to wait for Qemu to release the resources at
> vdpa driver shutdown time?

If dpdk-vdpa gets killed and restarted, QEMU will shut down the device
and unmap the resources independently.

> 
> Also, as on patch 1, please add a Fixes tag if you want it to be
> backported.

Agreed on backporting, but this is not a fix, so I'll add
Cc: stable@dpdk.org instead; the patch will then be noticed by the
maintainer. Thanks for the suggestion!

> 
> Regards,
> Maxime
>
  

Patch

diff --git a/drivers/vdpa/mlx5/mlx5_vdpa.c b/drivers/vdpa/mlx5/mlx5_vdpa.c
index 6d17d7a6f3..991739e984 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa.c
@@ -693,7 +693,14 @@  mlx5_vdpa_dev_probe(struct rte_device *dev)
 	if (attr.num_lag_ports == 0)
 		priv->num_lag_ports = 1;
 	priv->ctx = ctx;
-	priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
+	for (retry = 0; retry < 7; retry++) {
+		priv->var = mlx5_glue->dv_alloc_var(ctx, 0);
+		if (priv->var != NULL)
+			break;
+		DRV_LOG(WARNING, "Failed to allocate VAR, retry %d.\n", retry);
+		/* Wait Qemu release VAR during vdpa restart, 0.1 sec based. */
+		usleep(100000U << retry);
+	}
 	if (!priv->var) {
 		DRV_LOG(ERR, "Failed to allocate VAR %u.", errno);
 		goto error;