net/mlx5: fix instruction hotspot on replenishing Rx buffer

Message ID 20190109085426.39965-1-yskoh@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series net/mlx5: fix instruction hotspot on replenishing Rx buffer |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/intel-Performance-Testing success Performance Testing PASS
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/Intel-compilation success Compilation OK

Commit Message

Yongseok Koh Jan. 9, 2019, 8:54 a.m. UTC
  On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't needed
to be accessed as it is static and easily calculated from the mbuf address.
Accessing the mbuf content causes unnecessary load stall and it is worsened
on ARM.

Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
Cc: stable@dpdk.org

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
  

Comments

David Marchand Jan. 9, 2019, 9:38 a.m. UTC | #1
On Wed, Jan 9, 2019 at 9:54 AM Yongseok Koh <yskoh@mellanox.com> wrote:

> On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't needed
> to be accessed as it is static and easily calculated from the mbuf address.
> Accessing the mbuf content causes unnecessary load stall and it is worsened
> on ARM.
>
> Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
> Cc: stable@dpdk.org
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h
> b/drivers/net/mlx5/mlx5_rxtx_vec.h
> index fda7004e2d..ced5547307 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> @@ -102,8 +102,12 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq, uint16_t n)
>                 return;
>         }
>         for (i = 0; i < n; ++i) {
> -               wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr
> +
> -                                             RTE_PKTMBUF_HEADROOM);
> +               uintptr_t buf_addr =
> +                       (uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
> +                       rte_pktmbuf_priv_size(rxq->mp) +
> RTE_PKTMBUF_HEADROOM;
> +
> +               assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
> +               wq[i].addr = rte_cpu_to_be_64(buf_addr);
>                 /* If there's only one MR, no need to replace LKey in WQE.
> */
>                 if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) >
> 1))
>                         wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> --
> 2.11.0
>
>
How about having a macro / inline in the mbuf api to get this information
in a consistent/unique way ?
I can see we have this calculation at least in rte_pktmbuf_init() and
rte_pktmbuf_detach().
  
Olivier Matz Jan. 9, 2019, 9:52 a.m. UTC | #2
On Wed, Jan 09, 2019 at 10:38:07AM +0100, David Marchand wrote:
> On Wed, Jan 9, 2019 at 9:54 AM Yongseok Koh <yskoh@mellanox.com> wrote:
> 
> > On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't needed
> > to be accessed as it is static and easily calculated from the mbuf address.
> > Accessing the mbuf content causes unnecessary load stall and it is worsened
> > on ARM.
> >
> > Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h
> > b/drivers/net/mlx5/mlx5_rxtx_vec.h
> > index fda7004e2d..ced5547307 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> > @@ -102,8 +102,12 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> > *rxq, uint16_t n)
> >                 return;
> >         }
> >         for (i = 0; i < n; ++i) {
> > -               wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr
> > +
> > -                                             RTE_PKTMBUF_HEADROOM);
> > +               uintptr_t buf_addr =
> > +                       (uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
> > +                       rte_pktmbuf_priv_size(rxq->mp) +
> > RTE_PKTMBUF_HEADROOM;
> > +
> > +               assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
> > +               wq[i].addr = rte_cpu_to_be_64(buf_addr);
> >                 /* If there's only one MR, no need to replace LKey in WQE.
> > */
> >                 if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) >
> > 1))
> >                         wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> > --
> > 2.11.0
> >
> >
> How about having a macro / inline in the mbuf api to get this information
> in a consistent/unique way ?
> I can see we have this calculation at least in rte_pktmbuf_init() and
> rte_pktmbuf_detach().

Agree. Maybe rte_mbuf_default_buf_addr(m) ?

Side note, is the assert() correct in the patch? I'd say there's a
difference of RTE_PKTMBUF_HEADROOM between the 2 values.

Olivier
  
Yongseok Koh Jan. 9, 2019, 9:56 a.m. UTC | #3
> On Jan 9, 2019, at 1:52 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> 
> On Wed, Jan 09, 2019 at 10:38:07AM +0100, David Marchand wrote:
>> On Wed, Jan 9, 2019 at 9:54 AM Yongseok Koh <yskoh@mellanox.com> wrote:
>> 
>>> On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't needed
>>> to be accessed as it is static and easily calculated from the mbuf address.
>>> Accessing the mbuf content causes unnecessary load stall and it is worsened
>>> on ARM.
>>> 
>>> Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
>>> Cc: stable@dpdk.org
>>> 
>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>> ---
>>> drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h
>>> b/drivers/net/mlx5/mlx5_rxtx_vec.h
>>> index fda7004e2d..ced5547307 100644
>>> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
>>> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
>>> @@ -102,8 +102,12 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
>>> *rxq, uint16_t n)
>>>                return;
>>>        }
>>>        for (i = 0; i < n; ++i) {
>>> -               wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr
>>> +
>>> -                                             RTE_PKTMBUF_HEADROOM);
>>> +               uintptr_t buf_addr =
>>> +                       (uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
>>> +                       rte_pktmbuf_priv_size(rxq->mp) +
>>> RTE_PKTMBUF_HEADROOM;
>>> +
>>> +               assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
>>> +               wq[i].addr = rte_cpu_to_be_64(buf_addr);
>>>                /* If there's only one MR, no need to replace LKey in WQE.
>>> */
>>>                if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) >
>>> 1))
>>>                        wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
>>> --
>>> 2.11.0
>>> 
>>> 
>> How about having a macro / inline in the mbuf api to get this information
>> in a consistent/unique way ?
>> I can see we have this calculation at least in rte_pktmbuf_init() and
>> rte_pktmbuf_detach().
> 
> Agree. Maybe rte_mbuf_default_buf_addr(m) ?

I'm also okay to add. Will come up with a new patch.

> Side note, is the assert() correct in the patch? I'd say there's a
> difference of RTE_PKTMBUF_HEADROOM between the 2 values.

Oops, my fault. Thanks for the catch, you saved a crash. :-)

Thanks,
Yongseok
  
David Marchand Jan. 9, 2019, 10:05 a.m. UTC | #4
On Wed, Jan 9, 2019 at 10:56 AM Yongseok Koh <yskoh@mellanox.com> wrote:

>
> > On Jan 9, 2019, at 1:52 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> >
> > On Wed, Jan 09, 2019 at 10:38:07AM +0100, David Marchand wrote:
> >> On Wed, Jan 9, 2019 at 9:54 AM Yongseok Koh <yskoh@mellanox.com> wrote:
> >>
> >>> On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't
> needed
> >>> to be accessed as it is static and easily calculated from the mbuf
> address.
> >>> Accessing the mbuf content causes unnecessary load stall and it is
> worsened
> >>> on ARM.
> >>>
> >>> Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
> >>> Cc: stable@dpdk.org
> >>>
> >>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> >>> ---
> >>> drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
> >>> 1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> b/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> index fda7004e2d..ced5547307 100644
> >>> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> @@ -102,8 +102,12 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> >>> *rxq, uint16_t n)
> >>>                return;
> >>>        }
> >>>        for (i = 0; i < n; ++i) {
> >>> -               wq[i].addr =
> rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr
> >>> +
> >>> -                                             RTE_PKTMBUF_HEADROOM);
> >>> +               uintptr_t buf_addr =
> >>> +                       (uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
> >>> +                       rte_pktmbuf_priv_size(rxq->mp) +
> >>> RTE_PKTMBUF_HEADROOM;
> >>> +
> >>> +               assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
> >>> +               wq[i].addr = rte_cpu_to_be_64(buf_addr);
> >>>                /* If there's only one MR, no need to replace LKey in
> WQE.
> >>> */
> >>>                if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) >
> >>> 1))
> >>>                        wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> >>> --
> >>> 2.11.0
> >>>
> >>>
> >> How about having a macro / inline in the mbuf api to get this
> information
> >> in a consistent/unique way ?
> >> I can see we have this calculation at least in rte_pktmbuf_init() and
> >> rte_pktmbuf_detach().
> >
> > Agree. Maybe rte_mbuf_default_buf_addr(m) ?
>
> I'm also okay to add. Will come up with a new patch.
>
> > Side note, is the assert() correct in the patch? I'd say there's a
> > difference of RTE_PKTMBUF_HEADROOM between the 2 values.
>
> Oops, my fault. Thanks for the catch, you saved a crash. :-)
>

Is this assert really necessary if we have a common macro ?
I was under the impression that this assert is there to catch misalignement
between the mbuf api and the driver.
  
Yongseok Koh Jan. 9, 2019, 10:11 a.m. UTC | #5
> On Jan 9, 2019, at 2:05 AM, David Marchand <david.marchand@redhat.com> wrote:
> 
> 
> 
> On Wed, Jan 9, 2019 at 10:56 AM Yongseok Koh <yskoh@mellanox.com> wrote:
> 
> > On Jan 9, 2019, at 1:52 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> > 
> > On Wed, Jan 09, 2019 at 10:38:07AM +0100, David Marchand wrote:
> >> On Wed, Jan 9, 2019 at 9:54 AM Yongseok Koh <yskoh@mellanox.com> wrote:
> >> 
> >>> On replenishing Rx buffers for vectorized Rx, mbuf->buf_addr isn't needed
> >>> to be accessed as it is static and easily calculated from the mbuf address.
> >>> Accessing the mbuf content causes unnecessary load stall and it is worsened
> >>> on ARM.
> >>> 
> >>> Fixes: 545b884b1da3 ("net/mlx5: fix buffer address posting in SSE Rx")
> >>> Cc: stable@dpdk.org
> >>> 
> >>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> >>> ---
> >>> drivers/net/mlx5/mlx5_rxtx_vec.h | 8 ++++++--
> >>> 1 file changed, 6 insertions(+), 2 deletions(-)
> >>> 
> >>> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> b/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> index fda7004e2d..ced5547307 100644
> >>> --- a/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
> >>> @@ -102,8 +102,12 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> >>> *rxq, uint16_t n)
> >>>                return;
> >>>        }
> >>>        for (i = 0; i < n; ++i) {
> >>> -               wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr
> >>> +
> >>> -                                             RTE_PKTMBUF_HEADROOM);
> >>> +               uintptr_t buf_addr =
> >>> +                       (uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
> >>> +                       rte_pktmbuf_priv_size(rxq->mp) +
> >>> RTE_PKTMBUF_HEADROOM;
> >>> +
> >>> +               assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
> >>> +               wq[i].addr = rte_cpu_to_be_64(buf_addr);
> >>>                /* If there's only one MR, no need to replace LKey in WQE.
> >>> */
> >>>                if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) >
> >>> 1))
> >>>                        wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> >>> --
> >>> 2.11.0
> >>> 
> >>> 
> >> How about having a macro / inline in the mbuf api to get this information
> >> in a consistent/unique way ?
> >> I can see we have this calculation at least in rte_pktmbuf_init() and
> >> rte_pktmbuf_detach().
> > 
> > Agree. Maybe rte_mbuf_default_buf_addr(m) ?
> 
> I'm also okay to add. Will come up with a new patch.
> 
> > Side note, is the assert() correct in the patch? I'd say there's a
> > difference of RTE_PKTMBUF_HEADROOM between the 2 values.
> 
> Oops, my fault. Thanks for the catch, you saved a crash. :-)
> 
> Is this assert really necessary if we have a common macro ?
> I was under the impression that this assert is there to catch misalignement between the mbuf api and the driver.

It is still good to have. This can catch corruption of mbuf content which sometimes
happens due to wrong mbuf handling in PMD or potential HW memory corruption.


Thanks,
Yongseok
  

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index fda7004e2d..ced5547307 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -102,8 +102,12 @@  mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 		return;
 	}
 	for (i = 0; i < n; ++i) {
-		wq[i].addr = rte_cpu_to_be_64((uintptr_t)elts[i]->buf_addr +
-					      RTE_PKTMBUF_HEADROOM);
+		uintptr_t buf_addr =
+			(uintptr_t)elts[i] + sizeof(struct rte_mbuf) +
+			rte_pktmbuf_priv_size(rxq->mp) + RTE_PKTMBUF_HEADROOM;
+
+		assert(buf_addr == (uintptr_t)elts[i]->buf_addr);
+		wq[i].addr = rte_cpu_to_be_64(buf_addr);
 		/* If there's only one MR, no need to replace LKey in WQE. */
 		if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1))
 			wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);