[dpdk-dev,1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function

Message ID 1482127758-4904-1-git-send-email-jianbo.liu@linaro.org (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit

Checks

Context               Check    Description
ci/checkpatch         warning  coding style issues
ci/Intel compilation  success  Compilation OK

Commit Message

Jianbo Liu Dec. 19, 2016, 6:09 a.m. UTC
  To get better performance, the Rx bulk alloc receive function scans 8 descriptors
at a time, but the statuses it reads are not guaranteed to be consistent on ARM
platforms because the memory allocated for Rx descriptors is cacheable hugepages.
This patch calculates the number of received packets by scanning the DD bits
sequentially, stopping at the first descriptor whose DD bit is not set.

Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
  

Comments

Ananyev, Konstantin Feb. 1, 2017, 4:19 p.m. UTC | #1
Hi,

> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Monday, December 19, 2016 6:09 AM
> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jerin.jacob@caviumnetworks.com
> Cc: Jianbo Liu <jianbo.liu@linaro.org>
> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> 
> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
> in one time, but the statuses are not consistent on ARM platform because
> the memory allocated for Rx descriptors is cacheable hugepages.
> This patch is to calculate the number of received packets by scanning DD bit
> sequentially, and stops when meeting the first packet with DD bit unset.
> 
> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index b2d9f45..2866bdb 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>  	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>  	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>  		/* Read desc statuses backwards to avoid race condition */
> -		for (j = LOOK_AHEAD-1; j >= 0; --j)
> +		for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>  			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> -
> -		for (j = LOOK_AHEAD - 1; j >= 0; --j)
>  			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>  						       lo_dword.data);
> +		}
> +
> +		rte_smp_rmb();

If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
As another nit - with rmb() in, and because you are looking for the first gap in s[] now,
there is no need to read the RXDs in backward order.
How it looks to me (as a suggestion):

for (j = 0; j != LOOK_AHEAD; j++)
	s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);

rte_smp_rmb();

for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
	;

for (j = 0; j < nb_dd; ++j) {
	pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
               ....

Konstantin


> 
>  		/* Compute how many status bits were set */
>  		nb_dd = 0;
>  		for (j = 0; j < LOOK_AHEAD; ++j)
> -			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> +			if (s[j] & IXGBE_RXDADV_STAT_DD)
> +				++nb_dd;
> +			else
> +				break;
> 
>  		nb_rx += nb_dd;
> 
> --
> 2.4.11
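
For reference, Konstantin's suggested ordering (read all statuses first, barrier, count DD
bits sequentially, only then read the rest of each completed descriptor) can be exercised
as a small standalone sketch. The struct fake_rxd layout, the scan_descs() helper, the DD
bit value and the C11 acquire fence below are simplified stand-ins for the real ixgbe
descriptor format and rte_smp_rmb(); this is an illustration, not the driver code.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdatomic.h>

#define LOOK_AHEAD 8
#define IXGBE_RXDADV_STAT_DD 0x01	/* local stand-in for the real DD bit */

/* Simplified stand-in for the ixgbe advanced Rx descriptor write-back format
 * (endianness conversion omitted for brevity). */
struct fake_rxd {
	uint32_t status_error;	/* rxdp[j].wb.upper.status_error */
	uint32_t lo_dword;	/* rxdp[j].wb.lower.lo_dword.data */
};

/* Scan up to LOOK_AHEAD descriptors; return how many are complete. */
static int
scan_descs(const volatile struct fake_rxd *rxdp, uint32_t *pkt_info)
{
	uint32_t s[LOOK_AHEAD];
	int j, nb_dd;

	/* 1. Read all the statuses first. */
	for (j = 0; j != LOOK_AHEAD; j++)
		s[j] = rxdp[j].status_error;

	/* 2. Order the status reads before any further descriptor reads
	 * (rte_smp_rmb() in the driver; a C11 acquire fence here). */
	atomic_thread_fence(memory_order_acquire);

	/* 3. Count DD bits sequentially, stopping at the first gap. */
	for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
	     (s[nb_dd] & IXGBE_RXDADV_STAT_DD) != 0; nb_dd++)
		;

	/* 4. Only now read the rest of the completed descriptors. */
	for (j = 0; j < nb_dd; j++)
		pkt_info[j] = rxdp[j].lo_dword;

	return nb_dd;
}

int
main(void)
{
	/* Descriptors 0 and 1 are done, 2 is not; 3 must not be counted
	 * even though its DD bit is set. */
	struct fake_rxd ring[LOOK_AHEAD] = {
		{ IXGBE_RXDADV_STAT_DD, 111 },
		{ IXGBE_RXDADV_STAT_DD, 222 },
		{ 0, 0 },
		{ IXGBE_RXDADV_STAT_DD, 333 },
	};
	uint32_t pkt_info[LOOK_AHEAD];
	int nb_dd = scan_descs(ring, pkt_info);

	printf("%d descriptors complete, first pkt_info = %" PRIu32 "\n",
	       nb_dd, pkt_info[0]);
	return 0;
}

Built with any C11 compiler, the sketch reports 2 complete descriptors: descriptor 3 is
skipped even though its DD bit is set, because descriptor 2 breaks the sequence.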
  
Jianbo Liu Feb. 3, 2017, 6:22 a.m. UTC | #2
On 2 February 2017 at 00:19, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
> Hi,
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Monday, December 19, 2016 6:09 AM
>> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
>> jerin.jacob@caviumnetworks.com
>> Cc: Jianbo Liu <jianbo.liu@linaro.org>
>> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
>> in one time, but the statuses are not consistent on ARM platform because
>> the memory allocated for Rx descriptors is cacheable hugepages.
>> This patch is to calculate the number of received packets by scanning DD bit
>> sequentially, and stops when meeting the first packet with DD bit unset.
>>
>> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> ---
>>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index b2d9f45..2866bdb 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>>               /* Read desc statuses backwards to avoid race condition */
>> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
>> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> -
>> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
>>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>>                                                      lo_dword.data);
>> +             }
>> +
>> +             rte_smp_rmb();
>
> If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?

The barrier is to prevent reordering with the reads that follow,
which count the number of actually received packets.
And as wb.upper and wb.lower of one descriptor are in the same
cacheline, could it be better to read them at the same time?

> As another nit - with rmb() in and because you are looking the first gap in s[] now,
> no need to read TXDs in backward order.

Reading backward is just to keep the code as it is for the x86 platform.

> How it looks to me (as a suggestion):
>
> for (j = 0; j != LOOK_AHEAD; j++)
>         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>
> rte_smp_rmb();
>
> for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
>         ;
>
> for (j = 0; j < nb_dd; ++j) {
>         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
>                ....
>
> Konstantin
>
>
>>
>>               /* Compute how many status bits were set */
>>               nb_dd = 0;
>>               for (j = 0; j < LOOK_AHEAD; ++j)
>> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
>> +                             ++nb_dd;
>> +                     else
>> +                             break;
>>
>>               nb_rx += nb_dd;
>>
>> --
>> 2.4.11
>
  
Ananyev, Konstantin Feb. 3, 2017, 11:38 a.m. UTC | #3
> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> Sent: Friday, February 3, 2017 6:22 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; jerin.jacob@caviumnetworks.com
> Subject: Re: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> 
> On 2 February 2017 at 00:19, Ananyev, Konstantin
> <konstantin.ananyev@intel.com> wrote:
> > Hi,
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
> >> Sent: Monday, December 19, 2016 6:09 AM
> >> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> >> jerin.jacob@caviumnetworks.com
> >> Cc: Jianbo Liu <jianbo.liu@linaro.org>
> >> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
> >>
> >> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
> >> in one time, but the statuses are not consistent on ARM platform because
> >> the memory allocated for Rx descriptors is cacheable hugepages.
> >> This patch is to calculate the number of received packets by scanning DD bit
> >> sequentially, and stops when meeting the first packet with DD bit unset.
> >>
> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
> >> ---
> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
> >>  1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index b2d9f45..2866bdb 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
> >>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
> >>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
> >>               /* Read desc statuses backwards to avoid race condition */
> >> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
> >> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
> >>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> >> -
> >> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
> >>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
> >>                                                      lo_dword.data);
> >> +             }
> >> +
> >> +             rte_smp_rmb();
> >
> > If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
> 
> The barrier is to forbid the reordering from the following readings,
> which will count the number of actual received packets.

What I meant is that if you'll keep reading from both rxdp[].wb.lower and rxdp[].wb.upper
before rmb, then nothing would prevent the cpu from reordering these reads in any way it likes
(if we are talking about cpus with read reordering allowed), right?
So it can end up with the following order:

rxdp[N].wb.lower
rxdp[N].wb.upper

or even:

rxdp[N-1].wb.lower
rxdp[N].wb.lower
rxdp[N-1].wb.upper
rxdp[N].wb.upper

In such cases pkt_info[] may contain invalid data.

> And as wb.uper and wb.lower of one descriptor are in the same
> cacheline, could it be better to read them at the same time?.

It could be, but I think for the sake of data integrity we have to make sure that the
cpu would never read any other RXD field before wb.upper.status_error, see above.

BTW, the following code might re-read both wb.upper and wb.lower anyway.
So I don't think you'll save many cycles here anyway. 

> 
> > As another nit - with rmb() in and because you are looking the first gap in s[] now,
> > no need to read TXDs in backward order.
> 
> Reading backward is just to keep as it is for x86 platform.

With the change you are introducing, I don't think it is necessary any more.

Konstantin

> 
> > How it looks to me (as a suggestion):
> >
> > for (j = 0; j != LOOK_AHEAD; j++)
> >         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
> >
> > rte_smp_rmb();
> >
> > for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
> >         ;
> >
> > for (j = 0; j < nb_dd; ++j) {
> >         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
> >                ....
> >
> > Konstantin
> >
> >
> >>
> >>               /* Compute how many status bits were set */
> >>               nb_dd = 0;
> >>               for (j = 0; j < LOOK_AHEAD; ++j)
> >> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
> >> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
> >> +                             ++nb_dd;
> >> +                     else
> >> +                             break;
> >>
> >>               nb_rx += nb_dd;
> >>
> >> --
> >> 2.4.11
> >
  
Jianbo Liu Feb. 4, 2017, 3:37 a.m. UTC | #4
On 3 February 2017 at 19:38, Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> Sent: Friday, February 3, 2017 6:22 AM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
>> Cc: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; jerin.jacob@caviumnetworks.com
>> Subject: Re: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> On 2 February 2017 at 00:19, Ananyev, Konstantin
>> <konstantin.ananyev@intel.com> wrote:
>> > Hi,
>> >
>> >> -----Original Message-----
>> >> From: Jianbo Liu [mailto:jianbo.liu@linaro.org]
>> >> Sent: Monday, December 19, 2016 6:09 AM
>> >> To: dev@dpdk.org; Zhang, Helin <helin.zhang@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
>> >> jerin.jacob@caviumnetworks.com
>> >> Cc: Jianbo Liu <jianbo.liu@linaro.org>
>> >> Subject: [PATCH 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>> >>
>> >> To get better performance, Rx bulk alloc recv function will scan 8 descriptors
>> >> in one time, but the statuses are not consistent on ARM platform because
>> >> the memory allocated for Rx descriptors is cacheable hugepages.
>> >> This patch is to calculate the number of received packets by scanning DD bit
>> >> sequentially, and stops when meeting the first packet with DD bit unset.
>> >>
>> >> Signed-off-by: Jianbo Liu <jianbo.liu@linaro.org>
>> >> ---
>> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 12 ++++++++----
>> >>  1 file changed, 8 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> index b2d9f45..2866bdb 100644
>> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> >> @@ -1402,17 +1402,21 @@ ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
>> >>       for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>> >>            i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>> >>               /* Read desc statuses backwards to avoid race condition */
>> >> -             for (j = LOOK_AHEAD-1; j >= 0; --j)
>> >> +             for (j = LOOK_AHEAD - 1; j >= 0; --j) {
>> >>                       s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> >> -
>> >> -             for (j = LOOK_AHEAD - 1; j >= 0; --j)
>> >>                       pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> >>                                                      lo_dword.data);
>> >> +             }
>> >> +
>> >> +             rte_smp_rmb();
>> >
>> > If reads can be reordered, shouldn't we fill pkt_info[] after smp_rmb() here?
>>
>> The barrier is to forbid the reordering from the following readings,
>> which will count the number of actual received packets.
>
> What I meant is that if you'll keep reading from both rxdp[].wb.lower and rxdp[].wb.upper
> before rmb, then nothing would prevent cpu from reorder these reads in any way it likes
> (if we are talking about cpus with read reordering allowed), right?
> So it can end up with the following order:
>
> rxdp[N].wb.lower
> rxdp[N].wb.upper
>
> or even:
>
> rxdp[N-1].wb.lower
> rxdp[N].wb.lower
> rxdp[N-1].wb.upper
> rxdp[N].wb.upper
>
> In such cases pkt_info[] may contain invalid data.

Yes, it's possible. I'll send v2.

Thanks!

>
>> And as wb.uper and wb.lower of one descriptor are in the same
>> cacheline, could it be better to read them at the same time?.
>
> It could be, but I think for the sake of data integrity we have to make sure that
> cpu would never read any other RXD field before wb.upper. status_error, see above.
>
> BTW, the following code might re-read both wb.upper and wb.lower anyway.
> So I don't think you'll save many cycles here anyway.
>
>>
>> > As another nit - with rmb() in and because you are looking the first gap in s[] now,
>> > no need to read TXDs in backward order.
>>
>> Reading backward is just to keep as it is for x86 platform.
>
> With the change you introducing, I don't think it is necessary any more.
>
> Konstantin
>
>>
>> > How it looks to me (as a suggestion):
>> >
>> > for (j = 0; j != LOOK_AHEAD; j++)
>> >         s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> >
>> > rte_smp_rmb();
>> >
>> > for (j = 0; j < LOOK_AHEAD && (s[j] & IXGBE_RXDADV_STAT_DD) != 0; j++)
>> >         ;
>> >
>> > for (j = 0; j < nb_dd; ++j) {
>> >         pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.lo_dword.data);
>> >                ....
>> >
>> > Konstantin
>> >
>> >
>> >>
>> >>               /* Compute how many status bits were set */
>> >>               nb_dd = 0;
>> >>               for (j = 0; j < LOOK_AHEAD; ++j)
>> >> -                     nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> >> +                     if (s[j] & IXGBE_RXDADV_STAT_DD)
>> >> +                             ++nb_dd;
>> >> +                     else
>> >> +                             break;
>> >>
>> >>               nb_rx += nb_dd;
>> >>
>> >> --
>> >> 2.4.11
>> >
  

Patch

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index b2d9f45..2866bdb 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1402,17 +1402,21 @@  ixgbe_rx_scan_hw_ring(struct ixgbe_rx_queue *rxq)
 	for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
 	     i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
 		/* Read desc statuses backwards to avoid race condition */
-		for (j = LOOK_AHEAD-1; j >= 0; --j)
+		for (j = LOOK_AHEAD - 1; j >= 0; --j) {
 			s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
-
-		for (j = LOOK_AHEAD - 1; j >= 0; --j)
 			pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
 						       lo_dword.data);
+		}
+
+		rte_smp_rmb();
 
 		/* Compute how many status bits were set */
 		nb_dd = 0;
 		for (j = 0; j < LOOK_AHEAD; ++j)
-			nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
+			if (s[j] & IXGBE_RXDADV_STAT_DD)
+				++nb_dd;
+			else
+				break;
 
 		nb_rx += nb_dd;