[RFC,v2] net/af_packet: make stats reset reliable

Message ID 20240426143848.2280689-1-ferruh.yigit@amd.com (mailing list archive)
State New
Delegated to: Ferruh Yigit
Series [RFC,v2] net/af_packet: make stats reset reliable

Checks

Context Check Description
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-compile-amd64-testing fail Testing issues
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS

Commit Message

Ferruh Yigit April 26, 2024, 2:38 p.m. UTC
  For stats reset, use an offset instead of zeroing out the actual stats
values; get_stats() displays the diff between the stats and the offset.
This way, stats are only updated in the datapath and the offset is only
updated in the stats reset function. This makes the stats reset function
more reliable.

As stats are only written by a single thread, we can remove the 'volatile'
qualifier, which should improve performance in the datapath.

Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
---
Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Morten Brørup <mb@smartsharesystems.com>

This update was triggered by the mailing list discussion [1].

[1]
https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/

v2:
* Remove wrapping check for stats
---
 drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
 1 file changed, 41 insertions(+), 25 deletions(-)
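
A minimal sketch of the offset-based scheme described above, using the
patch's field names and reduced to a single counter:

	/* datapath: the only writer of rx_pkts */
	rxq->rx_pkts += nb_rx;

	/* eth_stats_reset(): the only writer of rx_pkts_offset */
	rxq->rx_pkts_offset = rxq->rx_pkts;

	/* eth_stats_get(): report the diff */
	stats->q_ipackets[i] = rxq->rx_pkts - rxq->rx_pkts_offset;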
  

Comments

Morten Brørup April 26, 2024, 2:47 p.m. UTC | #1
> +static uint64_t
> +stats_get_diff(uint64_t stats, uint64_t offset)
> +{
> +	return stats - offset;
> +}

No need for this function; just subtract inline instead.
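
I.e. (a sketch of the inline form, using the patch's field names):

	stats->q_ipackets[i] = rxq->rx_pkts - rxq->rx_pkts_offset;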

With the suggested change,
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
  
Mattias Rönnblom April 28, 2024, 3:11 p.m. UTC | #2
On 2024-04-26 16:38, Ferruh Yigit wrote:
> For stats reset, use an offset instead of zeroing out actual stats values,
> get_stats() displays diff between stats and offset.
> This way stats only updated in datapath and offset only updated in stats
> reset function. This makes stats reset function more reliable.
> 
> As stats only written by single thread, we can remove 'volatile' qualifier
> which should improve the performance in datapath.
> 

volatile wouldn't help you if you had multiple writers, so that can't be 
the reason for its removal. It would be more accurate to say it should 
be replaced with atomic updates. If you don't use volatile and don't use 
atomics, you have to consider whether the compiler can reach the conclusion 
that it does not need to store the counter value for future use *for 
that thread*, since otherwise the store does not actually need to occur. 
Since DPDK statistics tend to work, it's pretty obvious that current 
compilers tend not to reach this conclusion.

If this should be done 100% properly, the update operation should be a 
non-atomic load, non-atomic add, and an atomic store. Similarly, for the 
reset, the offset store should be atomic.
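
In code, that update could look like this (a sketch in the same GCC
built-in style as the snippets below; counter_update is an illustrative
name):

static void
counter_update(uint64_t *counter, uint64_t operand)
{
	uint64_t count = *counter;	/* non-atomic load */

	count += operand;		/* non-atomic add */

	/* atomic store, so the compiler cannot elide or defer it */
	__atomic_store_n(counter, count, __ATOMIC_RELAXED);
}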

Considered the state of the rest of the DPDK code base, I think a 
non-atomic, non-volatile solution is also fine.

(That said, I think we're better off just deprecating stats reset 
altogether, and returning -ENOTSUP here meanwhile.)

> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
> ---
> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Morten Brørup <mb@smartsharesystems.com>
> 
> This update triggered by mail list discussion [1].
> 
> [1]
> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
> 
> v2:
> * Remove wrapping check for stats
> ---
>   drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>   1 file changed, 41 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index 397a32db5886..10c8e1e50139 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>   	uint16_t in_port;
>   	uint8_t vlan_strip;
>   
> -	volatile unsigned long rx_pkts;
> -	volatile unsigned long rx_bytes;
> +	uint64_t rx_pkts;
> +	uint64_t rx_bytes;
> +	uint64_t rx_pkts_offset;
> +	uint64_t rx_bytes_offset;

I suggest you introduce a separate struct for reset-able counters. It'll 
make things cleaner, and you can sneak in atomics without too much 
atomics-related bloat.

struct counter
{
	uint64_t count;
	uint64_t offset;
};

/../
	struct counter rx_pkts;
	struct counter rx_bytes;
/../

static uint64_t
counter_value(struct counter *counter)
{
	uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
	uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);

	return count - offset;
}

static void
counter_reset(struct counter *counter)
{
	uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);

	__atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
	__atomic_store_n(&counter->count, counter->count + operand, 
__ATOMIC_RELAXED);
}

You'd have to port this to <rte_stdatomic.h> calls, which prevents 
non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be 
replaced with explicit relaxed non-atomic load. Otherwise, if you just 
use "counter->count", that would be an atomic load with sequential 
consistency memory order on C11 atomics-based builds, which would result 
in a barrier, at least on weakly ordered machines (e.g., ARM).
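
A sketch of such a port (assuming the RTE_ATOMIC()/rte_atomic_*_explicit()
names from <rte_stdatomic.h>):

#include <rte_stdatomic.h>

struct counter
{
	RTE_ATOMIC(uint64_t) count;
	RTE_ATOMIC(uint64_t) offset;
};

static uint64_t
counter_value(struct counter *counter)
{
	/* explicit relaxed loads; a plain "counter->count" would be a
	 * seq-cst load on enable_stdatomic=true builds */
	uint64_t count = rte_atomic_load_explicit(&counter->count,
			rte_memory_order_relaxed);
	uint64_t offset = rte_atomic_load_explicit(&counter->offset,
			rte_memory_order_relaxed);

	return count - offset;
}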

I would still use a struct and some helper-functions even for the less 
ambitious, non-atomic variant.

The only drawback of using GCC built-in type atomics here, versus an 
atomic- and volatile-free approach, is that current compilers seem to 
refuse to merge atomic stores. It's beyond me why this is the case. If 
you store to a variable twice in quick succession, it'll be two store 
machine instructions, even in cases where the compiler *knows* the value 
is identical. So volatile, even though you didn't ask for it. Weird.

So if you have a loop, you may want to make a "counter_add()" call at the 
end from a temporary, to get the final 0.001% of performance.
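
For example (a sketch; the loop variables are illustrative, counter_add
as above):

	uint64_t n_bytes = 0;

	for (i = 0; i < nb_pkts; i++)
		n_bytes += rte_pktmbuf_pkt_len(bufs[i]);

	counter_add(&rxq->rx_bytes, n_bytes);	/* single store at loop end */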

If the tech board thinks MT-safe reset-able software-managed statistics 
is the future (as opposed to dropping reset support, for example), I 
think this stuff should go into a separate header file, so other PMDs 
can reuse it. Maybe out of scope for this patch.

>   };
>   
>   struct pkt_tx_queue {
> @@ -64,9 +66,12 @@ struct pkt_tx_queue {
>   	unsigned int framecount;
>   	unsigned int framenum;
>   
> -	volatile unsigned long tx_pkts;
> -	volatile unsigned long err_pkts;
> -	volatile unsigned long tx_bytes;
> +	uint64_t tx_pkts;
> +	uint64_t err_pkts;
> +	uint64_t tx_bytes;
> +	uint64_t tx_pkts_offset;
> +	uint64_t err_pkts_offset;
> +	uint64_t tx_bytes_offset;
>   };
>   
>   struct pmd_internals {
> @@ -385,8 +390,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>   	return 0;
>   }
>   
> +
> +static uint64_t
> +stats_get_diff(uint64_t stats, uint64_t offset)
> +{
> +	return stats - offset;
> +}
> +
>   static int
> -eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>   {
>   	unsigned i, imax;
>   	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
> @@ -396,27 +408,29 @@ eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
>   	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>   	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>   	for (i = 0; i < imax; i++) {
> -		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
> -		igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
> -		rx_total += igb_stats->q_ipackets[i];
> -		rx_bytes_total += igb_stats->q_ibytes[i];
> +		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
> +		stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts, rxq->rx_pkts_offset);
> +		stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes, rxq->rx_bytes_offset);
> +		rx_total += stats->q_ipackets[i];
> +		rx_bytes_total += stats->q_ibytes[i];
>   	}
>   
>   	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>   	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>   	for (i = 0; i < imax; i++) {
> -		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
> -		igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
> -		tx_total += igb_stats->q_opackets[i];
> -		tx_err_total += internal->tx_queue[i].err_pkts;
> -		tx_bytes_total += igb_stats->q_obytes[i];
> +		struct pkt_tx_queue *txq = &internal->tx_queue[i];
> +		stats->q_opackets[i] = stats_get_diff(txq->tx_pkts, txq->tx_pkts_offset);
> +		stats->q_obytes[i] = stats_get_diff(txq->tx_bytes, txq->tx_bytes_offset);
> +		tx_total += stats->q_opackets[i];
> +		tx_err_total += stats_get_diff(txq->err_pkts, txq->err_pkts_offset);
> +		tx_bytes_total += stats->q_obytes[i];
>   	}
>   
> -	igb_stats->ipackets = rx_total;
> -	igb_stats->ibytes = rx_bytes_total;
> -	igb_stats->opackets = tx_total;
> -	igb_stats->oerrors = tx_err_total;
> -	igb_stats->obytes = tx_bytes_total;
> +	stats->ipackets = rx_total;
> +	stats->ibytes = rx_bytes_total;
> +	stats->opackets = tx_total;
> +	stats->oerrors = tx_err_total;
> +	stats->obytes = tx_bytes_total;
>   	return 0;
>   }
>   
> @@ -427,14 +441,16 @@ eth_stats_reset(struct rte_eth_dev *dev)
>   	struct pmd_internals *internal = dev->data->dev_private;
>   
>   	for (i = 0; i < internal->nb_queues; i++) {
> -		internal->rx_queue[i].rx_pkts = 0;
> -		internal->rx_queue[i].rx_bytes = 0;
> +		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
> +		rxq->rx_pkts_offset = rxq->rx_pkts;
> +		rxq->rx_bytes_offset = rxq->rx_bytes;
>   	}
>   
>   	for (i = 0; i < internal->nb_queues; i++) {
> -		internal->tx_queue[i].tx_pkts = 0;
> -		internal->tx_queue[i].err_pkts = 0;
> -		internal->tx_queue[i].tx_bytes = 0;
> +		struct pkt_tx_queue *txq = &internal->tx_queue[i];
> +		txq->tx_pkts_offset = txq->tx_pkts;
> +		txq->err_pkts_offset = txq->err_pkts;
> +		txq->tx_bytes_offset = txq->tx_bytes;
>   	}
>   
>   	return 0;
  
Ferruh Yigit May 1, 2024, 4:19 p.m. UTC | #3
On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
> On 2024-04-26 16:38, Ferruh Yigit wrote:
>> For stats reset, use an offset instead of zeroing out actual stats
>> values,
>> get_stats() displays diff between stats and offset.
>> This way stats only updated in datapath and offset only updated in stats
>> reset function. This makes stats reset function more reliable.
>>
>> As stats only written by single thread, we can remove 'volatile'
>> qualifier
>> which should improve the performance in datapath.
>>
> 
> volatile wouldn't help you if you had multiple writers, so that can't be
> the reason for its removal. It would be more accurate to say it should
> be replaced with atomic updates. If you don't use volatile and don't use
> atomics, you have to consider if the compiler can reach the conclusion
> that it does not need to store the counter value for future use *for
> that thread*. Since otherwise, I don't think the store actually needs to
> occur. Since DPDK statistics tend to work, it's pretty obvious that
> current compilers tend not to reach this conclusion.
> 

Thanks Mattias for clarifying why we need volatile or atomics even with
a single writer.

> If this should be done 100% properly, the update operation should be a
> non-atomic load, non-atomic add, and an atomic store. Similarly, for the
> reset, the offset store should be atomic.
> 

ack

> Considered the state of the rest of the DPDK code base, I think a
> non-atomic, non-volatile solution is also fine.
> 

Yes, this seems to work in practice, but I guess it is better to follow
the above suggestion.

> (That said, I think we're better off just deprecating stats reset
> altogether, and returning -ENOTSUP here meanwhile.)
> 

As long as reset is reliable (here I mean it resets stats on every call)
and doesn't impact datapath performance, I am for continuing with it.
Returning 'not supported' won't bring any benefit to users.

>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>> ---
>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>
>> This update triggered by mail list discussion [1].
>>
>> [1]
>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>
>> v2:
>> * Remove wrapping check for stats
>> ---
>>   drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>   1 file changed, 41 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>> b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db5886..10c8e1e50139 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>       uint16_t in_port;
>>       uint8_t vlan_strip;
>>   -    volatile unsigned long rx_pkts;
>> -    volatile unsigned long rx_bytes;
>> +    uint64_t rx_pkts;
>> +    uint64_t rx_bytes;
>> +    uint64_t rx_pkts_offset;
>> +    uint64_t rx_bytes_offset;
> 
> I suggest you introduce a separate struct for reset-able counters. It'll
> make things cleaner, and you can sneak in atomics without too much
> atomics-related bloat.
> 
> struct counter
> {
>     uint64_t count;
>     uint64_t offset;
> };
> 
> /../
>     struct counter rx_pkts;
>     struct counter rx_bytes;
> /../
> 
> static uint64_t
> counter_value(struct counter *counter)
> {
>     uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>     uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);
> 
>     return count - offset;
> }
> 
> static void
> counter_reset(struct counter *counter)
> {
>     uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
> 
>     __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
> }
> 
> static void
> counter_add(struct counter *counter, uint64_t operand)
> {
>     __atomic_store_n(&counter->count, counter->count + operand,
> __ATOMIC_RELAXED);
> }
> 

Ack for separate struct for reset-able counters.

> You'd have to port this to <rte_stdatomic.h> calls, which prevents
> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be
> replaced with explicit relaxed non-atomic load. Otherwise, if you just
> use "counter->count", that would be an atomic load with sequential
> consistency memory order on C11 atomics-based builds, which would result
> in a barrier, at least on weakly ordered machines (e.g., ARM).
> 

I am not sure I understand the above.
As the load and add will be non-atomic, why not access them directly, like:
`uint64_t count = counter->count;`

So my understanding is: remove `volatile`, do the load and add without
atomics, and only use a relaxed ordered atomic for the store (to ensure
the value in the register is stored to memory).

I will send a new version of the RFC with the above understanding.

> I would still use a struct and some helper-functions even for the less
> ambitious, non-atomic variant.
> 
> The only drawback of using GCC built-ins type atomics here, versus an
> atomic- and volatile-free approach, is that current compilers seems to
> refuse merging atomic stores. It's beyond me why this is the case. If
> you store to a variable twice in quick succession, it'll be two store
> machine instructions, even in cases where the compiler *knows* the value
> is identical. So volatile, even though you didn't ask for it. Weird.
> 
> So if you have a loop, you may want to make an "counter_add()" in the
> end from a temporary, to get the final 0.001% of performance.
> 

ack

I can't really say which one of the following is better (because of the
store on an empty poll), but I will keep it as it is (b.):

a.
for (i = 0; i < nb_pkt; i++) {
	stats += 1;
}


b.
for (i = 0; i < nb_pkt; i++) {
	tmp += 1;
}
stats += tmp;


c.
for (i = 0; i < nb_pkt; i++) {
	tmp += 1;
}
if (tmp)
	stats += tmp;



> If the tech board thinks MT-safe reset-able software-manage statistics
> is the future (as opposed to dropping reset support, for example), I
> think this stuff should go into a separate header file, so other PMDs
> can reuse it. Maybe out of scope for this patch.
> 

I don't think we need MT-safe reset; the patch is already out to
document the current status.
For HW, stats reset is already reliable, and for SW drivers the
offset-based approach can make it reliable.

Unless you explicitly asked for it, I don't think this is on the agenda
of the techboard.


>>   };
>>     struct pkt_tx_queue {
>> @@ -64,9 +66,12 @@ struct pkt_tx_queue {
>>       unsigned int framecount;
>>       unsigned int framenum;
>>   -    volatile unsigned long tx_pkts;
>> -    volatile unsigned long err_pkts;
>> -    volatile unsigned long tx_bytes;
>> +    uint64_t tx_pkts;
>> +    uint64_t err_pkts;
>> +    uint64_t tx_bytes;
>> +    uint64_t tx_pkts_offset;
>> +    uint64_t err_pkts_offset;
>> +    uint64_t tx_bytes_offset;
>>   };
>>     struct pmd_internals {
>> @@ -385,8 +390,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct
>> rte_eth_dev_info *dev_info)
>>       return 0;
>>   }
>>   +
>> +static uint64_t
>> +stats_get_diff(uint64_t stats, uint64_t offset)
>> +{
>> +    return stats - offset;
>> +}
>> +
>>   static int
>> -eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
>> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>>   {
>>       unsigned i, imax;
>>       unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
>> @@ -396,27 +408,29 @@ eth_stats_get(struct rte_eth_dev *dev, struct
>> rte_eth_stats *igb_stats)
>>       imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>               internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>       for (i = 0; i < imax; i++) {
>> -        igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
>> -        igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
>> -        rx_total += igb_stats->q_ipackets[i];
>> -        rx_bytes_total += igb_stats->q_ibytes[i];
>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>> +        stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts,
>> rxq->rx_pkts_offset);
>> +        stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes,
>> rxq->rx_bytes_offset);
>> +        rx_total += stats->q_ipackets[i];
>> +        rx_bytes_total += stats->q_ibytes[i];
>>       }
>>         imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>               internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>       for (i = 0; i < imax; i++) {
>> -        igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
>> -        igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
>> -        tx_total += igb_stats->q_opackets[i];
>> -        tx_err_total += internal->tx_queue[i].err_pkts;
>> -        tx_bytes_total += igb_stats->q_obytes[i];
>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>> +        stats->q_opackets[i] = stats_get_diff(txq->tx_pkts,
>> txq->tx_pkts_offset);
>> +        stats->q_obytes[i] = stats_get_diff(txq->tx_bytes,
>> txq->tx_bytes_offset);
>> +        tx_total += stats->q_opackets[i];
>> +        tx_err_total += stats_get_diff(txq->err_pkts,
>> txq->err_pkts_offset);
>> +        tx_bytes_total += stats->q_obytes[i];
>>       }
>>   -    igb_stats->ipackets = rx_total;
>> -    igb_stats->ibytes = rx_bytes_total;
>> -    igb_stats->opackets = tx_total;
>> -    igb_stats->oerrors = tx_err_total;
>> -    igb_stats->obytes = tx_bytes_total;
>> +    stats->ipackets = rx_total;
>> +    stats->ibytes = rx_bytes_total;
>> +    stats->opackets = tx_total;
>> +    stats->oerrors = tx_err_total;
>> +    stats->obytes = tx_bytes_total;
>>       return 0;
>>   }
>>   @@ -427,14 +441,16 @@ eth_stats_reset(struct rte_eth_dev *dev)
>>       struct pmd_internals *internal = dev->data->dev_private;
>>         for (i = 0; i < internal->nb_queues; i++) {
>> -        internal->rx_queue[i].rx_pkts = 0;
>> -        internal->rx_queue[i].rx_bytes = 0;
>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>> +        rxq->rx_pkts_offset = rxq->rx_pkts;
>> +        rxq->rx_bytes_offset = rxq->rx_bytes;
>>       }
>>         for (i = 0; i < internal->nb_queues; i++) {
>> -        internal->tx_queue[i].tx_pkts = 0;
>> -        internal->tx_queue[i].err_pkts = 0;
>> -        internal->tx_queue[i].tx_bytes = 0;
>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>> +        txq->tx_pkts_offset = txq->tx_pkts;
>> +        txq->err_pkts_offset = txq->err_pkts;
>> +        txq->tx_bytes_offset = txq->tx_bytes;
>>       }
>>         return 0;
  
Mattias Rönnblom May 2, 2024, 5:51 a.m. UTC | #4
On 2024-05-01 18:19, Ferruh Yigit wrote:
> On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>> For stats reset, use an offset instead of zeroing out actual stats
>>> values,
>>> get_stats() displays diff between stats and offset.
>>> This way stats only updated in datapath and offset only updated in stats
>>> reset function. This makes stats reset function more reliable.
>>>
>>> As stats only written by single thread, we can remove 'volatile'
>>> qualifier
>>> which should improve the performance in datapath.
>>>
>>
>> volatile wouldn't help you if you had multiple writers, so that can't be
>> the reason for its removal. It would be more accurate to say it should
>> be replaced with atomic updates. If you don't use volatile and don't use
>> atomics, you have to consider if the compiler can reach the conclusion
>> that it does not need to store the counter value for future use *for
>> that thread*. Since otherwise, I don't think the store actually needs to
>> occur. Since DPDK statistics tend to work, it's pretty obvious that
>> current compilers tend not to reach this conclusion.
>>
> 
> Thanks Mattias for clarifying why we need volatile or atomics even with
> single writer.
> 
>> If this should be done 100% properly, the update operation should be a
>> non-atomic load, non-atomic add, and an atomic store. Similarly, for the
>> reset, the offset store should be atomic.
>>
> 
> ack
> 
>> Considered the state of the rest of the DPDK code base, I think a
>> non-atomic, non-volatile solution is also fine.
>>
> 
> Yes, this seems working practically but I guess better to follow above
> suggestion.
> 
>> (That said, I think we're better off just deprecating stats reset
>> altogether, and returning -ENOTSUP here meanwhile.)
>>
> 
> As long as reset is reliable (here I mean it reset stats in every call)
> and doesn't impact datapath performance, I am for to continue with it.
> Returning non supported won't bring more benefit to users.
> 
>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>> ---
>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>
>>> This update triggered by mail list discussion [1].
>>>
>>> [1]
>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>
>>> v2:
>>> * Remove wrapping check for stats
>>> ---
>>>    drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>>    1 file changed, 41 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db5886..10c8e1e50139 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>>        uint16_t in_port;
>>>        uint8_t vlan_strip;
>>>    -    volatile unsigned long rx_pkts;
>>> -    volatile unsigned long rx_bytes;
>>> +    uint64_t rx_pkts;
>>> +    uint64_t rx_bytes;
>>> +    uint64_t rx_pkts_offset;
>>> +    uint64_t rx_bytes_offset;
>>
>> I suggest you introduce a separate struct for reset-able counters. It'll
>> make things cleaner, and you can sneak in atomics without too much
>> atomics-related bloat.
>>
>> struct counter
>> {
>>      uint64_t count;
>>      uint64_t offset;
>> };
>>
>> /../
>>      struct counter rx_pkts;
>>      struct counter rx_bytes;
>> /../
>>
>> static uint64_t
>> counter_value(struct counter *counter)
>> {
>>      uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>>      uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);
>>
>>      return count - offset;
>> }
>>
>> static void
>> counter_reset(struct counter *counter)
>> {
>>      uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>>
>>      __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>> }
>>
>> static void
>> counter_add(struct counter *counter, uint64_t operand)
>> {
>>      __atomic_store_n(&counter->count, counter->count + operand,
>> __ATOMIC_RELAXED);
>> }
>>
> 
> Ack for separate struct for reset-able counters.
> 
>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be
>> replaced with explicit relaxed non-atomic load. Otherwise, if you just
>> use "counter->count", that would be an atomic load with sequential
>> consistency memory order on C11 atomics-based builds, which would result
>> in a barrier, at least on weakly ordered machines (e.g., ARM).
>>
> 
> I am not sure I understand above.
> As load and add will be non-atomic, why not access them directly, like:
> `uint64_t count = counter->count;`
> 

In case count is _Atomic (i.e., on enable_stdatomic=true builds), "count 
= counter->count" will imply a memory barrier. On x86_64, I think it 
will "only" be a compiler barrier (i.e., preventing optimization). On 
weakly ordered machines, it will result in a barrier-instruction (or an 
instruction-which-is-also-a-barrier, like in the example below).

#include <stdatomic.h>

int relaxed_load(_Atomic int *p)
{
     return atomic_load_explicit(p, memory_order_relaxed);
}

int direct_load(_Atomic int *p)
{
     return *p;
}

GCC 13.2 ARM64 ->

relaxed_load:
         ldr     w0, [x0]
         ret
direct_load:
         ldar    w0, [x0]
         ret

> So my understanding is, remove `volatile`, load and add without atomics,
> and only use relaxed ordered atomics for store (to ensure value in
> register stored to memory).
> 

Yes, that would be the best option, would the DPDK atomics API allow its 
implementation - but it doesn't. At least not if you care about what 
happens in enable_stdatomic=true builds.

The second-best option is to use a rte_memory_order_relaxed atomic load, 
a regular non-atomic add, and a relaxed atomic store.
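
For the update, that would be something like (a sketch, using the
<rte_stdatomic.h> names):

static void
counter_add(struct counter *counter, uint64_t operand)
{
	/* relaxed atomic load + regular add + relaxed atomic store */
	uint64_t count = rte_atomic_load_explicit(&counter->count,
			rte_memory_order_relaxed);

	rte_atomic_store_explicit(&counter->count, count + operand,
			rte_memory_order_relaxed);
}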

> I will send a new version of RFC with above understanding.
> 
>> I would still use a struct and some helper-functions even for the less
>> ambitious, non-atomic variant.
>>
>> The only drawback of using GCC built-ins type atomics here, versus an
>> atomic- and volatile-free approach, is that current compilers seems to
>> refuse merging atomic stores. It's beyond me why this is the case. If
>> you store to a variable twice in quick succession, it'll be two store
>> machine instructions, even in cases where the compiler *knows* the value
>> is identical. So volatile, even though you didn't ask for it. Weird.
>>
>> So if you have a loop, you may want to make an "counter_add()" in the
>> end from a temporary, to get the final 0.001% of performance.
>>
> 
> ack
> 
> I can't really say which one of the following is better (because of
> store in empty poll), but I will keep it as it is (b.):
> 
> a.
> for (i < nb_pkt) {
> 	stats =+ 1;
> }
> 
> 
> b.
> for (i < nb_pkt) {
> 	tmp =+ 1;
> }
> stats += tmp;
> 
> 
> c.
> for (i < nb_pkt) {
> 	tmp =+ 1;
> }
> if (tmp)
> 	stats += tmp;
> 
> 
> 
>> If the tech board thinks MT-safe reset-able software-manage statistics
>> is the future (as opposed to dropping reset support, for example), I
>> think this stuff should go into a separate header file, so other PMDs
>> can reuse it. Maybe out of scope for this patch.
>>
> 
> I don't think we need MT-safe reset, the patch is already out to
> document current status.

Well, what you are working on is an MT-safe reset, in the sense that it 
allows one (1) resetting thread to properly synchronize with multiple 
concurrent counter-updating threads.

It's not going to be completely MT safe, since you can't have two 
threads calling the reset function in parallel.

Any change to the API should make this clear.

> For HW stats reset is already reliable and for SW drives offset based
> approach can make is reliable.
> 
> Unless you explicitly asked for it, I don't think this is in the agenda
> of the techboard.
> 
> 
>>>    };
>>>      struct pkt_tx_queue {
>>> @@ -64,9 +66,12 @@ struct pkt_tx_queue {
>>>        unsigned int framecount;
>>>        unsigned int framenum;
>>>    -    volatile unsigned long tx_pkts;
>>> -    volatile unsigned long err_pkts;
>>> -    volatile unsigned long tx_bytes;
>>> +    uint64_t tx_pkts;
>>> +    uint64_t err_pkts;
>>> +    uint64_t tx_bytes;
>>> +    uint64_t tx_pkts_offset;
>>> +    uint64_t err_pkts_offset;
>>> +    uint64_t tx_bytes_offset;
>>>    };
>>>      struct pmd_internals {
>>> @@ -385,8 +390,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct
>>> rte_eth_dev_info *dev_info)
>>>        return 0;
>>>    }
>>>    +
>>> +static uint64_t
>>> +stats_get_diff(uint64_t stats, uint64_t offset)
>>> +{
>>> +    return stats - offset;
>>> +}
>>> +
>>>    static int
>>> -eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
>>> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>>>    {
>>>        unsigned i, imax;
>>>        unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
>>> @@ -396,27 +408,29 @@ eth_stats_get(struct rte_eth_dev *dev, struct
>>> rte_eth_stats *igb_stats)
>>>        imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>>                internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>>        for (i = 0; i < imax; i++) {
>>> -        igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
>>> -        igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
>>> -        rx_total += igb_stats->q_ipackets[i];
>>> -        rx_bytes_total += igb_stats->q_ibytes[i];
>>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>>> +        stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts,
>>> rxq->rx_pkts_offset);
>>> +        stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes,
>>> rxq->rx_bytes_offset);
>>> +        rx_total += stats->q_ipackets[i];
>>> +        rx_bytes_total += stats->q_ibytes[i];
>>>        }
>>>          imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>>                internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>>        for (i = 0; i < imax; i++) {
>>> -        igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
>>> -        igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
>>> -        tx_total += igb_stats->q_opackets[i];
>>> -        tx_err_total += internal->tx_queue[i].err_pkts;
>>> -        tx_bytes_total += igb_stats->q_obytes[i];
>>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>>> +        stats->q_opackets[i] = stats_get_diff(txq->tx_pkts,
>>> txq->tx_pkts_offset);
>>> +        stats->q_obytes[i] = stats_get_diff(txq->tx_bytes,
>>> txq->tx_bytes_offset);
>>> +        tx_total += stats->q_opackets[i];
>>> +        tx_err_total += stats_get_diff(txq->err_pkts,
>>> txq->err_pkts_offset);
>>> +        tx_bytes_total += stats->q_obytes[i];
>>>        }
>>>    -    igb_stats->ipackets = rx_total;
>>> -    igb_stats->ibytes = rx_bytes_total;
>>> -    igb_stats->opackets = tx_total;
>>> -    igb_stats->oerrors = tx_err_total;
>>> -    igb_stats->obytes = tx_bytes_total;
>>> +    stats->ipackets = rx_total;
>>> +    stats->ibytes = rx_bytes_total;
>>> +    stats->opackets = tx_total;
>>> +    stats->oerrors = tx_err_total;
>>> +    stats->obytes = tx_bytes_total;
>>>        return 0;
>>>    }
>>>    @@ -427,14 +441,16 @@ eth_stats_reset(struct rte_eth_dev *dev)
>>>        struct pmd_internals *internal = dev->data->dev_private;
>>>          for (i = 0; i < internal->nb_queues; i++) {
>>> -        internal->rx_queue[i].rx_pkts = 0;
>>> -        internal->rx_queue[i].rx_bytes = 0;
>>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>>> +        rxq->rx_pkts_offset = rxq->rx_pkts;
>>> +        rxq->rx_bytes_offset = rxq->rx_bytes;
>>>        }
>>>          for (i = 0; i < internal->nb_queues; i++) {
>>> -        internal->tx_queue[i].tx_pkts = 0;
>>> -        internal->tx_queue[i].err_pkts = 0;
>>> -        internal->tx_queue[i].tx_bytes = 0;
>>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>>> +        txq->tx_pkts_offset = txq->tx_pkts;
>>> +        txq->err_pkts_offset = txq->err_pkts;
>>> +        txq->tx_bytes_offset = txq->tx_bytes;
>>>        }
>>>          return 0;
>
  
Ferruh Yigit May 2, 2024, 2:22 p.m. UTC | #5
On 5/2/2024 6:51 AM, Mattias Rönnblom wrote:
> On 2024-05-01 18:19, Ferruh Yigit wrote:
>> On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
>>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>>> For stats reset, use an offset instead of zeroing out actual stats
>>>> values,
>>>> get_stats() displays diff between stats and offset.
>>>> This way stats only updated in datapath and offset only updated in
>>>> stats
>>>> reset function. This makes stats reset function more reliable.
>>>>
>>>> As stats only written by single thread, we can remove 'volatile'
>>>> qualifier
>>>> which should improve the performance in datapath.
>>>>
>>>
>>> volatile wouldn't help you if you had multiple writers, so that can't be
>>> the reason for its removal. It would be more accurate to say it should
>>> be replaced with atomic updates. If you don't use volatile and don't use
>>> atomics, you have to consider if the compiler can reach the conclusion
>>> that it does not need to store the counter value for future use *for
>>> that thread*. Since otherwise, I don't think the store actually needs to
>>> occur. Since DPDK statistics tend to work, it's pretty obvious that
>>> current compilers tend not to reach this conclusion.
>>>
>>
>> Thanks Mattias for clarifying why we need volatile or atomics even with
>> single writer.
>>
>>> If this should be done 100% properly, the update operation should be a
>>> non-atomic load, non-atomic add, and an atomic store. Similarly, for the
>>> reset, the offset store should be atomic.
>>>
>>
>> ack
>>
>>> Considered the state of the rest of the DPDK code base, I think a
>>> non-atomic, non-volatile solution is also fine.
>>>
>>
>> Yes, this seems working practically but I guess better to follow above
>> suggestion.
>>
>>> (That said, I think we're better off just deprecating stats reset
>>> altogether, and returning -ENOTSUP here meanwhile.)
>>>
>>
>> As long as reset is reliable (here I mean it reset stats in every call)
>> and doesn't impact datapath performance, I am for to continue with it.
>> Returning non supported won't bring more benefit to users.
>>
>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>>> ---
>>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>>
>>>> This update triggered by mail list discussion [1].
>>>>
>>>> [1]
>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>>
>>>> v2:
>>>> * Remove wrapping check for stats
>>>> ---
>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 66
>>>> ++++++++++++++---------
>>>>    1 file changed, 41 insertions(+), 25 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db5886..10c8e1e50139 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>>>        uint16_t in_port;
>>>>        uint8_t vlan_strip;
>>>>    -    volatile unsigned long rx_pkts;
>>>> -    volatile unsigned long rx_bytes;
>>>> +    uint64_t rx_pkts;
>>>> +    uint64_t rx_bytes;
>>>> +    uint64_t rx_pkts_offset;
>>>> +    uint64_t rx_bytes_offset;
>>>
>>> I suggest you introduce a separate struct for reset-able counters. It'll
>>> make things cleaner, and you can sneak in atomics without too much
>>> atomics-related bloat.
>>>
>>> struct counter
>>> {
>>>      uint64_t count;
>>>      uint64_t offset;
>>> };
>>>
>>> /../
>>>      struct counter rx_pkts;
>>>      struct counter rx_bytes;
>>> /../
>>>
>>> static uint64_t
>>> counter_value(struct counter *counter)
>>> {
>>>      uint64_t count = __atomic_load_n(&counter->count,
>>> __ATOMIC_RELAXED);
>>>      uint64_t offset = __atomic_load_n(&counter->offset,
>>> __ATOMIC_RELAXED);
>>>
>>>      return count - offset;
>>> }
>>>
>>> static void
>>> counter_reset(struct counter *counter)
>>> {
>>>      uint64_t count = __atomic_load_n(&counter->count,
>>> __ATOMIC_RELAXED);
>>>
>>>      __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>>> }
>>>
>>> static void
>>> counter_add(struct counter *counter, uint64_t operand)
>>> {
>>>      __atomic_store_n(&counter->count, counter->count + operand,
>>> __ATOMIC_RELAXED);
>>> }
>>>
>>
>> Ack for separate struct for reset-able counters.
>>
>>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be
>>> replaced with explicit relaxed non-atomic load. Otherwise, if you just
>>> use "counter->count", that would be an atomic load with sequential
>>> consistency memory order on C11 atomics-based builds, which would result
>>> in a barrier, at least on weakly ordered machines (e.g., ARM).
>>>
>>
>> I am not sure I understand above.
>> As load and add will be non-atomic, why not access them directly, like:
>> `uint64_t count = counter->count;`
>>
> 
> In case count is _Atomic (i.e., on enable_stdatomic=true builds), "count
> = counter->count" will imply a memory barrier. On x86_64, I think it
> will "only" be a compiler barrier (i.e., preventing optimization). On
> weakly ordered machines, it will result in a barrier-instruction (or an
> instruction-which-is-also-a-barrier, like in the example below).
> 
> #include <stdatomic.h>
> 
> int relaxed_load(_Atomic int *p)
> {
>     return atomic_load_explicit(p, memory_order_relaxed);
> }
> 
> int direct_load(_Atomic int *p)
> {
>     return *p;
> }
> 
> GCC 13.2 ARM64 ->
> 
> relaxed_load:
>         ldr     w0, [x0]
>         ret
> direct_load:
>         ldar    w0, [x0]
>         ret
> 
>


Do we need to declare count as '_Atomic'? I wasn't planning to make the
variable _Atomic; this way the assignment won't introduce any memory barrier.


>> So my understanding is, remove `volatile`, load and add without atomics,
>> and only use relaxed ordered atomics for store (to ensure value in
>> register stored to memory).
>>
> 
> Yes, that would be the best option, would the DPDK atomics API allow its
> implementation - but it doesn't. At least not if you care about what
> happens in enable_stdatomic=true builds.
> 
> The second-best option is to use a rte_memory_order_relaxed atomic load,
> a regular non-atomic add, and a relaxed atomic store.
> 
>> I will send a new version of RFC with above understanding.
>>
>>> I would still use a struct and some helper-functions even for the less
>>> ambitious, non-atomic variant.
>>>
>>> The only drawback of using GCC built-ins type atomics here, versus an
>>> atomic- and volatile-free approach, is that current compilers seems to
>>> refuse merging atomic stores. It's beyond me why this is the case. If
>>> you store to a variable twice in quick succession, it'll be two store
>>> machine instructions, even in cases where the compiler *knows* the value
>>> is identical. So volatile, even though you didn't ask for it. Weird.
>>>
>>> So if you have a loop, you may want to make an "counter_add()" in the
>>> end from a temporary, to get the final 0.001% of performance.
>>>
>>
>> ack
>>
>> I can't really say which one of the following is better (because of
>> store in empty poll), but I will keep it as it is (b.):
>>
>> a.
>> for (i = 0; i < nb_pkt; i++) {
>>     stats += 1;
>> }
>>
>>
>> b.
>> for (i = 0; i < nb_pkt; i++) {
>>     tmp += 1;
>> }
>> stats += tmp;
>>
>>
>> c.
>> for (i = 0; i < nb_pkt; i++) {
>>     tmp += 1;
>> }
>> if (tmp)
>>     stats += tmp;
>>
>>
>>
>>> If the tech board thinks MT-safe reset-able software-manage statistics
>>> is the future (as opposed to dropping reset support, for example), I
>>> think this stuff should go into a separate header file, so other PMDs
>>> can reuse it. Maybe out of scope for this patch.
>>>
>>
>> I don't think we need MT-safe reset, the patch is already out to
>> document current status.
> 
> Well, what you are working on is a MT-safe reset, in the sense it allows
> for one (1) resetting thread properly synchronize with multiple
> concurrent counter-updating threads.
> 
> It's not going to be completely MT safe, since you can't have two
> threads calling the reset function in parallel.
> 

This is what I meant by "MT-safe reset": multiple threads are not
allowed to call stats reset in parallel.

And for the case of multiple concurrent counter-updating threads, the
suggestion is to stop forwarding.

The above two are the updates I added to the 'rte_eth_stats_reset()' API,
and I believe we can continue with this restriction for the API.

> Any change to the API should make this clear.
> 
>> For HW stats reset is already reliable and for SW drives offset based
>> approach can make is reliable.
>>
>> Unless you explicitly asked for it, I don't think this is in the agenda
>> of the techboard.
>>
>>
  
Stephen Hemminger May 2, 2024, 3:59 p.m. UTC | #6
On Thu, 2 May 2024 15:22:35 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > 
> > It's not going to be completely MT safe, since you can't have two
> > threads calling the reset function in parallel.
> >   
> 
> This is what I meant with "MT-safe reset", so multiple threads not
> allowed to call stats reset in parallel.
> 
> And for multiple concurrent counter-updating threads case, suggestion is
> to stop forwarding.
> 
> Above two are the update I added to 'rte_eth_stats_reset()' API, and

I can't see the point of a fully MT-safe reset. No other driver does it.
The ethdev control-side APIs are not MT-safe, and making them truly
MT-safe would add additional ref counting and locking.
  
Mattias Rönnblom May 2, 2024, 5:37 p.m. UTC | #7
On 2024-05-02 16:22, Ferruh Yigit wrote:
> On 5/2/2024 6:51 AM, Mattias Rönnblom wrote:
>> On 2024-05-01 18:19, Ferruh Yigit wrote:
>>> On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
>>>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>>>> For stats reset, use an offset instead of zeroing out actual stats
>>>>> values,
>>>>> get_stats() displays diff between stats and offset.
>>>>> This way stats only updated in datapath and offset only updated in
>>>>> stats
>>>>> reset function. This makes stats reset function more reliable.
>>>>>
>>>>> As stats only written by single thread, we can remove 'volatile'
>>>>> qualifier
>>>>> which should improve the performance in datapath.
>>>>>
>>>>
>>>> volatile wouldn't help you if you had multiple writers, so that can't be
>>>> the reason for its removal. It would be more accurate to say it should
>>>> be replaced with atomic updates. If you don't use volatile and don't use
>>>> atomics, you have to consider if the compiler can reach the conclusion
>>>> that it does not need to store the counter value for future use *for
>>>> that thread*. Since otherwise, I don't think the store actually needs to
>>>> occur. Since DPDK statistics tend to work, it's pretty obvious that
>>>> current compilers tend not to reach this conclusion.
>>>>
>>>
>>> Thanks Mattias for clarifying why we need volatile or atomics even with
>>> single writer.
>>>
>>>> If this should be done 100% properly, the update operation should be a
>>>> non-atomic load, non-atomic add, and an atomic store. Similarly, for the
>>>> reset, the offset store should be atomic.
>>>>
>>>
>>> ack
>>>
>>>> Considered the state of the rest of the DPDK code base, I think a
>>>> non-atomic, non-volatile solution is also fine.
>>>>
>>>
>>> Yes, this seems working practically but I guess better to follow above
>>> suggestion.
>>>
>>>> (That said, I think we're better off just deprecating stats reset
>>>> altogether, and returning -ENOTSUP here meanwhile.)
>>>>
>>>
>>> As long as reset is reliable (here I mean it reset stats in every call)
>>> and doesn't impact datapath performance, I am for to continue with it.
>>> Returning non supported won't bring more benefit to users.
>>>
>>>>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>>>>> ---
>>>>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>>>>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>>>>
>>>>> This update triggered by mail list discussion [1].
>>>>>
>>>>> [1]
>>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>>>
>>>>> v2:
>>>>> * Remove wrapping check for stats
>>>>> ---
>>>>>     drivers/net/af_packet/rte_eth_af_packet.c | 66
>>>>> ++++++++++++++---------
>>>>>     1 file changed, 41 insertions(+), 25 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db5886..10c8e1e50139 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>>>>         uint16_t in_port;
>>>>>         uint8_t vlan_strip;
>>>>>     -    volatile unsigned long rx_pkts;
>>>>> -    volatile unsigned long rx_bytes;
>>>>> +    uint64_t rx_pkts;
>>>>> +    uint64_t rx_bytes;
>>>>> +    uint64_t rx_pkts_offset;
>>>>> +    uint64_t rx_bytes_offset;
>>>>
>>>> I suggest you introduce a separate struct for reset-able counters. It'll
>>>> make things cleaner, and you can sneak in atomics without too much
>>>> atomics-related bloat.
>>>>
>>>> struct counter
>>>> {
>>>>       uint64_t count;
>>>>       uint64_t offset;
>>>> };
>>>>
>>>> /../
>>>>       struct counter rx_pkts;
>>>>       struct counter rx_bytes;
>>>> /../
>>>>
>>>> static uint64_t
>>>> counter_value(struct counter *counter)
>>>> {
>>>>       uint64_t count = __atomic_load_n(&counter->count,
>>>> __ATOMIC_RELAXED);
>>>>       uint64_t offset = __atomic_load_n(&counter->offset,
>>>> __ATOMIC_RELAXED);
>>>>
>>>>       return count - offset;
>>>> }
>>>>
>>>> static void
>>>> counter_reset(struct counter *counter)
>>>> {
>>>>       uint64_t count = __atomic_load_n(&counter->count,
>>>> __ATOMIC_RELAXED);
>>>>
>>>>       __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>>>> }
>>>>
>>>> static void
>>>> counter_add(struct counter *counter, uint64_t operand)
>>>> {
>>>>       __atomic_store_n(&counter->count, counter->count + operand,
>>>> __ATOMIC_RELAXED);
>>>> }
>>>>
>>>
>>> Ack for separate struct for reset-able counters.
>>>
>>>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>>>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be
>>>> replaced with explicit relaxed non-atomic load. Otherwise, if you just
>>>> use "counter->count", that would be an atomic load with sequential
>>>> consistency memory order on C11 atomics-based builds, which would result
>>>> in a barrier, at least on weakly ordered machines (e.g., ARM).
>>>>
>>>
>>> I am not sure I understand above.
>>> As load and add will be non-atomic, why not access them directly, like:
>>> `uint64_t count = counter->count;`
>>>
>>
>> In case count is _Atomic (i.e., on enable_stdatomic=true builds), "count
>> = counter->count" will imply a memory barrier. On x86_64, I think it
>> will "only" be a compiler barrier (i.e., preventing optimization). On
>> weakly ordered machines, it will result in a barrier-instruction (or an
>> instruction-which-is-also-a-barrier, like in the example below).
>>
>> #include <stdatomic.h>
>>
>> int relaxed_load(_Atomic int *p)
>> {
>>      return atomic_load_explicit(p, memory_order_relaxed);
>> }
>>
>> int direct_load(_Atomic int *p)
>> {
>>      return *p;
>> }
>>
>> GCC 13.2 ARM64 ->
>>
>> relaxed_load:
>>          ldr     w0, [x0]
>>          ret
>> direct_load:
>>          ldar    w0, [x0]
>>          ret
>>
>>
> 
> 
> Do we need to declare count as '_Atomic', I wasn't planning to make
> variable _Atomic. This way assignment won't introduce any memory barrier.
> 

To use atomics in DPDK, the current requirement seems to be to use 
RTE_ATOMIC(). That macro expands to _Atomic in enable_stdatomic=true 
builds, and nothing otherwise.
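
Roughly, that is (a simplified sketch of the <rte_stdatomic.h> definitions):

#ifdef RTE_ENABLE_STDATOMIC
#define RTE_ATOMIC(type) _Atomic(type)
#else
#define RTE_ATOMIC(type) type
#endif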

Carefully crafted code using atomics will achieve the same performance 
and be more correct than the non-atomic variant. However, in practice, I 
think the non-atomic variant is very likely to produce the desired results.

> 
>>> So my understanding is, remove `volatile`, load and add without atomics,
>>> and only use relaxed ordered atomics for store (to ensure value in
>>> register stored to memory).
>>>
>>
>> Yes, that would be the best option, would the DPDK atomics API allow its
>> implementation - but it doesn't. At least not if you care about what
>> happens in enable_stdatomic=true builds.
>>
>> The second-best option is to use a rte_memory_order_relaxed atomic load,
>> a regular non-atomic add, and a relaxed atomic store.
>>
>>> I will send a new version of RFC with above understanding.
>>>
>>>> I would still use a struct and some helper-functions even for the less
>>>> ambitious, non-atomic variant.
>>>>
>>>> The only drawback of using GCC built-ins type atomics here, versus an
>>>> atomic- and volatile-free approach, is that current compilers seems to
>>>> refuse merging atomic stores. It's beyond me why this is the case. If
>>>> you store to a variable twice in quick succession, it'll be two store
>>>> machine instructions, even in cases where the compiler *knows* the value
>>>> is identical. So volatile, even though you didn't ask for it. Weird.
>>>>
>>>> So if you have a loop, you may want to make an "counter_add()" in the
>>>> end from a temporary, to get the final 0.001% of performance.
>>>>
>>>
>>> ack
>>>
>>> I can't really say which one of the following is better (because of
>>> store in empty poll), but I will keep it as it is (b.):
>>>
>>> a.
>>> for (i = 0; i < nb_pkt; i++) {
>>>      stats += 1;
>>> }
>>>
>>>
>>> b.
>>> for (i = 0; i < nb_pkt; i++) {
>>>      tmp += 1;
>>> }
>>> stats += tmp;
>>>
>>>
>>> c.
>>> for (i = 0; i < nb_pkt; i++) {
>>>      tmp += 1;
>>> }
>>> if (tmp)
>>>      stats += tmp;
>>>
>>>
>>>
>>>> If the tech board thinks MT-safe reset-able software-manage statistics
>>>> is the future (as opposed to dropping reset support, for example), I
>>>> think this stuff should go into a separate header file, so other PMDs
>>>> can reuse it. Maybe out of scope for this patch.
>>>>
>>>
>>> I don't think we need MT-safe reset, the patch is already out to
>>> document current status.
>>
>> Well, what you are working on is a MT-safe reset, in the sense it allows
>> for one (1) resetting thread properly synchronize with multiple
>> concurrent counter-updating threads.
>>
>> It's not going to be completely MT safe, since you can't have two
>> threads calling the reset function in parallel.
>>
> 
> This is what I meant with "MT-safe reset", so multiple threads not
> allowed to call stats reset in parallel.
> 
> And for multiple concurrent counter-updating threads case, suggestion is
> to stop forwarding.
> 
> Above two are the update I added to 'rte_eth_stats_reset()' API, and I
> believe we can continue with this restriction for the API.
> 
>> Any change to the API should make this clear.
>>
>>> For HW stats reset is already reliable and for SW drives offset based
>>> approach can make is reliable.
>>>
>>> Unless you explicitly asked for it, I don't think this is in the agenda
>>> of the techboard.
>>>
>>>
>
  
Ferruh Yigit May 2, 2024, 6:20 p.m. UTC | #8
On 5/2/2024 4:59 PM, Stephen Hemminger wrote:
> On Thu, 2 May 2024 15:22:35 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>>>
>>> It's not going to be completely MT safe, since you can't have two
>>> threads calling the reset function in parallel.
>>>   
>>
>> This is what I meant with "MT-safe reset", so multiple threads not
>> allowed to call stats reset in parallel.
>>
>> And for multiple concurrent counter-updating threads case, suggestion is
>> to stop forwarding.
>>
>> Above two are the update I added to 'rte_eth_stats_reset()' API, and
> 
> I can't see the point of full MT-safe reset. No other driver does it.
> The ethdev control side api's are not MT-safe and making them truly
> MT-safe would add additional ref count and locking.
>

Agree, that is why I clarified this in the ethdev API:
https://dpdk.org/patch/139681
  
Stephen Hemminger May 2, 2024, 6:26 p.m. UTC | #9
On Thu, 2 May 2024 19:37:28 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> > 
> > Do we need to declare count as '_Atomic', I wasn't planning to make
> > variable _Atomic. This way assignment won't introduce any memory barrier.
> >   
> 
> To use atomics in DPDK, the current requirements seems to be to use 
> RTE_ATOMIC(). That macro expands to _Atomic in enable_stdatomic=true 
> builds, and nothing otherwise.
> 
> Carefully crafted code using atomics will achieved the same performance 
> and be more correct than the non-atomic variant. However, in practice, I 
> think the non-atomic variant is very likely to produce the desired results.

You are confusing atomic usage for thread safety, with the necessity
of compiler barriers. 

Stats should not be volatile.
  
Mattias Rönnblom May 2, 2024, 9:26 p.m. UTC | #10
On 2024-05-02 20:26, Stephen Hemminger wrote:
> On Thu, 2 May 2024 19:37:28 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>>>
>>> Do we need to declare count as '_Atomic', I wasn't planning to make
>>> variable _Atomic. This way assignment won't introduce any memory barrier.
>>>    
>>
>> To use atomics in DPDK, the current requirements seems to be to use
>> RTE_ATOMIC(). That macro expands to _Atomic in enable_stdatomic=true
>> builds, and nothing otherwise.
>>
>> Carefully crafted code using atomics will achieved the same performance
>> and be more correct than the non-atomic variant. However, in practice, I
>> think the non-atomic variant is very likely to produce the desired results.
> 
> You are confusing atomic usage for thread safety, with the necessity
> of compiler barriers.
> 

Are you suggesting that program-level C11 atomic stores risk being 
delayed indefinitely? I could only find a draft version of the 
standard, but there 7.17.3 says "Implementations should make atomic 
stores visible to atomic loads within a reasonable amount of time."

An atomic relaxed store will be much cheaper than a compiler barrier.

> Stats should not be volatile.

Sure, and I also don't think compiler barriers should be needed.
  
Stephen Hemminger May 2, 2024, 9:46 p.m. UTC | #11
On Thu, 2 May 2024 23:26:31 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> > 
> > You are confusing atomic usage for thread safety, with the necessity
> > of compiler barriers.
> >   
> 
> Are you suggesting that program-level C11 atomic stores risk being 
> delayed, indefinitely? I could only find a draft version of the 
> standard, but there 7.17.3 says "Implementations should make atomic 
> stores visible to atomic loads within a reasonable amount of time."
> 
> An atomic relaxed store will be much cheaper than a compiler barrier.

There is a confusion between C11 language designers, system implementers,
and CPU designers. The language people confuse the compiler with the
hardware in standards. Because of course the compiler knows all (not).

Read the extended discussion on memory models in the Linux kernel
documentation.


https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-barriers.html
  
Mattias Rönnblom May 7, 2024, 7:23 a.m. UTC | #12
On 2024-04-28 17:11, Mattias Rönnblom wrote:
> On 2024-04-26 16:38, Ferruh Yigit wrote:
>> For stats reset, use an offset instead of zeroing out actual stats 
>> values,
>> get_stats() displays diff between stats and offset.
>> This way stats only updated in datapath and offset only updated in stats
>> reset function. This makes stats reset function more reliable.
>>
>> As stats only written by single thread, we can remove 'volatile' 
>> qualifier
>> which should improve the performance in datapath.
>>
> 
> volatile wouldn't help you if you had multiple writers, so that can't be 
> the reason for its removal. It would be more accurate to say it should 
> be replaced with atomic updates. If you don't use volatile and don't use 
> atomics, you have to consider if the compiler can reach the conclusion 
> that it does not need to store the counter value for future use *for 
> that thread*. Since otherwise, I don't think the store actually needs to 
> occur. Since DPDK statistics tend to work, it's pretty obvious that 
> current compilers tend not to reach this conclusion.
> 
> If this should be done 100% properly, the update operation should be a 
> non-atomic load, non-atomic add, and an atomic store. Similarly, for the 
> reset, the offset store should be atomic.
> 
> Considering the state of the rest of the DPDK code base, I think a 
> non-atomic, non-volatile solution is also fine.
> 
> (That said, I think we're better off just deprecating stats reset 
> altogether, and returning -ENOTSUP here meanwhile.)
> 
>> Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
>> ---
>> Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>
>> Cc: Morten Brørup <mb@smartsharesystems.com>
>>
>> This update triggered by mail list discussion [1].
>>
>> [1]
>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>
>> v2:
>> * Remove wrapping check for stats
>> ---
>>   drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>   1 file changed, 41 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c 
>> b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db5886..10c8e1e50139 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>       uint16_t in_port;
>>       uint8_t vlan_strip;
>> -    volatile unsigned long rx_pkts;
>> -    volatile unsigned long rx_bytes;
>> +    uint64_t rx_pkts;
>> +    uint64_t rx_bytes;
>> +    uint64_t rx_pkts_offset;
>> +    uint64_t rx_bytes_offset;
> 
> I suggest you introduce a separate struct for reset-able counters. It'll 
> make things cleaner, and you can sneak in atomics without too much 
> atomics-related bloat.
> 
> struct counter
> {
>      uint64_t count;
>      uint64_t offset;
> };
> 
> /../
>      struct counter rx_pkts;
>      struct counter rx_bytes;
> /../
> 
> static uint64_t
> counter_value(struct counter *counter)
> {
>      uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>      uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);
> 

Since the count and the offset are written to independently, without any 
ordering restrictions, an update and a reset in quick succession may 
cause the offset store to be globally visible before the new count. In 
such a scenario, a reader could see an offset > count.

Thus, unless I'm missing something, one should add a

if (unlikely(offset > count))
	return 0;

here. With the appropriate comment explaining why this might be.

Another approach would be to think about what memory barriers may be 
required to make sure one sees the count update before the offset 
update, but, intuitively, that seems like both more complex and more 
costly (performance-wise).

>      return count - offset;
> }
> 
> static void
> counter_reset(struct counter *counter)
> {
>      uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
> 
>      __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
> }
> 
> static void
> counter_add(struct counter *counter, uint64_t operand)
> {
>      __atomic_store_n(&counter->count, counter->count + operand, 
> __ATOMIC_RELAXED);
> }
> 
> You'd have to port this to <rte_stdatomic.h> calls, which prevents 
> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must be 
> replaced with explicit relaxed atomic loads. Otherwise, if you just 
> use "counter->count", that would be an atomic load with sequential 
> consistency memory order on C11 atomics-based builds, which would result 
> in a barrier, at least on weakly ordered machines (e.g., ARM).
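
A sketch of such a port, assuming the <rte_stdatomic.h> wrappers
(RTE_ATOMIC(), rte_atomic_load_explicit(), rte_atomic_store_explicit())
and folding in the offset > count guard discussed above; untested:

#include <rte_branch_prediction.h>
#include <rte_stdatomic.h>

struct counter {
	RTE_ATOMIC(uint64_t) count;
	RTE_ATOMIC(uint64_t) offset;
};

static uint64_t
counter_value(struct counter *counter)
{
	uint64_t count = rte_atomic_load_explicit(&counter->count,
						  rte_memory_order_relaxed);
	uint64_t offset = rte_atomic_load_explicit(&counter->offset,
						   rte_memory_order_relaxed);

	/* A racing reset may make the offset store visible before the
	 * count store it was derived from; report 0 rather than a huge
	 * bogus value. */
	if (unlikely(offset > count))
		return 0;

	return count - offset;
}

static void
counter_reset(struct counter *counter)
{
	uint64_t count = rte_atomic_load_explicit(&counter->count,
						  rte_memory_order_relaxed);

	rte_atomic_store_explicit(&counter->offset, count,
				  rte_memory_order_relaxed);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
	/* Single writer: relaxed load, add, relaxed store; no
	 * read-modify-write instruction needed. */
	uint64_t count = rte_atomic_load_explicit(&counter->count,
						  rte_memory_order_relaxed);

	rte_atomic_store_explicit(&counter->count, count + operand,
				  rte_memory_order_relaxed);
}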
> 
> I would still use a struct and some helper-functions even for the less 
> ambitious, non-atomic variant.
> 
> The only drawback of using GCC built-in-style atomics here, versus an 
> atomic- and volatile-free approach, is that current compilers seem to 
> refuse merging atomic stores. It's beyond me why this is the case. If 
> you store to a variable twice in quick succession, it'll be two store 
> machine instructions, even in cases where the compiler *knows* the value 
> is identical. So you get volatile behavior, even though you didn't ask 
> for it. Weird.
> 
> So if you have a loop, you may want to make a single "counter_add()" at 
> the end from a temporary, to get the final 0.001% of performance.
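
For instance (a sketch using the struct counter helpers above; the
function and parameter names are assumed, not from the patch):

#include <stdint.h>

#include <rte_mbuf.h>

static inline void
rx_stats_update(struct counter *rx_pkts, struct counter *rx_bytes,
		struct rte_mbuf **bufs, uint16_t num_rx)
{
	uint64_t bytes = 0;
	uint16_t i;

	/* Accumulate into a local so each counter is stored to once
	 * per burst instead of once per packet. */
	for (i = 0; i < num_rx; i++)
		bytes += bufs[i]->pkt_len;

	counter_add(rx_pkts, num_rx);
	counter_add(rx_bytes, bytes);
}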
> 
> If the tech board thinks MT-safe, reset-able, software-managed statistics 
> is the future (as opposed to dropping reset support, for example), I 
> think this stuff should go into a separate header file, so other PMDs 
> can reuse it. Maybe out of scope for this patch.
> 
>>   };
>>   struct pkt_tx_queue {
>> @@ -64,9 +66,12 @@ struct pkt_tx_queue {
>>       unsigned int framecount;
>>       unsigned int framenum;
>> -    volatile unsigned long tx_pkts;
>> -    volatile unsigned long err_pkts;
>> -    volatile unsigned long tx_bytes;
>> +    uint64_t tx_pkts;
>> +    uint64_t err_pkts;
>> +    uint64_t tx_bytes;
>> +    uint64_t tx_pkts_offset;
>> +    uint64_t err_pkts_offset;
>> +    uint64_t tx_bytes_offset;
>>   };
>>   struct pmd_internals {
>> @@ -385,8 +390,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct 
>> rte_eth_dev_info *dev_info)
>>       return 0;
>>   }
>> +
>> +static uint64_t
>> +stats_get_diff(uint64_t stats, uint64_t offset)
>> +{
>> +    return stats - offset;
>> +}
>> +
>>   static int
>> -eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
>> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>>   {
>>       unsigned i, imax;
>>       unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
>> @@ -396,27 +408,29 @@ eth_stats_get(struct rte_eth_dev *dev, struct 
>> rte_eth_stats *igb_stats)
>>       imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>               internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>       for (i = 0; i < imax; i++) {
>> -        igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
>> -        igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
>> -        rx_total += igb_stats->q_ipackets[i];
>> -        rx_bytes_total += igb_stats->q_ibytes[i];
>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>> +        stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts, 
>> rxq->rx_pkts_offset);
>> +        stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes, 
>> rxq->rx_bytes_offset);
>> +        rx_total += stats->q_ipackets[i];
>> +        rx_bytes_total += stats->q_ibytes[i];
>>       }
>>       imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>               internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>       for (i = 0; i < imax; i++) {
>> -        igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
>> -        igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
>> -        tx_total += igb_stats->q_opackets[i];
>> -        tx_err_total += internal->tx_queue[i].err_pkts;
>> -        tx_bytes_total += igb_stats->q_obytes[i];
>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>> +        stats->q_opackets[i] = stats_get_diff(txq->tx_pkts, 
>> txq->tx_pkts_offset);
>> +        stats->q_obytes[i] = stats_get_diff(txq->tx_bytes, 
>> txq->tx_bytes_offset);
>> +        tx_total += stats->q_opackets[i];
>> +        tx_err_total += stats_get_diff(txq->err_pkts, 
>> txq->err_pkts_offset);
>> +        tx_bytes_total += stats->q_obytes[i];
>>       }
>> -    igb_stats->ipackets = rx_total;
>> -    igb_stats->ibytes = rx_bytes_total;
>> -    igb_stats->opackets = tx_total;
>> -    igb_stats->oerrors = tx_err_total;
>> -    igb_stats->obytes = tx_bytes_total;
>> +    stats->ipackets = rx_total;
>> +    stats->ibytes = rx_bytes_total;
>> +    stats->opackets = tx_total;
>> +    stats->oerrors = tx_err_total;
>> +    stats->obytes = tx_bytes_total;
>>       return 0;
>>   }
>> @@ -427,14 +441,16 @@ eth_stats_reset(struct rte_eth_dev *dev)
>>       struct pmd_internals *internal = dev->data->dev_private;
>>       for (i = 0; i < internal->nb_queues; i++) {
>> -        internal->rx_queue[i].rx_pkts = 0;
>> -        internal->rx_queue[i].rx_bytes = 0;
>> +        struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>> +        rxq->rx_pkts_offset = rxq->rx_pkts;
>> +        rxq->rx_bytes_offset = rxq->rx_bytes;
>>       }
>>       for (i = 0; i < internal->nb_queues; i++) {
>> -        internal->tx_queue[i].tx_pkts = 0;
>> -        internal->tx_queue[i].err_pkts = 0;
>> -        internal->tx_queue[i].tx_bytes = 0;
>> +        struct pkt_tx_queue *txq = &internal->tx_queue[i];
>> +        txq->tx_pkts_offset = txq->tx_pkts;
>> +        txq->err_pkts_offset = txq->err_pkts;
>> +        txq->tx_bytes_offset = txq->tx_bytes;
>>       }
>>       return 0;
  

Patch

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db5886..10c8e1e50139 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -51,8 +51,10 @@  struct pkt_rx_queue {
 	uint16_t in_port;
 	uint8_t vlan_strip;
 
-	volatile unsigned long rx_pkts;
-	volatile unsigned long rx_bytes;
+	uint64_t rx_pkts;
+	uint64_t rx_bytes;
+	uint64_t rx_pkts_offset;
+	uint64_t rx_bytes_offset;
 };
 
 struct pkt_tx_queue {
@@ -64,9 +66,12 @@  struct pkt_tx_queue {
 	unsigned int framecount;
 	unsigned int framenum;
 
-	volatile unsigned long tx_pkts;
-	volatile unsigned long err_pkts;
-	volatile unsigned long tx_bytes;
+	uint64_t tx_pkts;
+	uint64_t err_pkts;
+	uint64_t tx_bytes;
+	uint64_t tx_pkts_offset;
+	uint64_t err_pkts_offset;
+	uint64_t tx_bytes_offset;
 };
 
 struct pmd_internals {
@@ -385,8 +390,15 @@  eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	return 0;
 }
 
+
+static uint64_t
+stats_get_diff(uint64_t stats, uint64_t offset)
+{
+	return stats - offset;
+}
+
 static int
-eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 {
 	unsigned i, imax;
 	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
@@ -396,27 +408,29 @@  eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
 	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
 	for (i = 0; i < imax; i++) {
-		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
-		igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
-		rx_total += igb_stats->q_ipackets[i];
-		rx_bytes_total += igb_stats->q_ibytes[i];
+		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
+		stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts, rxq->rx_pkts_offset);
+		stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes, rxq->rx_bytes_offset);
+		rx_total += stats->q_ipackets[i];
+		rx_bytes_total += stats->q_ibytes[i];
 	}
 
 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
 	        internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
 	for (i = 0; i < imax; i++) {
-		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
-		igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
-		tx_total += igb_stats->q_opackets[i];
-		tx_err_total += internal->tx_queue[i].err_pkts;
-		tx_bytes_total += igb_stats->q_obytes[i];
+		struct pkt_tx_queue *txq = &internal->tx_queue[i];
+		stats->q_opackets[i] = stats_get_diff(txq->tx_pkts, txq->tx_pkts_offset);
+		stats->q_obytes[i] = stats_get_diff(txq->tx_bytes, txq->tx_bytes_offset);
+		tx_total += stats->q_opackets[i];
+		tx_err_total += stats_get_diff(txq->err_pkts, txq->err_pkts_offset);
+		tx_bytes_total += stats->q_obytes[i];
 	}
 
-	igb_stats->ipackets = rx_total;
-	igb_stats->ibytes = rx_bytes_total;
-	igb_stats->opackets = tx_total;
-	igb_stats->oerrors = tx_err_total;
-	igb_stats->obytes = tx_bytes_total;
+	stats->ipackets = rx_total;
+	stats->ibytes = rx_bytes_total;
+	stats->opackets = tx_total;
+	stats->oerrors = tx_err_total;
+	stats->obytes = tx_bytes_total;
 	return 0;
 }
 
@@ -427,14 +441,16 @@  eth_stats_reset(struct rte_eth_dev *dev)
 	struct pmd_internals *internal = dev->data->dev_private;
 
 	for (i = 0; i < internal->nb_queues; i++) {
-		internal->rx_queue[i].rx_pkts = 0;
-		internal->rx_queue[i].rx_bytes = 0;
+		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
+		rxq->rx_pkts_offset = rxq->rx_pkts;
+		rxq->rx_bytes_offset = rxq->rx_bytes;
 	}
 
 	for (i = 0; i < internal->nb_queues; i++) {
-		internal->tx_queue[i].tx_pkts = 0;
-		internal->tx_queue[i].err_pkts = 0;
-		internal->tx_queue[i].tx_bytes = 0;
+		struct pkt_tx_queue *txq = &internal->tx_queue[i];
+		txq->tx_pkts_offset = txq->tx_pkts;
+		txq->err_pkts_offset = txq->err_pkts;
+		txq->tx_bytes_offset = txq->tx_bytes;
 	}
 
 	return 0;