[v2] net/af_packet: fix ignoring full ring on tx

Message ID 1631540746-38443-1-git-send-email-tudor.cornea@gmail.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series [v2] net/af_packet: fix ignoring full ring on tx

Checks

Context                          Check    Description
ci/checkpatch                    success  coding style OK
ci/github-robot: build           success  github build: passed
ci/iol-x86_64-unit-testing       success  Testing PASS
ci/iol-x86_64-compile-testing    success  Testing PASS
ci/iol-mellanox-Performance      success  Performance Testing PASS
ci/Intel-compilation             success  Compilation OK
ci/intel-Testing                 fail     Testing issues
ci/iol-aarch64-compile-testing   success  Testing PASS
ci/iol-intel-Performance         success  Performance Testing PASS

Commit Message

Tudor Cornea Sept. 13, 2021, 1:45 p.m. UTC
The poll() call can return POLLERR, which is currently ignored, or it can
return POLLOUT even if there are no free frames in the mmap-ed area.

We can account for both of these cases by re-checking if the next
frame is empty before writing into it.

Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
  

Comments

Ferruh Yigit Sept. 20, 2021, 5:44 p.m. UTC | #1
On 9/13/2021 2:45 PM, Tudor Cornea wrote:
> The poll call can return POLLERR which is ignored, or it can return
> POLLOUT, even if there are no free frames in the mmap-ed area.
> 
> We can account for both of these cases by re-checking if the next
> frame is empty before writing into it.
> 
> Signed-off-by: Mihai Pogonaru <pogonarumihai@gmail.com>
> Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
> ---
>  drivers/net/af_packet/rte_eth_af_packet.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index b73b211..087c196 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -216,6 +216,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>  		    (poll(&pfd, 1, -1) < 0))
>  			break;
>  
> +		/*
> +		 * Poll can return POLLERR if the interface is down
> +		 *
> +		 * It will almost always return POLLOUT, even if there
> +		 * are no extra buffers available
> +		 *
> +		 * This happens, because packet_poll() calls datagram_poll()
> +		 * which checks the space left in the socket buffer and,
> +		 * in the case of packet_mmap, the default socket buffer length
> +		 * doesn't match the requested size for the tx_ring.
> +		 * As such, there is almost always space left in socket buffer,
> +		 * which doesn't seem to be correlated to the requested size
> +		 * for the tx_ring in packet_mmap.
> +		 *
> +		 * This results in poll() returning POLLOUT.
> +		 */
> +		if (ppd->tp_status != TP_STATUS_AVAILABLE)
> +			break;
> +

If 'POLLOUT' doesn't indicate that there is space in the buffer, what is the
point of the 'poll()' at all?

How can we test/reproduce the mentioned behavior? Or is there a way to fix the
behavior of poll() or use an alternative to it?


OK to break on 'POLLERR'; I guess it can be detected in 'pfd.revents'.


>  		/* copy the tx frame data */
>  		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
>  			sizeof(struct sockaddr_ll);
>
  
Tudor Cornea Sept. 29, 2021, 10:03 a.m. UTC | #2
Hi Ferruh,

> What you described above looks like a ring buffer with single producer and
> single consumer, and producer overwrites the not consumed items.


Indeed. This is also my understanding of the bug.
I am going to try to isolate the issue, and should probably be able to come
up with a script in a few days.

> Out of curiosity, are you using a modified af_packet implementation in
> kernel
> for above described usage?


We are currently using an Ubuntu-based distro with a 4.15 Linux kernel.
To my knowledge, we don't have any kernel patches for the af_packet
implementation (except possibly patches back-ported by Ubuntu
maintainers from newer releases).


  
Tudor Cornea Oct. 5, 2021, 3:11 p.m. UTC | #3
Hi Ferruh,

I have attempted to narrow down the issue.
I have the following bash script, which computes packet rates on an
interface.

[root@localhost ~]# cat compute-rates.sh
#!/usr/bin/env bash

if [[ ${#} -ne 2 ]]; then
    echo "Usage: ${0} <iface-name> <sleep-interval-seconds>"
    exit 1
fi

IFACE_NAME="${1}"
SLEEP_INTERVAL_SECONDS="${2}"
TMP_STATS_FILE="/tmp/netstat"

# Clear Previous stats file
echo "0 0 0 0" > "${TMP_STATS_FILE}"

echo "Press CTRL+C to exit..."

while true; do
    export "RxB=0" "RxP=0" "TxB=0" "TxP=0"

    # Extract Rx{Bytes,Packets} and Tx{Bytes,Packets} and
    # format the output. Individual fields will be exported
    export $(\
        ifconfig "${IFACE_NAME}" \
            | grep 'packets' \
            | awk '{print $5, $3}' \
            | xargs echo \
            | sed -E -e \
                "s/([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)/RxB=\1 RxP=\2 TxB=\3 TxP=\4/")

    # Print Packet and Byte Rates
    # Format: | Rx Bytes | Rx Packets | Tx Bytes | Tx Packets |

    echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" $(cat "${TMP_STATS_FILE}") \
        | awk '{print "RxB="$1-$5, "RxP="$2-$6, "TxB="$3-$7, "TxP="$4-$8}'

    # Save the new values
    echo "${RxB}" "${RxP}" "${TxB}" "${TxP}" > "${TMP_STATS_FILE}"

    sleep "${SLEEP_INTERVAL_SECONDS}"

done
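
For reference, the same counters can also be sampled from C without parsing
ifconfig output. The block below is only a minimal sketch, not part of the
patch or of the test setup described in this mail: it assumes the standard
Linux sysfs layout /sys/class/net/<iface>/statistics/<counter>, and the
helper names (stat_path, read_counter, counter_delta) are hypothetical.

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Hypothetical helper: build the sysfs path of one interface counter,
 * e.g. iface "eth1" and name "tx_packets". Returns 0 on success, -1 if
 * the buffer is too small. */
int stat_path(char *buf, size_t len, const char *iface, const char *name)
{
    int n = snprintf(buf, len, "/sys/class/net/%s/statistics/%s", iface, name);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read a single unsigned decimal counter from a text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_counter(const char *path, uint64_t *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%" SCNu64, value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Per-interval delta between two samples of an increasing counter. */
uint64_t counter_delta(uint64_t prev, uint64_t cur)
{
    return cur - prev;
}
```

A caller would read tx_packets and tx_bytes once per interval and print
counter_delta() for each, which is what the script does with its saved
stats file.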

On the transmit side, I'm using the engine behind [1] with the af_packet
PMD.

The configuration for the af_packet PMD is the following:
--vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0

I'm configuring a Tx rate of 335 packets / second and a packet size of 300
Bytes.
These are the values with which we seem to have the best chance of
seeing the problem. I suspect it might also be linked to the af_packet
configuration.

I'm starting traffic using the specified configuration, and in parallel,
running the script that computes the rates as follows:
./compute-rates.sh eth1 0.1

Initially, the packet rates seem steady

RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10656 TxP=36
RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10952 TxP=37
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10952 TxP=37

[...]

After a while, we toggle the interface up / down with a sleep between the
steps. I suspect the length of the sleep might be a variable in the
equation.

ifconfig eth1 down; sleep 7; ifconfig eth1 up


What we see is that, even after the interface is toggled back up, the rates
never seem to recover.

RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=2072 TxP=7
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=10360 TxP=35
RxB=0 RxP=0 TxB=521256 TxP=1761
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0

[...]


I've attempted to mirror the same behavior using dpdk-pktgen [2] on a
different machine (Ubuntu 20.04). This time, af_packet runs on top of
a Linux virtio_net interface.

I seem to be getting similar behavior. I used the following
dpdk-pktgen configuration and run-time settings:


pktgen \
    -l 1-4 \
    -n 4 \
    --proc-type=primary \
    --no-pci \
    --no-telemetry \
    --no-huge \
    -m 512 \
    --vdev=net_af_packet0,iface=eth1,blocksz=16384,framesz=8192,framecnt=2048,qpairs=1,qdisc_bypass=0 \
    -- \
    -P \
    -T \
    -m "3.0" \
    -f themes/black-yellow.theme

set 0 size 300
set 0 rate 0.008
set 0 burst 1
start 0


[1] https://github.com/open-traffic-generator/ixia-c
[2] http://code.dpdk.org/pktgen-dpdk/pktgen-20.11.2/source/INSTALL.md

  
Ferruh Yigit Oct. 26, 2021, 2:30 p.m. UTC | #4
Hi Tudor,

I have used testpmd with 'txonly' forwarding. Tx recovers after the interface
comes back up, but by adding some debug logs I can see that 'poll()' returns
with POLLOUT even when there is no space in the buffer.

According to the logic in the PMD, when 'poll()' returns successfully, the PMD
expects to have some space in the Tx buffer.

So I agree with adding the check.

I only have a question about POLLERR: should we separate the POLLERR check
to cover the ifdown case? What do you think about the following logic:

if (!TP_STATUS_AVAILABLE) {
     if (poll() < 0)
         break;
     if (pfd.revents & POLLERR)
         break;
}

if (!TP_STATUS_AVAILABLE)
     break;
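
Spelled out as compilable C, the logic above could look roughly like the
sketch below. This is an illustration only, not the code that went into the
driver: tx_frame_writable() is a hypothetical helper, the timeout parameter
is an addition for testability (the driver calls poll() with -1), and the
status word is passed by pointer instead of through ppd->tp_status.

```c
#include <poll.h>
#include <stdint.h>
#include <linux/if_packet.h>   /* TP_STATUS_AVAILABLE, TP_STATUS_SEND_REQUEST */

/*
 * Hypothetical helper (not a DPDK function): decide whether the TX frame
 * whose status word is *tp_status may be written.
 * Returns 1 if the frame is free, 0 if the caller should stop the burst.
 */
int tx_frame_writable(const volatile uint32_t *tp_status, struct pollfd *pfd,
                      int timeout_ms)
{
    if (*tp_status != TP_STATUS_AVAILABLE) {
        /* Frame is still owned by the kernel: wait for a socket event. */
        if (poll(pfd, 1, timeout_ms) < 0)
            return 0;
        /* Per the patch comment, POLLERR can be reported when the
         * interface is down. */
        if (pfd->revents & POLLERR)
            return 0;
    }

    /* POLLOUT alone does not prove the frame is free, so re-check it. */
    if (*tp_status != TP_STATUS_AVAILABLE)
        return 0;

    return 1;
}
```

In the Tx loop, a return value of 0 corresponds to the 'break' in both
branches of the pseudocode. The helper can be exercised without an AF_PACKET
socket by polling the write end of a pipe, which reports POLLOUT while the
status word still says the frame is busy.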



  
Tudor Cornea Nov. 2, 2021, 3:24 p.m. UTC | #5
On Tue, 26 Oct 2021 at 17:41, Ferruh Yigit <ferruh.yigit@intel.com> wrote:

> Hi Tudor,
>
> I have used testpmd with 'txonly' forwarding. Tx recovers after the interface
> comes back up, but by adding some debug logs I can see that 'poll()' returns
> with POLLOUT even when there is no space in the buffer.
>
> According to the logic in the PMD, when 'poll()' returns successfully, the PMD
> expects to have some space in the Tx buffer.
>
> So I agree with adding the check.
>
> I only have a question about POLLERR: should we separate the POLLERR check
> to cover the ifdown case? What do you think about the following logic:
>
> if (!TP_STATUS_AVAILABLE) {
>      if (poll() < 0)
>          break;
>      if (pfd.revents & POLLERR)
>          break;
> }
>
> if (!TP_STATUS_AVAILABLE)
>      break;
>
>
Hi Ferruh,

Thanks for the suggestion.
I was thinking of adding this check and, intuitively, it seems correct. I
did some further testing.

With the POLLERR check in place, I don't see the issue anymore.
However, if I remove the second if (!TP_STATUS_AVAILABLE), I seem to get
bursts of 2048 packets followed by periods of not sending anything when
toggling the interface link state.
Without the second if (!TP_STATUS_AVAILABLE) statement, the issue seems to
reproduce regardless of whether I add the check for POLLERR.

RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=606208 TxP=2048
RxB=0 RxP=0 TxB=0 TxP=0
RxB=0 RxP=0 TxB=0 TxP=0
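
One way to read the sample lines above, offered as an observation rather than
a confirmed diagnosis: the burst of 2048 packets equals framecnt=2048 from the
vdev arguments used earlier in this thread, which would correspond to one full
Tx ring being flushed at once. The sketch below just checks that arithmetic;
the function names are hypothetical.

```c
#include <stdint.h>

/* Average bytes per packet for one sample line; 0 if no packets were sent. */
uint64_t bytes_per_packet(uint64_t tx_bytes, uint64_t tx_packets)
{
    return tx_packets ? tx_bytes / tx_packets : 0;
}

/* Non-zero if a sample's packet count equals the configured ring depth. */
int is_full_ring_burst(uint64_t tx_packets, uint64_t framecnt)
{
    return tx_packets == framecnt;
}
```

With the values from this thread, 606208 / 2048 and 10360 / 35 both give 296
bytes per packet, so the burst consists of the same packets as the steady
traffic, only delivered in one ring-sized batch.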

I will send an updated version of the patch.

Thanks,
Tudor
  

Patch

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index b73b211..087c196 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -216,6 +216,25 @@  eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		    (poll(&pfd, 1, -1) < 0))
 			break;
 
+		/*
+		 * Poll can return POLLERR if the interface is down
+		 *
+		 * It will almost always return POLLOUT, even if there
+		 * are no extra buffers available
+		 *
+		 * This happens, because packet_poll() calls datagram_poll()
+		 * which checks the space left in the socket buffer and,
+		 * in the case of packet_mmap, the default socket buffer length
+		 * doesn't match the requested size for the tx_ring.
+		 * As such, there is almost always space left in socket buffer,
+		 * which doesn't seem to be correlated to the requested size
+		 * for the tx_ring in packet_mmap.
+		 *
+		 * This results in poll() returning POLLOUT.
+		 */
+		if (ppd->tp_status != TP_STATUS_AVAILABLE)
+			break;
+
 		/* copy the tx frame data */
 		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
 			sizeof(struct sockaddr_ll);
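
The comment in the patch attributes the spurious POLLOUT to the socket send
buffer being sized independently of the tx_ring. A rough way to see the size
mismatch is sketched below. It is an illustration under assumptions: an
AF_UNIX datagram socket stands in for the AF_PACKET socket (which needs
CAP_NET_RAW) on the assumption that both start from the same system-wide
default send-buffer size on Linux, the ring size is taken as framesz *
framecnt from the vdev arguments in the thread, and both function names are
hypothetical.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bytes covered by the Tx ring for the given frame size and frame count. */
size_t tx_ring_bytes(size_t framesz, size_t framecnt)
{
    return framesz * framecnt;
}

/* Send-buffer size reported for a fresh datagram socket, or -1 on error. */
int default_sndbuf(void)
{
    int sndbuf = -1;
    socklen_t len = sizeof(sndbuf);
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0)
        sndbuf = -1;
    close(fd);
    return sndbuf;
}
```

For framesz=8192 and framecnt=2048 the ring covers 16777216 bytes; comparing
that with the value returned by default_sndbuf() on the test machine shows
how far apart the two sizes are.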