[dpdk-dev,v8,10/10] app/testpmd:test VxLAN Tx checksum offload

Message ID 1414376006-31402-11-git-send-email-jijiang.liu@intel.com (mailing list archive)
State Accepted, archived
Headers

Commit Message

Jijiang Liu Oct. 27, 2014, 2:13 a.m. UTC
  Add test cases in testpmd to test VxLAN Tx Checksum offload, which include
 - IPv4 and IPv6 packet
 - outer L3, inner L3 and L4 checksum offload for Tx side.

Signed-off-by: Jijiang Liu <jijiang.liu@intel.com>
---
 app/test-pmd/cmdline.c  |   13 ++-
 app/test-pmd/config.c   |    6 +-
 app/test-pmd/csumonly.c |  194 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 192 insertions(+), 21 deletions(-)
  

Comments

Olivier Matz Nov. 4, 2014, 8:19 a.m. UTC | #1
Hello Jijiang,

On 10/27/2014 03:13 AM, Jijiang Liu wrote:
> Add test cases in testpmd to test VxLAN Tx Checksum offload, which include
>   - IPv4 and IPv6 packet
>   - outer L3, inner L3 and L4 checksum offload for Tx side.
>
> Signed-off-by: Jijiang Liu <jijiang.liu@intel.com>

I'm trying to port the test of TSO in csum_only.c which was originally
part of this patch [1].

Meanwhile, the source was modified by the patch provided by the email
I'm replying to (also available at [2]), and I would like to understand
what is the purpose of it.

First, the code checks if the mbuf has the flag PKT_RX_TUNNEL_IPV4_HDR.
What is the meaning of this flag? It was added by [3], but there is no
description in comments or in the commit log explaining in which case
this flag is set by the driver. The name supposes that this flag is
set when the received packet is an IPv4 tunnel, but the commit log talks
about vxlan.

Then, if this flag was present, the code assumes it's a vxlan packet.
If one of inner checksum is specified by the user in cmdline, the flag
PKT_TX_VXLAN_CKSUM is added to the mbuf. This flag definition was added
by [4] (at the wrong place, I'll fix it in my patchset). What is the
meaning of this flag? Is it enough to checksum outer L3, inner L3, and
inner L4 as specified in commit log? If yes, why are the other flags
PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, (...) added in the mbuf later?
In my comprehension, these flags are needed in addition to
PKT_TX_VXLAN_CKSUM to do the checksum of the inner headers.

In general, I need to understand how to use all the new offloading
stuff. In [5], new fields were added in the mbuf structure
(inner_l3_len and inner_l2_len). I'm not sure I understand which fields
have to be filled. Below is my understanding, can you please check that
it is correct?

A- To send a Ether/IP/TCP packet and process IP and TCP TX
    checksum in the NIC:

   - set l2_len = 14, l3_len = 20,
     ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
     write all checksums to 0 in the packet

B- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
    process inner IP and inner TCP TX checksum in the NIC:

   - set l2_len = 14+20+8+8+14, l3_len = 20,
     ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
     write all checksums to 0 in the packet

C- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
    process outer IP and UDP, plus inner IP and inner TCP TX
    checksum in the NIC:

   - set l2_len = 14, l3_len = 20, inner_l2_len = 14,
     inner_l3_len = 20,
     ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_VXLAN_CKSUM,
     write all checksums to 0 in the packet

D- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
    process outer IP and UDP TX checksum in the NIC:

   - set l2_len = 14, l3_len = 20,
     ol_flags = PKT_TX_IP_CKSUM | PKT_TX_UDP_CKSUM,
     write all checksums to 0 in the packet

First, can you confirm this? I think it is very important to
document it, as this is a public API that can be used by DPDK users.

To validate my modifications, I will try to reproduce the test plan
described in [6]. The test report contains useful information but
I'm not sure to understand the following:

    Enable outer IP,UDP,TCP,SCTP and inner IP,UDP checksum offload
    when inner L4 protocal is UDP.
       testpmd>tx_checksum set 0 0xf3
    Enable outer IP,UDP,TCP,SCTP and inner IP,TCP,SCTP checksum offload
    when inner L4 protocal is TCP or SCTP.
       testpmd>tx_checksum set 0 0xfd

Can you give details about the signification of the bits used in the
tx_checksum command? For instance, there is only one flag to enable
tx checksum in the mbuf, so I don't understand how it's possible to
do 0x44 = (inner tcp + outer tcp).

Today, there are hard-coded values for flags: 4 bits (0-3) for legacy
checksums, 4 bits for inner checksums (4-7), and one bit (bit 11?!)
for vlan header. These bits are checked without defines at several
places in the code. Moreover, there are some places in code where
the testpmd port flags and the mbuf flags are mixed up. Example in
macswap.c : mb->ol_flags = txp->tx_ol_flags;

I suggest to remove all hardcoded values except in
cmd_tx_cksum_set_parsed() and replace them by the flags definitions
from rte_mbuf.h. As a result, the new possible arguments of
cmd_tx_cksum_set_parsed() will map the mbuf flags:
   - a flag to enable ip tx cksum
   - a flag to enable udp tx cksum
   - a flag to enable tcp tx cksum
   - a flag to enable sctp tx cksum
   - a flag to enable vxlan tx cksum
If vxlan tx cksum is set, therefore the other considered checksum
will concern the inner headers. Do you think it matches your needs?

I think that the csum_only forward engine is now a bit complicated.
As it's an example that shows how to configure the tx checksum, I
think its behavior should be described somewhere, maybe in a comment in 
the file.

By the way, I was looking at the mbuf structure, and I see that a new
packet_type field was added. There is no explanation about what this
field should contain and how we shall use it (in commit log or in
comments). Can you please explain it?

To conclude, I'd like to propose a merge of the TSO series but
I'm currently blocked by these questions. Please, could you
help me to progress on this?

Regards,
Olivier


[1] http://dpdk.org/ml/archives/dev/2014-May/002549.html
[2] http://dpdk.org/ml/archives/dev/2014-October/007156.html
[3] 
http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=15dbb63ef9e9f108e7dcd837b88234f27a1ec258
[4] 
http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=77b8301733c3cd946648ca4a1375e3db0a952216
[5] 
http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=73b7d59cf4f6faf5ccd83ce706473e75c6fb8c3b
[6] http://dpdk.org/ml/archives/dev/2014-October/007157.html
  
Jijiang Liu Nov. 5, 2014, 6:02 a.m. UTC | #2
Hi Olivier,


> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, November 4, 2014 4:19 PM
> To: Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hello Jijiang,
> 
> On 10/27/2014 03:13 AM, Jijiang Liu wrote:
> > Add test cases in testpmd to test VxLAN Tx Checksum offload, which include
> >   - IPv4 and IPv6 packet
> >   - outer L3, inner L3 and L4 checksum offload for Tx side.
> >
> > Signed-off-by: Jijiang Liu <jijiang.liu@intel.com>
> 
> I'm trying to port the test of TSO in csum_only.c which was originally part of this
> patch [1].
> 
> Meanwhile, the source was modified by the patch provided by the email I'm
> replying to (also available at [2]), and I would like to understand what is the
> purpose of it.
> 
> First, the code checks if the mbuf has the flag PKT_RX_TUNNEL_IPV4_HDR.
> What is the meaning of this flag? It was added by [3], but there is no description
> in comments or in the commit log explaining in which case this flag is set by the
> driver. The name supposes that this flag is set when the received packet is an IPv4
> tunnel, but the commit log talks about vxlan.

The flag PKT_RX_TUNNEL_IPV4_HDR can be used for all tunneling packet types with outer IPV4 header.
For example:
IPv4 --> GRE/Teredo/VXLAN --> MAC --> IPv4:
MAC, IPV4, GRENAT, MAC, IPV4, SCTP, PAY4 
MAC, IPV4, GRENAT, MAC, IPV6, UDP, PAY4
MAC, IPV4, GRENAT, MAC, IPV6, UDP, PAY4
These tunneling packet formats have a common point that is outer IPv4 header here.

Only VXLAN tunneling packet is supported in DPDK for i40e now, so  the commit log talks about VXLAN .
 

> Then, if this flag was present, the code assumes it's a vxlan packet.
> If one of inner checksum is specified by the user in cmdline, the flag
> PKT_TX_VXLAN_CKSUM is added to the mbuf. This flag definition was added by
> [4] (at the wrong place, I'll fix it in my patchset). 

> What is the meaning of this flag?
> Is it enough to checksum outer L3, inner L3, and inner L4 as specified in commit
> log? If yes, why are the other flags PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM,
> (...) added in the mbuf later?
> In my comprehension, these flags are needed in addition to
> PKT_TX_VXLAN_CKSUM to do the checksum of the inner headers.

Yes, these flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM)  are needed by HW offload of non-tunneling and tunneling  packet.

> In general, I need to understand how to use all the new offloading stuff. In [5],
> new fields were added in the mbuf structure (inner_l3_len and inner_l2_len). I'm
> not sure I understand which fields have to be filled. Below is my understanding,
> can you please check that it is correct?
> 
> A- To send a Ether/IP/TCP packet and process IP and TCP TX
>     checksum in the NIC:
> 
>    - set l2_len = 14, l3_len = 20,
>      ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
>      write all checksums to 0 in the packet
   IP checksum is 0, but tcp checksum is not 0.
   tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);

> B- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>     process inner IP and inner TCP TX checksum in the NIC:
> 
>    - set l2_len = 14+20+8+8+14, l3_len = 20,

No, l2_len is outer L2 length, its value also is 14. 
>      ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
>      write all checksums to 0 in the packet

   IP checksum is 0, but tcp checksum is not 0.
   tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);

> C- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>     process outer IP and UDP, plus inner IP and inner TCP TX
>     checksum in the NIC:
> 
>    - set l2_len = 14, l3_len = 20, inner_l2_len = 14,
>      inner_l3_len = 20,
Yes
>      ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM |
> PKT_TX_VXLAN_CKSUM,
Yes
>      write all checksums to 0 in the packet

   Outer and inner IP checksum is 0, but tcp checksum is not 0.
   tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);


> D- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>     process outer IP and UDP TX checksum in the NIC:
> 
>    - set l2_len = 14, l3_len = 20,
>      ol_flags = PKT_TX_IP_CKSUM | PKT_TX_UDP_CKSUM,

Yes
>      write all checksums to 0 in the packet
Outer  IP checksum is 0, but tcp checksum is not 0.
   tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);


> First, can you confirm this? I think it is very important to document it, as this is a
> public API that can be used by DPDK users.

These should be included in documents.

> To validate my modifications, I will try to reproduce the test plan described in [6].
> The test report contains useful information but I'm not sure to understand the
> following:
> 
>     Enable outer IP,UDP,TCP,SCTP and inner IP,UDP checksum offload
>     when inner L4 protocal is UDP.
>        testpmd>tx_checksum set 0 0xf3

"tx_checksum set 0 0xf3" should be "tx_checksum set 0 0x33", but the tester use 0xFX (that is to say, enable all inner TX flag)in order to write test script easily.

>     Enable outer IP,UDP,TCP,SCTP and inner IP,TCP,SCTP checksum offload
>     when inner L4 protocal is TCP or SCTP.
>        testpmd>tx_checksum set 0 0xfd
> 
> Can you give details about the signification of the bits used in the tx_checksum
> command? For instance, there is only one flag to enable tx checksum in the mbuf,
> so I don't understand how it's possible to do 0x44 = (inner tcp + outer tcp).

These bits meanings/help information  were provided in the patch
http://dpdk.org/ml/archives/dev/2014-October/007156.html 


> Today, there are hard-coded values for flags: 4 bits (0-3) for legacy checksums, 4
> bits for inner checksums (4-7), and one bit (bit 11?!) for vlan header. These bits
> are checked without defines at several places in the code. Moreover, there are
> some places in code where the testpmd port flags and the mbuf flags are mixed

> up. Example in macswap.c : mb->ol_flags = txp->tx_ol_flags;
I thought this is an issue.

> I suggest to remove all hardcoded values except in
> cmd_tx_cksum_set_parsed() and replace them by the flags definitions from
> rte_mbuf.h. As a result, the new possible arguments of
> cmd_tx_cksum_set_parsed() will map the mbuf flags:
>    - a flag to enable ip tx cksum
>    - a flag to enable udp tx cksum
>    - a flag to enable tcp tx cksum
>    - a flag to enable sctp tx cksum
>    - a flag to enable vxlan tx cksum
> If vxlan tx cksum is set, therefore the other considered checksum will concern the
> inner headers. Do you think it matches your needs?

As to HW TX checksum offload, do you have special requirement for implementing TSO?

I suggest to keep on using bit to indicate different protocol flag. this approach has good flexibility and extensibility.

For example, a VXLAN packet format: MAC/IP/UDP/VXLAN/inner MAC/inner IP/inner TCP/PAY4.

If the ip tx cksum and vxlan tx cksum  are set, but tcp tx cksum is not set,  and HW offload  of inner TCP TX checksum won't work.

In addition, sometimes we want to get the following performance data using testpmd
1. only enable outer IP TX checksum  
2. only enable inner IP TX checksum
3. only enable inner TCP/UDP/SCTP checksum
4. only enable inner IP TX checksum and inner TCP/UDP/SCTP checksum.

Now we have provided inner BIT masks, we can use these combinations to test it easily.

Actually, we just need to know that if  the PKT_TX_IPV4_CSUM, PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM, PKT_TX_UDP_CKSUM and PKT_TX_VXLAN_CKSUM are set in ol_flags in the i40e driver.
Could you please look at the function i40e_txd_enable_checksum() .

> I think that the csum_only forward engine is now a bit complicated.
> As it's an example that shows how to configure the tx checksum, I think its
> behavior should be described somewhere, maybe in a comment in the file.
I agree.

> By the way, I was looking at the mbuf structure, and I see that a new packet_type
> field was added. There is no explanation about what this field should contain and
> how we shall use it (in commit log or in comments). Can you please explain it?

+	/**
+	 * The packet type, which is used to indicate ordinary packet and also
+	 * tunneled packet format, i.e. each number is represented a type of
+	 * packet.
+	 */
+	uint16_t packet_type
Regarding adding a packet_type in mbuf, we ever had a lot of discussions as follows:

http://dpdk.org/ml/archives/dev/2014-October/007027.html
http://dpdk.org/ml/archives/dev/2014-September/005240.html
http://dpdk.org/ml/archives/dev/2014-September/005241.html
http://dpdk.org/ml/archives/dev/2014-September/005274.html


> To conclude, I'd like to propose a merge of the TSO series but I'm currently
> blocked by these questions. Please, could you help me to progress on this?

Let me know if you have any questions? 

> Regards,
> Olivier
> 
> 
> [1] http://dpdk.org/ml/archives/dev/2014-May/002549.html
> [2] http://dpdk.org/ml/archives/dev/2014-October/007156.html
> [3]
> http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=15dbb63ef
> 9e9f108e7dcd837b88234f27a1ec258
> [4]
> http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=77b83017
> 33c3cd946648ca4a1375e3db0a952216
> [5]
> http://dpdk.org/browse/dpdk/commit/lib/librte_mbuf/rte_mbuf.h?id=73b7d59cf
> 4f6faf5ccd83ce706473e75c6fb8c3b
> [6] http://dpdk.org/ml/archives/dev/2014-October/007157.html
  
Olivier Matz Nov. 5, 2014, 10:28 a.m. UTC | #3
Hi Jijiang,

Thank you for your answer. Please find some comments below.

On 11/05/2014 07:02 AM, Liu, Jijiang wrote:
>> First, the code checks if the mbuf has the flag PKT_RX_TUNNEL_IPV4_HDR.
>> What is the meaning of this flag? It was added by [3], but there is no description
>> in comments or in the commit log explaining in which case this flag is set by the
>> driver. The name supposes that this flag is set when the received packet is an IPv4
>> tunnel, but the commit log talks about vxlan.
>
> The flag PKT_RX_TUNNEL_IPV4_HDR can be used for all tunneling packet types with outer IPV4 header.
> For example:
> IPv4 --> GRE/Teredo/VXLAN --> MAC --> IPv4:
> MAC, IPV4, GRENAT, MAC, IPV4, SCTP, PAY4
> MAC, IPV4, GRENAT, MAC, IPV6, UDP, PAY4
> MAC, IPV4, GRENAT, MAC, IPV6, UDP, PAY4
> These tunneling packet formats have a common point that is outer IPv4 header here.
>
> Only VXLAN tunneling packet is supported in DPDK for i40e now, so  the commit log talks about VXLAN .

Is it possible to have a more formal definition? For instance, is the
following definition below correct?

  "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a driver if the packet
   contains a tunneling protocol inside an IPv4 header".

If the definition above is correct, I don't see how this flag can help
an application to run faster. There is already a flag telling if there
is a valid IPv4 header (PKT_RX_IPV4_HDR). As the PKT_RX_TUNNEL_IPV4_HDR
flag does not tell what is ip->proto, the work done by an application
to dissect a packet would be exactly the same with or without this flag.

Please, can you give an example showing in which conditions this flag
can help an application?


>> What is the meaning of this flag?
>> Is it enough to checksum outer L3, inner L3, and inner L4 as specified in commit
>> log? If yes, why are the other flags PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM,
>> (...) added in the mbuf later?
>> In my comprehension, these flags are needed in addition to
>> PKT_TX_VXLAN_CKSUM to do the checksum of the inner headers.
>
> Yes, these flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM)  are needed by HW offload of non-tunneling and tunneling  packet.

OK, so I understand that when PKT_TX_VXLAN_CKSUM is set, if the
driver supports it, it will process IP and UDP checksum of outer
header, using l2_len and l3_len.


>> In general, I need to understand how to use all the new offloading stuff. In [5],
>> new fields were added in the mbuf structure (inner_l3_len and inner_l2_len). I'm
>> not sure I understand which fields have to be filled. Below is my understanding,
>> can you please check that it is correct?
>>
>> A- To send a Ether/IP/TCP packet and process IP and TCP TX
>>      checksum in the NIC:
>>
>>     - set l2_len = 14, l3_len = 20,
>>       ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
>>       write all checksums to 0 in the packet
>     IP checksum is 0, but tcp checksum is not 0.
>     tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);

OK, right. I forgot it but indeed it's in csumonly.c


>> B- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>>      process inner IP and inner TCP TX checksum in the NIC:
>>
>>     - set l2_len = 14+20+8+8+14, l3_len = 20,
>>       ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM,
>>       write all checksums to 0 in the packet
>
> No, l2_len is outer L2 length, its value also is 14.

If you set l2_len to 14, how could the driver guess the offset of
the inner IP header? I'm pretty sure it should work with 14+20+8+8+14.

>> C- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>>      process outer IP and UDP, plus inner IP and inner TCP TX
>>      checksum in the NIC:
>>
>>     - set l2_len = 14, l3_len = 20, inner_l2_len = 14,
>>       inner_l3_len = 20,
> Yes
>>       ol_flags = PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM |
>> PKT_TX_VXLAN_CKSUM,
> Yes
>>       write all checksums to 0 in the packet
>
>     Outer and inner IP checksum is 0, but tcp checksum is not 0.
>     tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);

What about outer udp checksum? Why shouldn't we set it to the
phdr checksum too?


>> D- To send a Ether/IP/UDP/vxlan/Ether/IP/TCP packet and
>>      process outer IP and UDP TX checksum in the NIC:
>>
>>     - set l2_len = 14, l3_len = 20,
>>       ol_flags = PKT_TX_IP_CKSUM | PKT_TX_UDP_CKSUM,
>
> Yes
>>       write all checksums to 0 in the packet
> Outer  IP checksum is 0, but tcp checksum is not 0.
>     tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);
>
>> First, can you confirm this? I think it is very important to document it, as this is a
>> public API that can be used by DPDK users.
>
> These should be included in documents.

Another thing is surprising me.

- if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
   driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.

- if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
   inner_l{23}_len instead of l{23}_len for the same operation.

Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and l3_len.
To fix this, I suggest to remove the new fields inner_l{23}_len then
add outer_l{23}_len instead. Therefore, the semantic of l2_len and
l3_len would not change, and a driver would always use the same field
for a specific offload.

For my TSO development, I will follow the current semantic.


>> To validate my modifications, I will try to reproduce the test plan described in [6].
>> The test report contains useful information but I'm not sure to understand the
>> following:
>>
>>      Enable outer IP,UDP,TCP,SCTP and inner IP,UDP checksum offload
>>      when inner L4 protocal is UDP.
>>         testpmd>tx_checksum set 0 0xf3
>
> "tx_checksum set 0 0xf3" should be "tx_checksum set 0 0x33", but the tester use 0xFX (that is to say, enable all inner TX flag)in order to write test script easily.

To me, there is a strange thing:

- the mbuf flags only allows to checksum l3 (ip), l4 (tcp/udp/sctp),
   and vxlan.
- the user api in command line talks about inner and outer l3 + l4
   layers + vxlan.

Therefore many combinations are impossible to do or are non-sense.
Examples:

- outer IP + outer TCP + inner IP  -> no sense, there is no supported
   tunnel protocol above TCP
- outer IP + inner IP -> impossible to set up in mbuf flags
- outer IP: what is the meaning? same than inner IP? or does it
   means that the application has to parse the packet to find a
   tunnel? In this case, which tunnels are supported by the application?
- ...

Anyway, I will try to continue to use the current logic and document
it as much as I can.


>>      Enable outer IP,UDP,TCP,SCTP and inner IP,TCP,SCTP checksum offload
>>      when inner L4 protocal is TCP or SCTP.
>>         testpmd>tx_checksum set 0 0xfd
>>
>> Can you give details about the signification of the bits used in the tx_checksum
>> command? For instance, there is only one flag to enable tx checksum in the mbuf,
>> so I don't understand how it's possible to do 0x44 = (inner tcp + outer tcp).
>
> These bits meanings/help information  were provided in the patch
> http://dpdk.org/ml/archives/dev/2014-October/007156.html

Sorry, but the commit provides very few information about how to use
these flags (although it was already the case before your commit). I
will try to document it from what I understand from the code.


>> Today, there are hard-coded values for flags: 4 bits (0-3) for legacy checksums, 4
>> bits for inner checksums (4-7), and one bit (bit 11?!) for vlan header. These bits
>> are checked without defines at several places in the code. Moreover, there are
>> some places in code where the testpmd port flags and the mbuf flags are mixed
>
>> up. Example in macswap.c : mb->ol_flags = txp->tx_ol_flags;
> I thought this is an issue.
>
>> I suggest to remove all hardcoded values except in
>> cmd_tx_cksum_set_parsed() and replace them by the flags definitions from
>> rte_mbuf.h. As a result, the new possible arguments of
>> cmd_tx_cksum_set_parsed() will map the mbuf flags:
>>     - a flag to enable ip tx cksum
>>     - a flag to enable udp tx cksum
>>     - a flag to enable tcp tx cksum
>>     - a flag to enable sctp tx cksum
>>     - a flag to enable vxlan tx cksum
>> If vxlan tx cksum is set, therefore the other considered checksum will concern the
>> inner headers. Do you think it matches your needs?
>
> As to HW TX checksum offload, do you have special requirement for implementing TSO?

Yes. TSO implies TX TCP and IP checksum offload.

My first requirement is to understand what is currently implemented, and
the second one is to do build an API that is easily understandable and
usable by someone else.


>> By the way, I was looking at the mbuf structure, and I see that a new packet_type
>> field was added. There is no explanation about what this field should contain and
>> how we shall use it (in commit log or in comments). Can you please explain it?
>
> +	/**
> +	 * The packet type, which is used to indicate ordinary packet and also
> +	 * tunneled packet format, i.e. each number is represented a type of
> +	 * packet.
> +	 */
> +	uint16_t packet_type
> Regarding adding a packet_type in mbuf, we ever had a lot of discussions as follows:
>
> http://dpdk.org/ml/archives/dev/2014-October/007027.html
> http://dpdk.org/ml/archives/dev/2014-September/005240.html
> http://dpdk.org/ml/archives/dev/2014-September/005241.html
> http://dpdk.org/ml/archives/dev/2014-September/005274.html

Sure, there was a lot of discussions... and I still don't understand how
I can use it in my application. The reason is simple: I don't know
what kind of data is stored in this field (no #define in rte_mbuf.h).
This generic field is filled by reading a very specific register of i40e
driver. So:

- I cannot code an application that use it
- I cannot code a pmd that would fill this field

Conclusion: this field is absolutely useless for anyone that has not
read the datasheets of i40e... which is probably the case for 95% of
DPDK users.


Regards,
Olivier
  
Jijiang Liu Nov. 6, 2014, 11:24 a.m. UTC | #4
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Wednesday, November 5, 2014 6:28 PM
> To: Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hi Jijiang,
> 
> Thank you for your answer. Please find some comments below.
> 
> On 11/05/2014 07:02 AM, Liu, Jijiang wrote:
> >> First, the code checks if the mbuf has the flag PKT_RX_TUNNEL_IPV4_HDR.
> >> What is the meaning of this flag? It was added by [3], but there is
> >> no description in comments or in the commit log explaining in which
> >> case this flag is set by the driver. The name supposes that this flag
> >> is set when the received packet is an IPv4 tunnel, but the commit log talks
> about vxlan.
> >
> > The flag PKT_RX_TUNNEL_IPV4_HDR can be used for all tunneling packet types
> with outer IPV4 header.
> > For example:
> > IPv4 --> GRE/Teredo/VXLAN --> MAC --> IPv4:
> > MAC, IPV4, GRENAT, MAC, IPV4, SCTP, PAY4 MAC, IPV4, GRENAT, MAC, IPV6,
> > UDP, PAY4 MAC, IPV4, GRENAT, MAC, IPV6, UDP, PAY4 These tunneling
> > packet formats have a common point that is outer IPv4 header here.
> >
> > Only VXLAN tunneling packet is supported in DPDK for i40e now, so  the commit
> log talks about VXLAN .
> 
> Is it possible to have a more formal definition? For instance, is the following
> definition below correct?
> 
>   "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a driver if the packet
>    contains a tunneling protocol inside an IPv4 header".

Yes, correct.

> If the definition above is correct, I don't see how this flag can help an application
> to run faster. There is already a flag telling if there is a valid IPv4 header
> (PKT_RX_IPV4_HDR). As the PKT_RX_TUNNEL_IPV4_HDR flag does not tell what
> is ip->proto, the work done by an application to dissect a packet would be exactly
> the same with or without this flag.

If the PKT_RX_TUNNEL_IPV4_HDR flag is set, which means driver tell application that incoming packet is encapsulated packet, and application will process / analyse the packet according to tunneling format indicated by packet_type.

In terms of VXLAN packet format (MAC,IPv4,UDP,VXLAN,MAC,IP,TCP,PAY4), if only the PKT_RX_IPV4_HDR flag is set, and application regard its payload as "from VXLAN to PAY4", but actually, the real payload is PAY4.
  
> Please, can you give an example showing in which conditions this flag can help an
> application?

http://dpdk.org/ml/archives/dev/2014-October/007151.html
http://dpdk.org/ml/archives/dev/2014-October/007156.html

We used the PKT_RX_TUNNEL_IPV4_HDR in the two patches to help application identify incoming packet is tunneling packet.
  
Olivier Matz Nov. 6, 2014, 1:08 p.m. UTC | #5
Hello Jijiang,

On 11/06/2014 12:24 PM, Liu, Jijiang wrote:
>> Is it possible to have a more formal definition? For instance, is the following
>> definition below correct?
>>
>>   "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a driver if the packet
>>    contains a tunneling protocol inside an IPv4 header".
> 
> Yes, correct.
> 
>> If the definition above is correct, I don't see how this flag can help an application
>> to run faster. There is already a flag telling if there is a valid IPv4 header
>> (PKT_RX_IPV4_HDR). As the PKT_RX_TUNNEL_IPV4_HDR flag does not tell what
>> is ip->proto, the work done by an application to dissect a packet would be exactly
>> the same with or without this flag.
> 
> If the PKT_RX_TUNNEL_IPV4_HDR flag is set, which means driver tell application that incoming packet is encapsulated packet, and application will process / analyse the packet according to tunneling format indicated by packet_type.

Where is it written that when the PKT_RX_TUNNEL_IPV4_HDR flag is set,
the packet_type is also set?

To which header packet_type refers to? Inner or Outer? Depends?

What are the possible values for packet_type?

Is the PKT_RX_TUNNEL_IPV4_HDR flag set in mbuf related to the commands
rx_vxlan_port add|del? If yes, it should be written in the API!
(assuming this is the right API design)

When the PKT_RX_TUNNEL_IPV4_HDR flag is set, does PKT_RX_IPV4_HDR or
PKT_RX_VLAN_PKT concerns the inner or outer headers? I hope it still
concerns the first one, else it would break many applications relying
on the these flags.

As you can see, today, an application cannot use PKT_RX_TUNNEL_IPV4_HDR
or m->packet_type because it is not documented.


> In terms of VXLAN packet format (MAC,IPv4,UDP,VXLAN,MAC,IP,TCP,PAY4), if only the PKT_RX_IPV4_HDR flag is set, and application regard its payload as "from VXLAN to PAY4", but actually, the real payload is PAY4.
>   
>> Please, can you give an example showing in which conditions this flag can help an
>> application?
> 
> http://dpdk.org/ml/archives/dev/2014-October/007151.html
> http://dpdk.org/ml/archives/dev/2014-October/007156.html
> 
> We used the PKT_RX_TUNNEL_IPV4_HDR in the two patches to help application identify incoming packet is tunneling packet.

As you agreed on "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a
driver", it means that if the flag is not present, the application
should do the check in software. And there are several reasons why
the flag may not be present:
 - the packet is not a VxLAN packet
 - the hw or driver was not able to recognize it (I don't know, maybe
   if there are IP options the hw will not recognize it?)
 - the hw or driver does not support it (all drivers except i40e)

So the application has to provide the software equivalent code
to process PAY4.

The "csum" testpmd forwarding engine is now a bad example because it
is not able to do the same processing in software or hardware. It
now only works with an i40e driver, which was not the case before. Also,
the semantic of the command line arguments changed. Before, the meaning
was "if the flag is set, process the checksum in the NIC, else in SW".
Now, it's "huh... it depends on the flag."

I will submit a rework of the csum fowarding engine to clarify its
behavior.

Regards,
Olivier
  
Jijiang Liu Nov. 6, 2014, 2:27 p.m. UTC | #6
Hi Olivier,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Thursday, November 6, 2014 9:09 PM
> To: Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hello Jijiang,
> 
> On 11/06/2014 12:24 PM, Liu, Jijiang wrote:
> >> Is it possible to have a more formal definition? For instance, is the
> >> following definition below correct?
> >>
> >>   "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a driver if the packet
> >>    contains a tunneling protocol inside an IPv4 header".
> >
> > Yes, correct.
> >
> >> If the definition above is correct, I don't see how this flag can
> >> help an application to run faster. There is already a flag telling if
> >> there is a valid IPv4 header (PKT_RX_IPV4_HDR). As the
> >> PKT_RX_TUNNEL_IPV4_HDR flag does not tell what is ip->proto, the work
> >> done by an application to dissect a packet would be exactly the same with or
> without this flag.
> >
> > If the PKT_RX_TUNNEL_IPV4_HDR flag is set, which means driver tell
> application that incoming packet is encapsulated packet, and application will
> process / analyse the packet according to tunneling format indicated by
> packet_type.
> 
> Where is it written that when the PKT_RX_TUNNEL_IPV4_HDR flag is set, the
> packet_type is also set?
> 
> To which header packet_type refers to? Inner or Outer? Depends?
> 
> What are the possible values for packet_type?
> 
> Is the PKT_RX_TUNNEL_IPV4_HDR flag set in mbuf related to the commands
> rx_vxlan_port add|del? If yes, it should be written in the API!
> (assuming this is the right API design)
> 
> When the PKT_RX_TUNNEL_IPV4_HDR flag is set, does PKT_RX_IPV4_HDR or
> PKT_RX_VLAN_PKT concerns the inner or outer headers? I hope it still concerns
> the first one, else it would break many applications relying on the these flags.
> 
> As you can see, today, an application cannot use PKT_RX_TUNNEL_IPV4_HDR or
> m->packet_type because it is not documented.
> 
> 
> > In terms of VXLAN packet format (MAC,IPv4,UDP,VXLAN,MAC,IP,TCP,PAY4), if
> only the PKT_RX_IPV4_HDR flag is set, and application regard its payload as "from
> VXLAN to PAY4", but actually, the real payload is PAY4.
> >
> >> Please, can you give an example showing in which conditions this flag
> >> can help an application?
> >
> > http://dpdk.org/ml/archives/dev/2014-October/007151.html
> > http://dpdk.org/ml/archives/dev/2014-October/007156.html
> >
> > We used the PKT_RX_TUNNEL_IPV4_HDR in the two patches to help
> application identify incoming packet is tunneling packet.
> 
> As you agreed on "the PKT_RX_TUNNEL_IPV4_HDR flag CAN be set by a driver",
> it means that if the flag is not present, the application should do the check in
> software. And there are several reasons why the flag may not be present:
>  - the packet is not a VxLAN packet
As long as it is tunneling packet with IPv4/6 header, the flag should be set by driver.

>  - the hw or driver was not able to recognize it (I don't know, maybe
>    if there are IP options the hw will not recognize it?) 
>  - the hw or driver does not support it (all drivers except i40e)
E1000/ixgbe don't support VXLAN packet and another tunneling packet, so driver don't need to set this flag.
As to other NICs that support tunneling packet , I don't why HW or driver can't recognize it.

> So the application has to provide the software equivalent code to process PAY4.
> 
> The "csum" testpmd forwarding engine is now a bad example because it is not
> able to do the same processing in software or hardware. It now only works with
> an i40e driver, which was not the case before. Also, the semantic of the command
> line arguments changed. Before, the meaning was "if the flag is set, process the
> checksum in the NIC, else in SW".
> Now, it's "huh... it depends on the flag."


Currently, If the packet is non-tunneling packet, I believe the  "csum" testpmd forwarding engine also works well as before.
we changed the engine as follows, which is compatible with previous implementation.
-		if (pkt_ol_flags & PKT_RX_IPV4_HDR) {
+		if (pkt_ol_flags & (PKT_RX_IPV4_HDR | PKT_RX_TUNNEL_IPV4_HDR)) {
...

-		else if (pkt_ol_flags & PKT_RX_IPV6_HDR) {
+		} else if (pkt_ol_flags & (PKT_RX_IPV6_HDR | PKT_RX_TUNNEL_IPV6_HDR)) {


> I will submit a rework of the csum fowarding engine to clarify its behavior.
OK. good.

> Regards,
> Olivier
  
Yong Wang Nov. 7, 2014, 12:43 a.m. UTC | #7
>> As to HW TX checksum offload, do you have special requirement for implementing TSO?

> Yes. TSO implies TX TCP and IP checksum offload.

Is this a general requirement or something specific to ixgbe/i40e? FWIW, vmxnet3 device does not support tx IP checksum offload but doe support TSO.  In that case, we cannot leave IP checksum field as 0 (the correct checksum needs to be filled in the header) before passing it the the NIC when TSO is enabled.

Yong
  
Olivier Matz Nov. 7, 2014, 5:16 p.m. UTC | #8
Hello Yong,

On 11/07/2014 01:43 AM, Yong Wang wrote:
>>> As to HW TX checksum offload, do you have special requirement for implementing TSO?
> 
>> Yes. TSO implies TX TCP and IP checksum offload.
> 
> Is this a general requirement or something specific to ixgbe/i40e? FWIW,
> vmxnet3 device does not support tx IP checksum offload but doe support
> TSO.  In that case, we cannot leave IP checksum field as 0 (the correct
> checksum needs to be filled in the header) before passing it the the NIC
> when TSO is enabled.

This is a good question because we need to define the proper API that
will work on other PMDs in the future.

Indeed, there is a hardware specificity in ixgbe: when TSO is enabled,
the IP checksum flag must also be passed to the driver if it's IPv4.
From 82599 datasheets (7.2.3.2.4 Advanced Transmit Data Descriptor):

    IXSM (bit 0) — Insert IP Checksum: This field indicates that IP
    checksum must be inserted. In IPv6 mode, it must be reset to 0b.
    If DCMD.TSE and TUCMD.IPV4 are set, IXSM must be set as well.
    If this bit is set, the packet should at least contain an
    IP header.

If we allow the user to give the TSO flag without the IP checksum
flag in mbuf flags, the ixgbe driver would have to set the IP checksum
flag in hardware descriptors if the packet is IPv4. The driver would
have to parse the IP header: this is not a problem as we already need
it for TCP checksum.

To summarize, I think we have 3 options when transmitting a packet to be
segmented using TSO:

- set IP checksum to 0 in the application: in this case, it would
  require additional work in virtual drivers if the peer expects
  to receive a packet with a valid IP checksum. But I'm wondering
  what is the need for calculating a checksum when transmitting on
  a virtual device (the peer receiving the packet knows that the
  packet is not corrupted as it comes from memory). Moreover, if the
  device advertise TSO, I assume it can also advertise IP checksum
  offload.

- calculate the IP checksum in the application. It would take additional
  cycles although it may not be needed as the driver probably knows
  how to calculate it.

- if the driver supports both TSO and IP checksum, the 2 flags MUST
  be given to the driver and the IP checksum must be set to 0 and the
  checksum cannot be calculated in software. If the driver only
  supports TSO, the checksum has to be calculated in software.

Currently, I choosen the first solution, but I'm open to change the
design. Maybe the 3rd one is also a good solution.

By the way, we had the same kind of discussion with Konstantin [1]
about what to do with the TCP checksum. My feeling is that setting it
to the pseudo-header checksum is the best we can do:
 - linux does that
 - many hardware requires that (this is not the case for ixgbe, which
   need a pshdr checksum without the IP len)
 - it can be reused if received by a virtual device and sent to a
   physical device supporting TSO

Best regards,
Olivier


[1] http://dpdk.org/ml/archives/dev/2014-May/002766.html
  
Jijiang Liu Nov. 10, 2014, 6:03 a.m. UTC | #9
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Wednesday, November 5, 2014 6:28 PM
> To: Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hi Jijiang,
> 
> Thank you for your answer. Please find some comments below.
> 
> 
> Another thing is surprising me.
> 
> - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
>    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
If the flag is not set, and imply that it is not VXLAN packet, 
 and do TX checksum offload as regular packet.

> - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
>    inner_l{23}_len instead of l{23}_len for the same operation.
Your understanding is not fully correct.
The l{23}_len is still used for TX checksum offload, please refer to i40e_txd_enable_checksum()  implementation.


> Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and l3_len.
> To fix this, I suggest to remove the new fields inner_l{23}_len then add
> outer_l{23}_len instead. Therefore, the semantic of l2_len and l3_len would not
> change, and a driver would always use the same field for a specific offload.
Oh...
> For my TSO development, I will follow the current semantic.
For TSO, you still can use l{2,3} _len .
When I develop tunneling TSO, I will use inner_l3_len/inner_l4_len.

>
  
Ananyev, Konstantin Nov. 10, 2014, 11:39 a.m. UTC | #10
Hi Oliver,

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Friday, November 07, 2014 5:16 PM
> To: Yong Wang; Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum offload
> 
> Hello Yong,
> 
> On 11/07/2014 01:43 AM, Yong Wang wrote:
> >>> As to HW TX checksum offload, do you have special requirement for implementing TSO?
> >
> >> Yes. TSO implies TX TCP and IP checksum offload.
> >
> > Is this a general requirement or something specific to ixgbe/i40e? FWIW,
> > vmxnet3 device does not support tx IP checksum offload but doe support
> > TSO.  In that case, we cannot leave IP checksum field as 0 (the correct
> > checksum needs to be filled in the header) before passing it the the NIC
> > when TSO is enabled.
> 
> This is a good question because we need to define the proper API that
> will work on other PMDs in the future.
> 
> Indeed, there is a hardware specificity in ixgbe: when TSO is enabled,
> the IP checksum flag must also be passed to the driver if it's IPv4.
> From 82599 datasheets (7.2.3.2.4 Advanced Transmit Data Descriptor):
> 
>     IXSM (bit 0) - Insert IP Checksum: This field indicates that IP
>     checksum must be inserted. In IPv6 mode, it must be reset to 0b.
>     If DCMD.TSE and TUCMD.IPV4 are set, IXSM must be set as well.
>     If this bit is set, the packet should at least contain an
>     IP header.
> 
> If we allow the user to give the TSO flag without the IP checksum
> flag in mbuf flags, the ixgbe driver would have to set the IP checksum
> flag in hardware descriptors if the packet is IPv4. The driver would
> have to parse the IP header: this is not a problem as we already need
> it for TCP checksum.
> 
> To summarize, I think we have 3 options when transmitting a packet to be
> segmented using TSO:
> 
> - set IP checksum to 0 in the application: in this case, it would
>   require additional work in virtual drivers if the peer expects
>   to receive a packet with a valid IP checksum. But I'm wondering
>   what is the need for calculating a checksum when transmitting on
>   a virtual device (the peer receiving the packet knows that the
>   packet is not corrupted as it comes from memory). Moreover, if the
>   device advertise TSO, I assume it can also advertise IP checksum
>   offload.
> 
> - calculate the IP checksum in the application. It would take additional
>   cycles although it may not be needed as the driver probably knows
>   how to calculate it.
> 
> - if the driver supports both TSO and IP checksum, the 2 flags MUST
>   be given to the driver and the IP checksum must be set to 0 and the
>   checksum cannot be calculated in software. If the driver only
>   supports TSO, the checksum has to be calculated in software.
> 
> Currently, I choosen the first solution, but I'm open to change the
> design. Maybe the 3rd one is also a good solution.
> 
> By the way, we had the same kind of discussion with Konstantin [1]
> about what to do with the TCP checksum. My feeling is that setting it
> to the pseudo-header checksum is the best we can do:
>  - linux does that
>  - many hardware requires that (this is not the case for ixgbe, which
>    need a pshdr checksum without the IP len)
>  - it can be reused if received by a virtual device and sent to a
>    physical device supporting TSO

Yes, I remember that discussion.
I still think we better avoid any read/write access of the packet data inside PMD TX routine.
(packet header parsing and/or pseudo-header checksum calculations).
As I said before - if different HW have different requirements of what have to be recalculated for HW TX offloads -
why not introduce a new function dev_prep_tx(portid, queueid, mbuf[], num)?
PMD developer can put all necessary calculations/updates of the packet data and related mbuf fields inside that function.
It would be then a PMD responsibility to provide that function and it would be an app layer responsibility to call it for
mbufs with TX offload flags before calling tx_burst().

Konstantin

> 
> Best regards,
> Olivier
> 
> 
> [1] http://dpdk.org/ml/archives/dev/2014-May/002766.html
  
Olivier Matz Nov. 10, 2014, 3:57 p.m. UTC | #11
Hello Konstantin,

>> By the way, we had the same kind of discussion with Konstantin [1]
>> about what to do with the TCP checksum. My feeling is that setting it
>> to the pseudo-header checksum is the best we can do:
>>  - linux does that
>>  - many hardware requires that (this is not the case for ixgbe, which
>>    need a pshdr checksum without the IP len)
>>  - it can be reused if received by a virtual device and sent to a
>>    physical device supporting TSO
> 
> Yes, I remember that discussion.
> I still think we better avoid any read/write access of the packet data inside PMD TX routine.
> (packet header parsing and/or pseudo-header checksum calculations).
> As I said before - if different HW have different requirements of what have to be recalculated for HW TX offloads -
> why not introduce a new function dev_prep_tx(portid, queueid, mbuf[], num)?
> PMD developer can put all necessary calculations/updates of the packet data and related mbuf fields inside that function.
> It would be then a PMD responsibility to provide that function and it would be an app layer responsibility to call it for
> mbufs with TX offload flags before calling tx_burst().

I think I understand your point: you don't want to touch the packet
in the PMD because the lcore that transmits the packet can be different
than the one that built it. In this case (i.e. a pipeline case),
reading or writing the packet can produce a cache miss, is it correct?

From an API perspective, it looks a bit more complex to have to call
dev_prep_tx() before sending the packets if they have been flagged
for offload processing. But I admit I have no other argument. I'll be
happy to have more comments from other people on the list.

I'm sending a first version of the patchset now as it's ready, it does
not take in account this comment, but I'm open to add it in a v2 if
there is a consensus on this.

Now, knowing that:
- adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
  already requires to set the TCP pseudo header checksum), so adding
  this will change the API of an existing feature
- TSO is a new feature expected for 1.8 (which should be out soon)

Do you think we need to include this for 1.8 or can we postpone your
proposition for after the 1.8 release?


Thank you for your comments,
Regards,
Olivier



>> [1] http://dpdk.org/ml/archives/dev/2014-May/002766.html
  
Olivier Matz Nov. 10, 2014, 4:17 p.m. UTC | #12
Hi Jijiang,

On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
>> Another thing is surprising me.
>>
>> - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
>>    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> If the flag is not set, and imply that it is not VXLAN packet, 
>  and do TX checksum offload as regular packet.
> 
>> - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
>>    inner_l{23}_len instead of l{23}_len for the same operation.
> Your understanding is not fully correct.
> The l{23}_len is still used for TX checksum offload, please refer to i40e_txd_enable_checksum()  implementation.

This fields are part of public mbuf API. You cannot say to refer to
i40e PMD code to understand how to use it.

>> Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and l3_len.
>> To fix this, I suggest to remove the new fields inner_l{23}_len then add
>> outer_l{23}_len instead. Therefore, the semantic of l2_len and l3_len would not
>> change, and a driver would always use the same field for a specific offload.
> Oh...

Does it mean you agree?

>> For my TSO development, I will follow the current semantic.
> For TSO, you still can use l{2,3} _len .
> When I develop tunneling TSO, I will use inner_l3_len/inner_l4_len.

I've just submitted a first version, please feel free to comment it.


Regards,
Olivier
  
Yong Wang Nov. 11, 2014, 12:07 a.m. UTC | #13
On 11/7/14, 9:16 AM, "Olivier MATZ" <olivier.matz@6wind.com> wrote:

>Hello Yong,

>

>On 11/07/2014 01:43 AM, Yong Wang wrote:

>>>> As to HW TX checksum offload, do you have special requirement for

>>>>implementing TSO?

>>

>>> Yes. TSO implies TX TCP and IP checksum offload.

>>

>> Is this a general requirement or something specific to ixgbe/i40e? FWIW,

>> vmxnet3 device does not support tx IP checksum offload but doe support

>> TSO.  In that case, we cannot leave IP checksum field as 0 (the correct

>> checksum needs to be filled in the header) before passing it the the NIC

>> when TSO is enabled.

>

>This is a good question because we need to define the proper API that

>will work on other PMDs in the future.

>

>Indeed, there is a hardware specificity in ixgbe: when TSO is enabled,

>the IP checksum flag must also be passed to the driver if it's IPv4.

>From 82599 datasheets (7.2.3.2.4 Advanced Transmit Data Descriptor):

>

>    IXSM (bit 0) ‹ Insert IP Checksum: This field indicates that IP

>    checksum must be inserted. In IPv6 mode, it must be reset to 0b.

>    If DCMD.TSE and TUCMD.IPV4 are set, IXSM must be set as well.

>    If this bit is set, the packet should at least contain an

>    IP header.

>

>If we allow the user to give the TSO flag without the IP checksum

>flag in mbuf flags, the ixgbe driver would have to set the IP checksum

>flag in hardware descriptors if the packet is IPv4. The driver would

>have to parse the IP header: this is not a problem as we already need

>it for TCP checksum.

>

>To summarize, I think we have 3 options when transmitting a packet to be

>segmented using TSO:

>

>- set IP checksum to 0 in the application: in this case, it would

>  require additional work in virtual drivers if the peer expects

>  to receive a packet with a valid IP checksum. But I'm wondering

>  what is the need for calculating a checksum when transmitting on

>  a virtual device (the peer receiving the packet knows that the

>  packet is not corrupted as it comes from memory). Moreover, if the

>  device advertise TSO, I assume it can also advertise IP checksum

>  offload.


Checksum is still needed if the packet has to be transmitted over the wire.

The device is capable of IP checksum but for various reasons, it is
designed to only support TSO and TCP/UDP checksum. So I guess we still
have to deal with this discrepancy.
 
>

>- calculate the IP checksum in the application. It would take additional

>  cycles although it may not be needed as the driver probably knows

>  how to calculate it.

>

>- if the driver supports both TSO and IP checksum, the 2 flags MUST

>  be given to the driver and the IP checksum must be set to 0 and the

>  checksum cannot be calculated in software. If the driver only

>  supports TSO, the checksum has to be calculated in software.

>

>Currently, I choosen the first solution, but I'm open to change the

>design. Maybe the 3rd one is also a good solution.


I think option (3) is cleaner and can accommodate device differences
without requiring a new API.  But I don’t really have a strong preference
here and I am fine with option (1) or a new API (dev_prep_tx()) as long as
the assumptions/requirements are clearly documented.

Thanks,
Yong

>

>By the way, we had the same kind of discussion with Konstantin [1]

>about what to do with the TCP checksum. My feeling is that setting it

>to the pseudo-header checksum is the best we can do:

> - linux does that

> - many hardware requires that (this is not the case for ixgbe, which

>   need a pshdr checksum without the IP len)

> - it can be reused if received by a virtual device and sent to a

>   physical device supporting TSO

>

>Best regards,

>Olivier

>

>

>[1]

>https://urldefense.proofpoint.com/v2/url?u=http-3A__dpdk.org_ml_archives_d

>ev_2014-2DMay_002766.html&d=AAID-g&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNt

>Xt-uEs&r=44mSO5N5yEs4CeCdtQE0xt0F7J0p67_mApYVAzyYms0&m=Sb_uMbXc4QNWb6fbk2n

>yDga1IfEZQeJUbx731-gSHU4&s=p3oIaLnY_38j2i4oxMGmtBAoQsQbeko01aEUojzSnIo&e=
  
Jijiang Liu Nov. 11, 2014, 5:29 a.m. UTC | #14
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, November 11, 2014 12:17 AM
> To: Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hi Jijiang,
> 
> On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> >> Another thing is surprising me.
> >>
> >> - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> >>    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > If the flag is not set, and imply that it is not VXLAN packet,  and do
> > TX checksum offload as regular packet.
> >
> >> - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> >>    inner_l{23}_len instead of l{23}_len for the same operation.
> > Your understanding is not fully correct.
> > The l{23}_len is still used for TX checksum offload, please refer to
> i40e_txd_enable_checksum()  implementation.
> 
> This fields are part of public mbuf API. You cannot say to refer to i40e PMD code
> to understand how to use it.
> 
> >> Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and l3_len.
> >> To fix this, I suggest to remove the new fields inner_l{23}_len then
> >> add outer_l{23}_len instead. Therefore, the semantic of l2_len and
> >> l3_len would not change, and a driver would always use the same field for a
> specific offload.
> > Oh...
> 
> Does it mean you agree?

I don't agree to change inner_l{23}_len the name.
The reason is that using the "inner" word means  incoming  packet is tunneling packet or encapsulation packet.
if we add  "outer"{2,3}_len  , which will cause confusion when processing non-tunneling packet.


> >> For my TSO development, I will follow the current semantic.
> > For TSO, you still can use l{2,3} _len .
> > When I develop tunneling TSO, I will use inner_l3_len/inner_l4_len.
> 
> I've just submitted a first version, please feel free to comment it.
> 
> 
> Regards,
> Olivier
  
Ananyev, Konstantin Nov. 12, 2014, 9:55 a.m. UTC | #15
Hi Oliver,

> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, November 10, 2014 3:58 PM
> To: Ananyev, Konstantin; Yong Wang; Liu, Jijiang
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum offload
> 
> Hello Konstantin,
> 
> >> By the way, we had the same kind of discussion with Konstantin [1]
> >> about what to do with the TCP checksum. My feeling is that setting it
> >> to the pseudo-header checksum is the best we can do:
> >>  - linux does that
> >>  - many hardware requires that (this is not the case for ixgbe, which
> >>    need a pshdr checksum without the IP len)
> >>  - it can be reused if received by a virtual device and sent to a
> >>    physical device supporting TSO
> >
> > Yes, I remember that discussion.
> > I still think we better avoid any read/write access of the packet data inside PMD TX routine.
> > (packet header parsing and/or pseudo-header checksum calculations).
> > As I said before - if different HW have different requirements of what have to be recalculated for HW TX offloads -
> > why not introduce a new function dev_prep_tx(portid, queueid, mbuf[], num)?
> > PMD developer can put all necessary calculations/updates of the packet data and related mbuf fields inside that function.
> > It would be then a PMD responsibility to provide that function and it would be an app layer responsibility to call it for
> > mbufs with TX offload flags before calling tx_burst().
> 
> I think I understand your point: you don't want to touch the packet
> in the PMD because the lcore that transmits the packet can be different
> than the one that built it. In this case (i.e. a pipeline case),
> reading or writing the packet can produce a cache miss, is it correct?

Yes, it is correct. 
That's one of the main reason why current implementations of TX routines avoid touching packet data. 

> From an API perspective, it looks a bit more complex to have to call
> dev_prep_tx() before sending the packets if they have been flagged
> for offload processing. But I admit I have no other argument. I'll be
> happy to have more comments from other people on the list.
> 
> I'm sending a first version of the patchset now as it's ready, it does
> not take in account this comment, but I'm open to add it in a v2 if
> there is a consensus on this.
> 
> Now, knowing that:
> - adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
>   already requires to set the TCP pseudo header checksum), so adding
>   this will change the API of an existing feature
> - TSO is a new feature expected for 1.8 (which should be out soon)
> 
> Do you think we need to include this for 1.8 or can we postpone your
> proposition for after the 1.8 release?

I'd say it would be good to have it done together with TSO feature.
About changing API: I think existing applications shouldn't be affected.
For existing PMDs/TX offloads we don't change  any rules what need to be filled by the app.
We just add a new function that can do that for user.
If the app fills required manually (as all apps have to do now) it would keep working as expected. 

If you feel like it is too much work for 1.8 timeframe -
can we at least move fix_tcp_phdr_cksum() out of TX PMD as a temporary measure?
Let say create a function  get_ipv4_udptcp_checksum(struct rte_mbuf *m) (in librte_net ?).
It will calculate PSD checksum for both TSO and non-TSO case based on given mbuf flags/fields.
Then we can update testpmd/csumonly.c to use it.

Thanks
Konstantin
 
> Thank you for your comments,
> Regards,
> Olivier
> 
> 
> 
> >> [1] http://dpdk.org/ml/archives/dev/2014-May/002766.html
  
Olivier Matz Nov. 12, 2014, 1:05 p.m. UTC | #16
Hi Konstantin,

On 11/12/2014 10:55 AM, Ananyev, Konstantin wrote:
>>  From an API perspective, it looks a bit more complex to have to call
>> dev_prep_tx() before sending the packets if they have been flagged
>> for offload processing. But I admit I have no other argument. I'll be
>> happy to have more comments from other people on the list.
>>
>> I'm sending a first version of the patchset now as it's ready, it does
>> not take in account this comment, but I'm open to add it in a v2 if
>> there is a consensus on this.
>>
>> Now, knowing that:
>> - adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
>>    already requires to set the TCP pseudo header checksum), so adding
>>    this will change the API of an existing feature
>> - TSO is a new feature expected for 1.8 (which should be out soon)
>>
>> Do you think we need to include this for 1.8 or can we postpone your
>> proposition for after the 1.8 release?
>
> I'd say it would be good to have it done together with TSO feature.
> About changing API: I think existing applications shouldn't be affected.
> For existing PMDs/TX offloads we don't change  any rules what need to be filled by the app.
> We just add a new function that can do that for user.
> If the app fills required manually (as all apps have to do now) it would keep working as expected.

I agree, this proposition could work without changing the current
applications.

> If you feel like it is too much work for 1.8 timeframe -
> can we at least move fix_tcp_phdr_cksum() out of TX PMD as a temporary measure?
> Let say create a function  get_ipv4_udptcp_checksum(struct rte_mbuf *m) (in librte_net ?).
> It will calculate PSD checksum for both TSO and non-TSO case based on given mbuf flags/fields.
> Then we can update testpmd/csumonly.c to use it.

I'm not sure having get_ipv4_udptcp_checksum() in librte_net would
help. The value we have to set in the TCP checksum field depends on the
PMD (altought only ixgbe is supported now). So, it would require
another parameter <portid> and a new PMD eth_ops... which looks very
similar to dev_prep_tx() (except that dev_prep_tx() can be bulked).
I think a stack will not be able to call get_udptcp_checksum(m ,port)
because it does not know the physical port at the time the packet is
built. Moreover, calling a function through a pointer is more efficient
when bulked. So I think the dev_prep_tx() you initially describe is
a better answer to the problem.

I don't know what is the exact timeframe for 1.8, maybe Thomas can help
on this? Depending on it, we have several options:

- implement dev_prep_tx() for 1.8 in the TSO series: this implies that
   the community agrees on this new API. We need to check that it will
   be faster in a pipeline model (I think this is obvious) but also that
   it does not penalize the run-to-completion model: introducing another
   function dev_prep_tx() can result in duplicated tests in the driver
   (ex: test the offload flag values).

- postpone dev_prep_tx() or similar to next version and push the current
   TSO patchset (including the comments done on the list). It does not
   modify the current offload API, it provides the TSO feature on ixgbe
   based on a similar API concept (set the TCP phdr cksum). The drawback
   is a potential performance loss when using a pipeline model.

- another option that you may prefer is to bind the API behavior to
   ixgbe (for 1.8): we can ask the application to set the pseudo-header
   checksum without the IP len when doing TSO, as required by the ixgbe
   driver. Then, for next release, we can think about dev_prep_tx(). The
   drawback of this solution is that we may go back on this choice if the
   dev_prep_tx() approach is not validated by the community.


Regards,
Olivier
  
Thomas Monjalon Nov. 12, 2014, 1:40 p.m. UTC | #17
2014-11-12 14:05, Olivier MATZ:
> On 11/12/2014 10:55 AM, Ananyev, Konstantin wrote:
> >>  From an API perspective, it looks a bit more complex to have to call
> >> dev_prep_tx() before sending the packets if they have been flagged
> >> for offload processing. But I admit I have no other argument. I'll be
> >> happy to have more comments from other people on the list.
> >>
> >> I'm sending a first version of the patchset now as it's ready, it does
> >> not take in account this comment, but I'm open to add it in a v2 if
> >> there is a consensus on this.
> >>
> >> Now, knowing that:
> >> - adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
> >>    already requires to set the TCP pseudo header checksum), so adding
> >>    this will change the API of an existing feature
> >> - TSO is a new feature expected for 1.8 (which should be out soon)
> >>
> >> Do you think we need to include this for 1.8 or can we postpone your
> >> proposition for after the 1.8 release?
> >
> > I'd say it would be good to have it done together with TSO feature.
> > About changing API: I think existing applications shouldn't be affected.
> > For existing PMDs/TX offloads we don't change  any rules what need to be filled by the app.
> > We just add a new function that can do that for user.
> > If the app fills required manually (as all apps have to do now) it would keep working as expected.
> 
> I agree, this proposition could work without changing the current
> applications.
> 
> > If you feel like it is too much work for 1.8 timeframe -
> > can we at least move fix_tcp_phdr_cksum() out of TX PMD as a temporary measure?
> > Let say create a function  get_ipv4_udptcp_checksum(struct rte_mbuf *m) (in librte_net ?).
> > It will calculate PSD checksum for both TSO and non-TSO case based on given mbuf flags/fields.
> > Then we can update testpmd/csumonly.c to use it.
> 
> I'm not sure having get_ipv4_udptcp_checksum() in librte_net would
> help. The value we have to set in the TCP checksum field depends on the
> PMD (altought only ixgbe is supported now). So, it would require
> another parameter <portid> and a new PMD eth_ops... which looks very
> similar to dev_prep_tx() (except that dev_prep_tx() can be bulked).
> I think a stack will not be able to call get_udptcp_checksum(m ,port)
> because it does not know the physical port at the time the packet is
> built. Moreover, calling a function through a pointer is more efficient
> when bulked. So I think the dev_prep_tx() you initially describe is
> a better answer to the problem.
> 
> I don't know what is the exact timeframe for 1.8, maybe Thomas can help
> on this? Depending on it, we have several options:
> 
> - implement dev_prep_tx() for 1.8 in the TSO series: this implies that
>    the community agrees on this new API. We need to check that it will
>    be faster in a pipeline model (I think this is obvious) but also that
>    it does not penalize the run-to-completion model: introducing another
>    function dev_prep_tx() can result in duplicated tests in the driver
>    (ex: test the offload flag values).
> 
> - postpone dev_prep_tx() or similar to next version and push the current
>    TSO patchset (including the comments done on the list). It does not
>    modify the current offload API, it provides the TSO feature on ixgbe
>    based on a similar API concept (set the TCP phdr cksum). The drawback
>    is a potential performance loss when using a pipeline model.
> 
> - another option that you may prefer is to bind the API behavior to
>    ixgbe (for 1.8): we can ask the application to set the pseudo-header
>    checksum without the IP len when doing TSO, as required by the ixgbe
>    driver. Then, for next release, we can think about dev_prep_tx(). The
>    drawback of this solution is that we may go back on this choice if the
>    dev_prep_tx() approach is not validated by the community.

I feel this question is really important and we need more people to review
the API. We'll also need more validation tests and performance checks with
several use cases.

Release is already late and I'm not comfortable with such change now.
The only chance to have dev_prep_tx() in 1.8 would be to quickly have a large
consensus and some benchmarks in pipeline and run to completion models.

Conclusion: we should integrate TSO without dev_prep_tx (option 2 or 3) and
then speed up dev & tests for dev_prep_tx(). This improvement for pipeline
model could go in 2.0 if it's considered too short or risky for 1.8.
Konstantin, could you be in charge of dev_prep_tx() works?

Thanks for the good discussion
  
Ananyev, Konstantin Nov. 12, 2014, 2:39 p.m. UTC | #18
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Wednesday, November 12, 2014 1:06 PM
> To: Ananyev, Konstantin; Yong Wang; Liu, Jijiang
> Cc: dev@dpdk.org; Thomas Monjalon
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum offload
> 
> Hi Konstantin,
> 
> On 11/12/2014 10:55 AM, Ananyev, Konstantin wrote:
> >>  From an API perspective, it looks a bit more complex to have to call
> >> dev_prep_tx() before sending the packets if they have been flagged
> >> for offload processing. But I admit I have no other argument. I'll be
> >> happy to have more comments from other people on the list.
> >>
> >> I'm sending a first version of the patchset now as it's ready, it does
> >> not take in account this comment, but I'm open to add it in a v2 if
> >> there is a consensus on this.
> >>
> >> Now, knowing that:
> >> - adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
> >>    already requires to set the TCP pseudo header checksum), so adding
> >>    this will change the API of an existing feature
> >> - TSO is a new feature expected for 1.8 (which should be out soon)
> >>
> >> Do you think we need to include this for 1.8 or can we postpone your
> >> proposition for after the 1.8 release?
> >
> > I'd say it would be good to have it done together with TSO feature.
> > About changing API: I think existing applications shouldn't be affected.
> > For existing PMDs/TX offloads we don't change  any rules what need to be filled by the app.
> > We just add a new function that can do that for user.
> > If the app fills required manually (as all apps have to do now) it would keep working as expected.
> 
> I agree, this proposition could work without changing the current
> applications.
> 
> > If you feel like it is too much work for 1.8 timeframe -
> > can we at least move fix_tcp_phdr_cksum() out of TX PMD as a temporary measure?
> > Let say create a function  get_ipv4_udptcp_checksum(struct rte_mbuf *m) (in librte_net ?).
> > It will calculate PSD checksum for both TSO and non-TSO case based on given mbuf flags/fields.
> > Then we can update testpmd/csumonly.c to use it.
> 
> I'm not sure having get_ipv4_udptcp_checksum() in librte_net would
> help. The value we have to set in the TCP checksum field depends on the
> PMD (altought only ixgbe is supported now). So, it would require
> another parameter <portid> and a new PMD eth_ops... which looks very
> similar to dev_prep_tx() (except that dev_prep_tx() can be bulked).
> I think a stack will not be able to call get_udptcp_checksum(m ,port)
> because it does not know the physical port at the time the packet is
> built. Moreover, calling a function through a pointer is more efficient
> when bulked. So I think the dev_prep_tx() you initially describe is
> a better answer to the problem.

Yes I understand that it might not be applicable for non-Intel NICs.
Though I thought it is ok as a temporary measure as right now we
support these offloads for Intel NICs only.
Basically my thought was what you proposed as option 3 below.
Why common function in librte_net?
So people don't need to write their own each time.
Plus as I remember all 3 Intel NIC types (ixgbe/igb/i40e) we support have similar 
requirements for what need to be set/calculated for these TX offloads.
So my thought was that having a common function might help to avoid code duplication in future,
If/when will implement dev_prep_tx(). 

> 
> I don't know what is the exact timeframe for 1.8, maybe Thomas can help
> on this? Depending on it, we have several options:
> 
> - implement dev_prep_tx() for 1.8 in the TSO series: this implies that
>    the community agrees on this new API. We need to check that it will
>    be faster in a pipeline model (I think this is obvious) but also that
>    it does not penalize the run-to-completion model: introducing another
>    function dev_prep_tx() can result in duplicated tests in the driver
>    (ex: test the offload flag values).
> 
> - postpone dev_prep_tx() or similar to next version and push the current
>    TSO patchset (including the comments done on the list). It does not
>    modify the current offload API, it provides the TSO feature on ixgbe
>    based on a similar API concept (set the TCP phdr cksum). The drawback
>    is a potential performance loss when using a pipeline model.
> 
> - another option that you may prefer is to bind the API behavior to
>    ixgbe (for 1.8): we can ask the application to set the pseudo-header
>    checksum without the IP len when doing TSO, as required by the ixgbe
>    driver. Then, for next release, we can think about dev_prep_tx(). The
>    drawback of this solution is that we may go back on this choice if the
>    dev_prep_tx() approach is not validated by the community.


My vote would be for option 3 then.

Thanks
Konstantin

> 
> 
> Regards,
> Olivier
  
Olivier Matz Nov. 12, 2014, 2:56 p.m. UTC | #19
Hi Konstantin,

On 11/12/2014 03:39 PM, Ananyev, Konstantin wrote:
>> I'm not sure having get_ipv4_udptcp_checksum() in librte_net would
>> help. The value we have to set in the TCP checksum field depends on the
>> PMD (altought only ixgbe is supported now). So, it would require
>> another parameter <portid> and a new PMD eth_ops... which looks very
>> similar to dev_prep_tx() (except that dev_prep_tx() can be bulked).
>> I think a stack will not be able to call get_udptcp_checksum(m ,port)
>> because it does not know the physical port at the time the packet is
>> built. Moreover, calling a function through a pointer is more efficient
>> when bulked. So I think the dev_prep_tx() you initially describe is
>> a better answer to the problem.
>
> Yes I understand that it might not be applicable for non-Intel NICs.
> Though I thought it is ok as a temporary measure as right now we
> support these offloads for Intel NICs only.
> Basically my thought was what you proposed as option 3 below.
> Why common function in librte_net?
> So people don't need to write their own each time.
> Plus as I remember all 3 Intel NIC types (ixgbe/igb/i40e) we support have similar
> requirements for what need to be set/calculated for these TX offloads.
> So my thought was that having a common function might help to avoid code duplication in future,
> If/when will implement dev_prep_tx().

OK, now I understand better what you suggest. I'll try to implement
the option 3 (below) with a common checksum function in librte_net
in the next version of the TSO patchset.

Regards,
Olivier


>
>>
>> I don't know what is the exact timeframe for 1.8, maybe Thomas can help
>> on this? Depending on it, we have several options:
>>
>> - implement dev_prep_tx() for 1.8 in the TSO series: this implies that
>>     the community agrees on this new API. We need to check that it will
>>     be faster in a pipeline model (I think this is obvious) but also that
>>     it does not penalize the run-to-completion model: introducing another
>>     function dev_prep_tx() can result in duplicated tests in the driver
>>     (ex: test the offload flag values).
>>
>> - postpone dev_prep_tx() or similar to next version and push the current
>>     TSO patchset (including the comments done on the list). It does not
>>     modify the current offload API, it provides the TSO feature on ixgbe
>>     based on a similar API concept (set the TCP phdr cksum). The drawback
>>     is a potential performance loss when using a pipeline model.
>>
>> - another option that you may prefer is to bind the API behavior to
>>     ixgbe (for 1.8): we can ask the application to set the pseudo-header
>>     checksum without the IP len when doing TSO, as required by the ixgbe
>>     driver. Then, for next release, we can think about dev_prep_tx(). The
>>     drawback of this solution is that we may go back on this choice if the
>>     dev_prep_tx() approach is not validated by the community.
  
Thomas Monjalon Nov. 12, 2014, 5:26 p.m. UTC | #20
2014-11-11 05:29, Liu, Jijiang:
> From: Olivier MATZ
> > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > >
> > > If the flag is not set, and imply that it is not VXLAN packet,  and do
> > > TX checksum offload as regular packet.
> > >
> > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > 
> > > Your understanding is not fully correct.
> > > The l{23}_len is still used for TX checksum offload, please refer to
> > > i40e_txd_enable_checksum()  implementation.
> > 
> > This fields are part of public mbuf API. You cannot say to refer to i40e PMD code
> > to understand how to use it.
> > 
> > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and l3_len.
> > > > To fix this, I suggest to remove the new fields inner_l{23}_len then
> > > > add outer_l{23}_len instead. Therefore, the semantic of l2_len and
> > > > l3_len would not change, and a driver would always use the same field for a
> > > > specific offload.
> > > 
> > > Oh...
> > 
> > Does it mean you agree?
> 
> I don't agree to change inner_l{23}_len the name.
> The reason is that using the "inner" word means incoming packet is tunneling packet or encapsulation packet.
> if we add "outer"{2,3}_len, which will cause confusion when processing non-tunneling packet.

Sorry Jijiang, maybe I don't understand what you are saying, but I think
you missed something. Let me explain the problem.

For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner IP, right?
So we must set inner_l{2,3}_len.
It means that PKT_TX_IP_CKSUM requires different fields to be set, depending
of PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic change.
It's not acceptable for an API.

PKT_TX_IP_CKSUM should always be related to l{2,3}_len.
When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.

Please, correct me if I'm wrong or fix the API.

Thanks
  
Ananyev, Konstantin Nov. 12, 2014, 11:14 p.m. UTC | #21
Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, November 12, 2014 1:41 PM
> To: Olivier MATZ; Ananyev, Konstantin; dev@dpdk.org
> Cc: Yong Wang
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum offload
> 
> 2014-11-12 14:05, Olivier MATZ:
> > On 11/12/2014 10:55 AM, Ananyev, Konstantin wrote:
> > >>  From an API perspective, it looks a bit more complex to have to call
> > >> dev_prep_tx() before sending the packets if they have been flagged
> > >> for offload processing. But I admit I have no other argument. I'll be
> > >> happy to have more comments from other people on the list.
> > >>
> > >> I'm sending a first version of the patchset now as it's ready, it does
> > >> not take in account this comment, but I'm open to add it in a v2 if
> > >> there is a consensus on this.
> > >>
> > >> Now, knowing that:
> > >> - adding dev_prep_tx() will also concern hw checksum (TCP L4 checksum
> > >>    already requires to set the TCP pseudo header checksum), so adding
> > >>    this will change the API of an existing feature
> > >> - TSO is a new feature expected for 1.8 (which should be out soon)
> > >>
> > >> Do you think we need to include this for 1.8 or can we postpone your
> > >> proposition for after the 1.8 release?
> > >
> > > I'd say it would be good to have it done together with TSO feature.
> > > About changing API: I think existing applications shouldn't be affected.
> > > For existing PMDs/TX offloads we don't change  any rules what need to be filled by the app.
> > > We just add a new function that can do that for user.
> > > If the app fills required manually (as all apps have to do now) it would keep working as expected.
> >
> > I agree, this proposition could work without changing the current
> > applications.
> >
> > > If you feel like it is too much work for 1.8 timeframe -
> > > can we at least move fix_tcp_phdr_cksum() out of TX PMD as a temporary measure?
> > > Let say create a function  get_ipv4_udptcp_checksum(struct rte_mbuf *m) (in librte_net ?).
> > > It will calculate PSD checksum for both TSO and non-TSO case based on given mbuf flags/fields.
> > > Then we can update testpmd/csumonly.c to use it.
> >
> > I'm not sure having get_ipv4_udptcp_checksum() in librte_net would
> > help. The value we have to set in the TCP checksum field depends on the
> > PMD (altought only ixgbe is supported now). So, it would require
> > another parameter <portid> and a new PMD eth_ops... which looks very
> > similar to dev_prep_tx() (except that dev_prep_tx() can be bulked).
> > I think a stack will not be able to call get_udptcp_checksum(m ,port)
> > because it does not know the physical port at the time the packet is
> > built. Moreover, calling a function through a pointer is more efficient
> > when bulked. So I think the dev_prep_tx() you initially describe is
> > a better answer to the problem.
> >
> > I don't know what is the exact timeframe for 1.8, maybe Thomas can help
> > on this? Depending on it, we have several options:
> >
> > - implement dev_prep_tx() for 1.8 in the TSO series: this implies that
> >    the community agrees on this new API. We need to check that it will
> >    be faster in a pipeline model (I think this is obvious) but also that
> >    it does not penalize the run-to-completion model: introducing another
> >    function dev_prep_tx() can result in duplicated tests in the driver
> >    (ex: test the offload flag values).
> >
> > - postpone dev_prep_tx() or similar to next version and push the current
> >    TSO patchset (including the comments done on the list). It does not
> >    modify the current offload API, it provides the TSO feature on ixgbe
> >    based on a similar API concept (set the TCP phdr cksum). The drawback
> >    is a potential performance loss when using a pipeline model.
> >
> > - another option that you may prefer is to bind the API behavior to
> >    ixgbe (for 1.8): we can ask the application to set the pseudo-header
> >    checksum without the IP len when doing TSO, as required by the ixgbe
> >    driver. Then, for next release, we can think about dev_prep_tx(). The
> >    drawback of this solution is that we may go back on this choice if the
> >    dev_prep_tx() approach is not validated by the community.
> 
> I feel this question is really important and we need more people to review
> the API. We'll also need more validation tests and performance checks with
> several use cases.
> 
> Release is already late and I'm not comfortable with such change now.
> The only chance to have dev_prep_tx() in 1.8 would be to quickly have a large
> consensus and some benchmarks in pipeline and run to completion models.
> 
> Conclusion: we should integrate TSO without dev_prep_tx (option 2 or 3) and
> then speed up dev & tests for dev_prep_tx(). This improvement for pipeline
> model could go in 2.0 if it's considered too short or risky for 1.8.
> Konstantin, could you be in charge of dev_prep_tx() works?

I can have a look at it in 2.0 timeframe.
Unless someone else is interested in doing it before that :)
Konstantin

> 
> Thanks for the good discussion
> --
> Thomas
  
Jijiang Liu Nov. 13, 2014, 5:35 a.m. UTC | #22
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, November 13, 2014 1:26 AM
> To: Liu, Jijiang
> Cc: dev@dpdk.org; Olivier MATZ
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> 2014-11-11 05:29, Liu, Jijiang:
> > From: Olivier MATZ
> > > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > > >
> > > > If the flag is not set, and imply that it is not VXLAN packet,
> > > > and do TX checksum offload as regular packet.
> > > >
> > > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > >
> > > > Your understanding is not fully correct.
> > > > The l{23}_len is still used for TX checksum offload, please refer
> > > > to
> > > > i40e_txd_enable_checksum()  implementation.
> > >
> > > This fields are part of public mbuf API. You cannot say to refer to
> > > i40e PMD code to understand how to use it.
> > >
> > > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and
> l3_len.
> > > > > To fix this, I suggest to remove the new fields inner_l{23}_len
> > > > > then add outer_l{23}_len instead. Therefore, the semantic of
> > > > > l2_len and l3_len would not change, and a driver would always
> > > > > use the same field for a specific offload.
> > > >
> > > > Oh...
> > >
> > > Does it mean you agree?
> >
> > I don't agree to change inner_l{23}_len the name.
> > The reason is that using the "inner" word means incoming packet is tunneling
> packet or encapsulation packet.
> > if we add "outer"{2,3}_len, which will cause confusion when processing non-
> tunneling packet.
> 
> Sorry Jijiang, maybe I don't understand what you are saying, but I think you
> missed something. Let me explain the problem.
> 
> For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
> When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner IP,
> right?
First of all, I want to explain that what PKT_TX_VXLAN_CKSUM meaning is,  when the flag is set, driver know that it need set TX checksum for whole packet, not only for inner part. 

So When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner IP,right?



> So we must set inner_l{2,3}_len.
> It means that PKT_TX_IP_CKSUM requires different fields to be set, depending of
> PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic change.
> It's not acceptable for an API.
> 
> PKT_TX_IP_CKSUM should always be related to l{2,3}_len.
> When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.
> 
> Please, correct me if I'm wrong or fix the API.
> 
> Thanks
> --
> Thomas
  
Jijiang Liu Nov. 13, 2014, 5:39 a.m. UTC | #23
Please Ignore this mail.


> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Liu, Jijiang
> Sent: Thursday, November 13, 2014 1:35 PM
> To: Thomas Monjalon
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> 
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > Sent: Thursday, November 13, 2014 1:26 AM
> > To: Liu, Jijiang
> > Cc: dev@dpdk.org; Olivier MATZ
> > Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx
> > checksum offload
> >
> > 2014-11-11 05:29, Liu, Jijiang:
> > > From: Olivier MATZ
> > > > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > > > >
> > > > > If the flag is not set, and imply that it is not VXLAN packet,
> > > > > and do TX checksum offload as regular packet.
> > > > >
> > > > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > > >
> > > > > Your understanding is not fully correct.
> > > > > The l{23}_len is still used for TX checksum offload, please
> > > > > refer to
> > > > > i40e_txd_enable_checksum()  implementation.
> > > >
> > > > This fields are part of public mbuf API. You cannot say to refer
> > > > to i40e PMD code to understand how to use it.
> > > >
> > > > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and
> > l3_len.
> > > > > > To fix this, I suggest to remove the new fields
> > > > > > inner_l{23}_len then add outer_l{23}_len instead. Therefore,
> > > > > > the semantic of l2_len and l3_len would not change, and a
> > > > > > driver would always use the same field for a specific offload.
> > > > >
> > > > > Oh...
> > > >
> > > > Does it mean you agree?
> > >
> > > I don't agree to change inner_l{23}_len the name.
> > > The reason is that using the "inner" word means incoming packet is
> > > tunneling
> > packet or encapsulation packet.
> > > if we add "outer"{2,3}_len, which will cause confusion when
> > > processing non-
> > tunneling packet.
> >
> > Sorry Jijiang, maybe I don't understand what you are saying, but I
> > think you missed something. Let me explain the problem.
> >
> > For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
> > When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner
> > IP, right?
> First of all, I want to explain that what PKT_TX_VXLAN_CKSUM meaning is,  when
> the flag is set, driver know that it need set TX checksum for whole packet, not
> only for inner part.
> 
> So When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner
> IP,right?
> 
> 
> 
> > So we must set inner_l{2,3}_len.
> > It means that PKT_TX_IP_CKSUM requires different fields to be set,
> > depending of PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic
> change.
> > It's not acceptable for an API.
> >
> > PKT_TX_IP_CKSUM should always be related to l{2,3}_len.
> > When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.
> >
> > Please, correct me if I'm wrong or fix the API.
> >
> > Thanks
> > --
> > Thomas
  
Jijiang Liu Nov. 13, 2014, 6:51 a.m. UTC | #24
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, November 13, 2014 1:26 AM
> To: Liu, Jijiang
> Cc: dev@dpdk.org; Olivier MATZ
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
>
> 2014-11-11 05:29, Liu, Jijiang:
> > From: Olivier MATZ
> > > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > > >
> > > > If the flag is not set, and imply that it is not VXLAN packet,
> > > > and do TX checksum offload as regular packet.
> > > >
> > > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > >
> > > > Your understanding is not fully correct.
> > > > The l{23}_len is still used for TX checksum offload, please refer
> > > > to
> > > > i40e_txd_enable_checksum()  implementation.
> > >
> > > This fields are part of public mbuf API. You cannot say to refer to
> > > i40e PMD code to understand how to use it.
> > >
> > > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and
> l3_len.
> > > > > To fix this, I suggest to remove the new fields inner_l{23}_len
> > > > > then add outer_l{23}_len instead. Therefore, the semantic of
> > > > > l2_len and l3_len would not change, and a driver would always
> > > > > use the same field for a specific offload.
> > > >
> > > > Oh...
> > >
> > > Does it mean you agree?
> >
> > I don't agree to change inner_l{23}_len the name.
> > The reason is that using the "inner" word means incoming packet is tunneling
> packet or encapsulation packet.
> > if we add "outer"{2,3}_len, which will cause confusion when processing non-
> tunneling packet.
>
> Sorry Jijiang, maybe I don't understand what you are saying, but I think you
> missed something. Let me explain the problem.
>
> For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
> When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner IP,
> right?
> So we must set inner_l{2,3}_len.
> It means that PKT_TX_IP_CKSUM requires different fields to be set, depending of
> PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic change.
> It's not acceptable for an API.

I'd like to explain what PKT_TX_VXLAN_CKSUM means, it is to tell driver should set whole VXLAN packet TX checksum according to  L3/L4 flag setting.
VXLAN packet IP checksum  not only include inner IP, but also include outer IP, so when PKT_TX_VXLAN_CKSUM is set, the  PKT_TX_IP_CKSUM is not only related to inner IP, but also IP.   In other words, we use this one flag to set inner IP and outer IP checksum offload at the same time in driver, because it is not necessary to add other flag to stand for inner IP flag

L4 flag usage is the same the L3 flag as well.

> PKT_TX_IP_CKSUM should always be related to l{2,3}_len.

> When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.

> Please, correct me if I'm wrong or fix the API.

Probably we can refer to struct sk_buff in Linux kernel .
Just as a reference!!
struct sk_buff {
...
*       @inner_protocol: Protocol (encapsulation)
 *      @inner_transport_header: Inner transport layer header (encapsulation)
 *      @inner_network_header: Network layer header (encapsulation)
 *      @inner_mac_header: Link layer header (encapsulation)

        __u16                   inner_transport_header;
        __u16                   inner_network_header;
        __u16                   inner_mac_header;
        __u16                   transport_header;
        __u16                   network_header;
        __u16                   mac_header;


> Thanks

> Thomas
  
Thomas Monjalon Nov. 13, 2014, 9:10 a.m. UTC | #25
2014-11-13 06:51, Liu, Jijiang:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2014-11-11 05:29, Liu, Jijiang:
> > > From: Olivier MATZ
> > > > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > > > >
> > > > > If the flag is not set, and imply that it is not VXLAN packet,
> > > > > and do TX checksum offload as regular packet.
> > > > >
> > > > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > > >
> > > > > Your understanding is not fully correct.
> > > > > The l{23}_len is still used for TX checksum offload, please refer
> > > > > to i40e_txd_enable_checksum() implementation.
> > > >
> > > > This fields are part of public mbuf API. You cannot say to refer to
> > > > i40e PMD code to understand how to use it.
> > > >
> > > > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and
> > > > > > l3_len.
> > > > > > To fix this, I suggest to remove the new fields inner_l{23}_len
> > > > > > then add outer_l{23}_len instead. Therefore, the semantic of
> > > > > > l2_len and l3_len would not change, and a driver would always
> > > > > > use the same field for a specific offload.
> > > > >
> > > > > Oh...
> > > >
> > > > Does it mean you agree?
> > >
> > > I don't agree to change inner_l{23}_len the name.
> > > The reason is that using the "inner" word means incoming packet is tunneling
> > > packet or encapsulation packet.
> > > if we add "outer"{2,3}_len, which will cause confusion when processing non-
> > > tunneling packet.
> >
> > Sorry Jijiang, maybe I don't understand what you are saying, but I think you
> > missed something. Let me explain the problem.
> >
> > For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
> > When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner IP,
> > right?
> > So we must set inner_l{2,3}_len.
> > It means that PKT_TX_IP_CKSUM requires different fields to be set, depending of
> > PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic change.
> > It's not acceptable for an API.
> 
> I'd like to explain what PKT_TX_VXLAN_CKSUM means, it is to tell driver
> should set whole VXLAN packet TX checksum according to  L3/L4 flag setting.
> VXLAN packet IP checksum  not only include inner IP, but also include outer
> IP, so when PKT_TX_VXLAN_CKSUM is set, the  PKT_TX_IP_CKSUM is not only
> related to inner IP, but also IP.   In other words, we use this one flag to
> set inner IP and outer IP checksum offload at the same time in driver,
> because it is not necessary to add other flag to stand for inner IP flag

You mean that PKT_TX_VXLAN_CKSUM request hardware checksumming of outer L3,
outer L4, inner L3 and inner L4?
So maybe the name and comments are not enough clear.

> L4 flag usage is the same the L3 flag as well.

What do you mean?

> > PKT_TX_IP_CKSUM should always be related to l{2,3}_len.
> > When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.
> > Please, correct me if I'm wrong or fix the API.
> 
> Probably we can refer to struct sk_buff in Linux kernel .
> Just as a reference!!
> struct sk_buff {
> ...
> *       @inner_protocol: Protocol (encapsulation)
>  *      @inner_transport_header: Inner transport layer header (encapsulation)
>  *      @inner_network_header: Network layer header (encapsulation)
>  *      @inner_mac_header: Link layer header (encapsulation)
> 
>         __u16                   inner_transport_header;
>         __u16                   inner_network_header;
>         __u16                   inner_mac_header;
>         __u16                   transport_header;
>         __u16                   network_header;
>         __u16                   mac_header;

Yes it's a reference. But some things are made differently in DPDK.
Is there a flag PKT_TX_VXLAN_CKSUM in Linux?

I'm not sure what the checksumming API would be.
But I'm sure the VXLAN API is not enough commented.
Olivier is improving documentation of the legacy checksum API:
	http://dpdk.org/ml/archives/dev/2014-November/007956.html
I'd like you do the same thing for VXLAN checksum.

Thanks
  
Jijiang Liu Nov. 14, 2014, 8:15 a.m. UTC | #26
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Thursday, November 13, 2014 5:10 PM
> To: Liu, Jijiang
> Cc: dev@dpdk.org; Olivier MATZ
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> 2014-11-13 06:51, Liu, Jijiang:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > 2014-11-11 05:29, Liu, Jijiang:
> > > > From: Olivier MATZ
> > > > > On 11/10/2014 07:03 AM, Liu, Jijiang wrote:
> > > > > > > - if PKT_TX_VXLAN_CKSUM is not set (legacy use case), then the
> > > > > > >    driver use l2_len and l3_len to offload inner IP/UDP/TCP checksums.
> > > > > >
> > > > > > If the flag is not set, and imply that it is not VXLAN packet,
> > > > > > and do TX checksum offload as regular packet.
> > > > > >
> > > > > > > - if PKT_TX_VXLAN_CKSUM is set, then the driver has to use
> > > > > > >    inner_l{23}_len instead of l{23}_len for the same operation.
> > > > > >
> > > > > > Your understanding is not fully correct.
> > > > > > The l{23}_len is still used for TX checksum offload, please
> > > > > > refer to i40e_txd_enable_checksum() implementation.
> > > > >
> > > > > This fields are part of public mbuf API. You cannot say to refer
> > > > > to i40e PMD code to understand how to use it.
> > > > >
> > > > > > > Adding PKT_TX_VXLAN_CKSUM changes the semantic of l2_len and
> > > > > > > l3_len.
> > > > > > > To fix this, I suggest to remove the new fields
> > > > > > > inner_l{23}_len then add outer_l{23}_len instead. Therefore,
> > > > > > > the semantic of l2_len and l3_len would not change, and a
> > > > > > > driver would always use the same field for a specific offload.
> > > > > >
> > > > > > Oh...
> > > > >
> > > > > Does it mean you agree?
> > > >
> > > > I don't agree to change inner_l{23}_len the name.
> > > > The reason is that using the "inner" word means incoming packet is
> > > > tunneling packet or encapsulation packet.
> > > > if we add "outer"{2,3}_len, which will cause confusion when
> > > > processing non- tunneling packet.
> > >
> > > Sorry Jijiang, maybe I don't understand what you are saying, but I
> > > think you missed something. Let me explain the problem.
> > >
> > > For PKT_TX_IP_CKSUM, we must set l{2,3}_len.
> > > When PKT_TX_VXLAN_CKSUM is set, PKT_TX_IP_CKSUM is related to inner
> > > IP, right?
> > > So we must set inner_l{2,3}_len.
> > > It means that PKT_TX_IP_CKSUM requires different fields to be set,
> > > depending of PKT_TX_VXLAN_CKSUM. That's what Olivier calls a semantic
> change.
> > > It's not acceptable for an API.
> >
> > I'd like to explain what PKT_TX_VXLAN_CKSUM means, it is to tell
> > driver should set whole VXLAN packet TX checksum according to  L3/L4 flag
> setting.
> > VXLAN packet IP checksum  not only include inner IP, but also include
> > outer IP, so when PKT_TX_VXLAN_CKSUM is set, the  PKT_TX_IP_CKSUM is not
> only
> > related to inner IP, but also IP.   In other words, we use this one flag to
> > set inner IP and outer IP checksum offload at the same time in driver,
> > because it is not necessary to add other flag to stand for inner IP
> > flag
> 
> You mean that PKT_TX_VXLAN_CKSUM request hardware checksumming of outer
> L3, outer L4, inner L3 and inner L4?
> So maybe the name and comments are not enough clear.

Yes, PKT_TX_VXLAN_CKSUM request hardware checksum of outerL3, outer L4, inner L3 and inner L4.

BTW, for outer UDP Checksum, generally, It SHOULD be transmitted as zero.  When a packet is received with a UDP checksum of zero, it MUST be accepted for decapsulation.

> > L4 flag usage is the same the L3 flag as well.
> 
> What do you mean?

The PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM,  and PKT_TX_UDP_CKSUM flag are used to  set TX checksum  for outer L4 and inner for tunneling packet.
But for VXLAN, outer L4 protocol must be UDP, so setting PKT_TX_TCP_CKSUM and PKT_TX_SCTP_CKSUM is meaningless for outer L4.


> > > PKT_TX_IP_CKSUM should always be related to l{2,3}_len.
> > > When PKT_TX_VXLAN_CKSUM is set, we should add outer_l{2,3}_len.
> > > Please, correct me if I'm wrong or fix the API.
> >
> > Probably we can refer to struct sk_buff in Linux kernel .
> > Just as a reference!!
> > struct sk_buff {
> > ...
> > *       @inner_protocol: Protocol (encapsulation)
> >  *      @inner_transport_header: Inner transport layer header (encapsulation)
> >  *      @inner_network_header: Network layer header (encapsulation)
> >  *      @inner_mac_header: Link layer header (encapsulation)
> >
> >         __u16                   inner_transport_header;
> >         __u16                   inner_network_header;
> >         __u16                   inner_mac_header;
> >         __u16                   transport_header;
> >         __u16                   network_header;
> >         __u16                   mac_header;
> 
> Yes it's a reference. But some things are made differently in DPDK.
> Is there a flag PKT_TX_VXLAN_CKSUM in Linux?
> 
> I'm not sure what the checksumming API would be.
> But I'm sure the VXLAN API is not enough commented.
> Olivier is improving documentation of the legacy checksum API:
> 	http://dpdk.org/ml/archives/dev/2014-November/007956.html
> I'd like you do the same thing for VXLAN checksum.
Ok , I will improve  document about VXLAN checksum. 
> 
> Thanks
> --
> Thomas
  
Olivier Matz Nov. 14, 2014, 9:09 a.m. UTC | #27
Hi Jijiang,

On 11/14/2014 09:15 AM, Liu, Jijiang wrote:
> 
> Thomas Monjalon wrote:
>>
>> You mean that PKT_TX_VXLAN_CKSUM request hardware checksumming of outer
>> L3, outer L4, inner L3 and inner L4?
>> So maybe the name and comments are not enough clear.
> 
> Yes, PKT_TX_VXLAN_CKSUM request hardware checksum of outerL3, outer L4, inner L3 and inner L4.

I don't understand: it looks in contradiction with our previous
discussion:

Olivier Matz wrote:
> 
> Liu, Jijiang wrote:
>> 
>> Olivier Matz wrote:
>>> What is the
>>> meaning of this flag? Is it enough to checksum outer L3, inner L3, and
>>> inner L4 as specified in commit log? If yes, why are the other flags
>>> PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, (...) added in the mbuf later?
>>> In my comprehension, these flags are needed in addition to
>>> PKT_TX_VXLAN_CKSUM to do the checksum of the inner headers.
>> 
>> Yes, these flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM)  are needed by HW
>> offload of non-tunneling and tunneling  packet. 
> 
> OK, so I understand that when PKT_TX_VXLAN_CKSUM is set, if the
> driver supports it, it will process IP and UDP checksum of outer
> header, using l2_len and l3_len.

So you say that PKT_TX_VXLAN_CKSUM is enough for all inner and outer
headers, but also that PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM are needed.
What occurs if we don't set them?

Now let's say you have an application that receives a TCP packet, then
encaspulate it in vxlan, and forward it. You want to regenerate the
checksum for the new outer headers, but you don't need to change the
inner ones.
You say that setting the PKT_TX_VXLAN_CKSUM will request the hw to
process inner and outer checksum. This is not required in that case.
Also, do you need to set the pseudo header checksum in the TCP inner
header?

Regards,
Olivier
  
Jijiang Liu Nov. 17, 2014, 6:52 a.m. UTC | #28
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Friday, November 14, 2014 5:10 PM
> To: Liu, Jijiang; Thomas Monjalon
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum
> offload
> 
> Hi Jijiang,
> 
> On 11/14/2014 09:15 AM, Liu, Jijiang wrote:
> >
> > Thomas Monjalon wrote:
> >>
> >> You mean that PKT_TX_VXLAN_CKSUM request hardware checksumming of
> >> outer L3, outer L4, inner L3 and inner L4?
> >> So maybe the name and comments are not enough clear.
> >
> > Yes, PKT_TX_VXLAN_CKSUM request hardware checksum of outerL3, outer L4,
> inner L3 and inner L4.
> 
> I don't understand: it looks in contradiction with our previous
> discussion:
> 
> Olivier Matz wrote:
> >
> > Liu, Jijiang wrote:
> >>
> >> Olivier Matz wrote:
> >>> What is the
> >>> meaning of this flag? Is it enough to checksum outer L3, inner L3,
> >>> and inner L4 as specified in commit log? If yes, why are the other
> >>> flags PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, (...) added in the mbuf
> later?
> >>> In my comprehension, these flags are needed in addition to
> >>> PKT_TX_VXLAN_CKSUM to do the checksum of the inner headers.
> >>
> >> Yes, these flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM)  are needed by
> >> HW offload of non-tunneling and tunneling  packet.
> >
> > OK, so I understand that when PKT_TX_VXLAN_CKSUM is set, if the driver
> > supports it, it will process IP and UDP checksum of outer header,
> > using l2_len and l3_len.
> 
> So you say that PKT_TX_VXLAN_CKSUM is enough for all inner and outer headers,
> but also that PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM are needed.
> What occurs if we don't set them?
> 
> Now let's say you have an application that receives a TCP packet, then
> encaspulate it in vxlan, and forward it. You want to regenerate the checksum for
> the new outer headers, but you don't need to change the inner ones.
> You say that setting the PKT_TX_VXLAN_CKSUM will request the hw to process
> inner and outer checksum. This is not required in that case.
> Also, do you need to set the pseudo header checksum in the TCP inner header?

Anyway, I explain the checksum mechanism here again.

In my VXLAN patch set, for an VXLAN packet TX checksum offload,  the logics below:

1. only set outer L3/L4 header TX checksum
    tx_checksum set 0x3(0r 0x1) 0
  In this case, the PKT_TX_VXLAN_CKSUM flag is not set as we don't set inner flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM), so we don't need to change inner ones, driver think it is the ordinary packet,  
HW will do outer L3/L4 checksum offload. 

2. only set inner L3/L4 header TX checksum
    tx_checksum set 0x30 0
  In this case, the PKT_TX_VXLAN_CKSUM flag is set, so driver think it is VXLAN packet, and we don't need to change outer ones because we don't set outer flags here (PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM).

3. set outer L3/L4 TX checksum and inner L3&L4 TX checksum 
tx_checksum set 0x31(0x33) 0
in this case, the PKT_TX_VXLAN_CKSUM flag is set, driver think it is VXLAN packet, and we need to change outer and inner as both outer and inner flags are set.

I'm reviewing your TSO patch to see if your logic is correct in the checksum engine.


> Regards,
> Olivier
  
Olivier Matz Nov. 17, 2014, 11:21 a.m. UTC | #29
Hi Jijiang,

On 11/17/2014 07:52 AM, Liu, Jijiang wrote:
> Anyway, I explain the checksum mechanism here again.
> 
> In my VXLAN patch set, for an VXLAN packet TX checksum offload,  the logics below:
> 
> 1. only set outer L3/L4 header TX checksum
>     tx_checksum set 0x3(0r 0x1) 0
>   In this case, the PKT_TX_VXLAN_CKSUM flag is not set as we don't set inner flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM), so we don't need to change inner ones, driver think it is the ordinary packet,  
> HW will do outer L3/L4 checksum offload. 

Let's take an example with the following packet:
Ether / IP / UDP / VxLAN / Ether / IP / UDP / data

The original behavior (without your vxlan patches), which still
works today, is to select inner or outer using the m->l2_len field:

  - checksum outer IP + UDP
    m->l2_len=14 m->l3_len=20
    flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM

  - checksum inner IP + UDP
    m->l2_len=64 m->l3_len=20
    flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM
    of course, the packet is valid only if the outer IP checksum is
    already correct and outer UDP checksum is 0

If i40e does not act like this, it does not follow the previous API.

> 2. only set inner L3/L4 header TX checksum
>     tx_checksum set 0x30 0
>   In this case, the PKT_TX_VXLAN_CKSUM flag is set, so driver think it is VXLAN packet, and we don't need to change outer ones because we don't set outer flags here (PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM).

As explained above, there is no need to set the PKT_TX_VXLAN_CKSUM if
you only want to set the inner L3/L4 checksum. This was already working
like this before your patches, as long as l2_len and l3_len are set
properly in the mbuf (l2_len should include the outer headers).

Moreover, PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, ... are not "outer flags".
They are hardware checksum flags, and before your vxland patch, they
concerned the headers referenced by m->l2_len and m->l3_len.

With your vxlan patch, it changed without beeing documented. These
flags use either (m->l2_len, m->l3_len) or (m->inner_l2_len,
m->inner_l3_len), which is not a good idea in my opinion.

> 3. set outer L3/L4 TX checksum and inner L3&L4 TX checksum 
> tx_checksum set 0x31(0x33) 0
> in this case, the PKT_TX_VXLAN_CKSUM flag is set, driver think it is VXLAN packet, and we need to change outer and inner as both outer and inner flags are set.

Here you are talking about test pmd flags. You do not describe the
mbuf API: PKT_TX_* flags and lengths values that need to be set (l2_len,
l3_len, ...) and to what they refer to.

I think if you want to explain the vxlan checksum offload mbuf API,
you should not talk about the testpmd flags as:
- they don't match the mbuf flags
- they have undocumented (or uncoherent) behavior in the csumonly
  forward engine

After several exchanges about this vxlan API.
Unfortunately, it is still vague and obscure to me.

So here is a proposition of API documentation that looks
understandable. Maybe it is easier to change the code to match this API:



PKT_TX_IP_CKSUM flag enables hardware computation of IP cksum. To
use it:
- fill l2_len and l3_len in mbuf
- set the flag PKT_TX_IP_CKSUM
- set the ip checksum to 0 in IP header
See (1) and (2).

One value among PKT_TX_L4_NO_CKSUM, PKT_TX_UDP_CKSUM,
PKT_TX_TCP_CKSUM and PKT_TX_SCTP_CKSUM can be assigned to the bits
of PKT_TX_L4_MASK. These flags are used to offload the L4 checksum in
hardware.
The user requires to:
- fill l2_len and l3_len in mbuf
- set the flags PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM or
  PKT_TX_UDP_CKSUM
- calculate the pseudo header checksum and set it in the L4
  header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and
  rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.
See (1) and (2).

The PKT_TX_TCP_SEG flag can be set to enable TCP segmentation
offload for a packet to be transmitted on hardware supporting
TSO:
- set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag
  implies PKT_TX_TCP_CKSUM)
- if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP
  checksum to 0 in the packet
- fill the mbuf offload information: l2_len, l3_len, l4_len,
  tso_segsz
- calculate the pseudo header checksum without taking ip_len in
  accound, and set it in the TCP header. Refer to
  rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() that can be
  used as helpers.
See (1) and (2).

(1) In case the packet is an encapsulated packet, the m->l2_len
    field can include all the outer tunnel headers. These headers
    will remain unmodified by the hardware.

(2) If outer_l2_len and outer_l3_len are not 0, the beginning of
    the inner headers is relative to outer_l2_len + outer_l3_len.


[To replace the PKT_TX_VXLAN_CKSUM, we introduce 2 new flags]

PKT_TX_OUTER_IP_CKSUM flag enables hardware computation of IP cksum
in outer headers. To use it:
- fill outer_l2_len and outer_l3_len in mbuf
- set the flag PKT_TX_OUTER_IP_CKSUM
- set the ip checksum to 0 in outer IP header

One value among PKT_TX_OUTER_L4_NO_CKSUM, PKT_TX_OUTER_UDP_CKSUM,
PKT_TX_OUTER_TCP_CKSUM and PKT_TX_OUTER_SCTP_CKSUM can be assigned
to the bits of PKT_TX_L4_MASK. These flags are used to offload the
outer L4 checksum in hardware.
The user requires to:
- fill outer_l2_len and outer_l3_len in mbuf
- set the flags PKT_TX_OUTER_TCP_CKSUM, PKT_TX_OUTER_SCTP_CKSUM or
  PKT_TX_OUTER_UDP_CKSUM
- calculate the pseudo header checksum and set it in the outer L4
  header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and
  rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.


This proposition has several advantages:
- it is documented :)
- the API is straightforward: inner and outer work in the same
  manner.
- the API already supports other tunnels (IPIP, GRE, STT, ...)
- adding m->outer_* fields allows to keep the same semantic for
  the existing flags. Indeed, it does not map linux skb, but this
  is not an argument. Moreover, linux does not seem to support
  hardware tx checksum of outer+inner headers.


Regards,
Olivier
  
Jijiang Liu Nov. 20, 2014, 7:28 a.m. UTC | #30
Hi,

> -----Original Message-----

> From: Olivier MATZ [mailto:olivier.matz@6wind.com]

> Sent: Monday, November 17, 2014 7:22 PM

> To: Liu, Jijiang; Thomas Monjalon

> Cc: dev@dpdk.org

> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum

> offload

> 

> Hi Jijiang,

> 

> On 11/17/2014 07:52 AM, Liu, Jijiang wrote:

> > Anyway, I explain the checksum mechanism here again.

> >

> > In my VXLAN patch set, for an VXLAN packet TX checksum offload,  the logics

> below:

> >

> > 1. only set outer L3/L4 header TX checksum

> >     tx_checksum set 0x3(0r 0x1) 0

> >   In this case, the PKT_TX_VXLAN_CKSUM flag is not set as we don't set

> > inner flags(PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM), so we don't need to

> change inner ones, driver think it is the ordinary packet, HW will do outer L3/L4

> checksum offload.

> 

> Let's take an example with the following packet:

> Ether / IP / UDP / VxLAN / Ether / IP / UDP / data

> 

> The original behavior (without your vxlan patches), which still works today, is to

> select inner or outer using the m->l2_len field:

> 

>   - checksum outer IP + UDP

>     m->l2_len=14 m->l3_len=20

>     flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM

> 

>   - checksum inner IP + UDP

>     m->l2_len=64 m->l3_len=20

>     flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM

>     of course, the packet is valid only if the outer IP checksum is

>     already correct and outer UDP checksum is 0

> 

> If i40e does not act like this, it does not follow the previous API.


No,  i40e follows this.

> > 2. only set inner L3/L4 header TX checksum

> >     tx_checksum set 0x30 0

> >   In this case, the PKT_TX_VXLAN_CKSUM flag is set, so driver think it is VXLAN

> packet, and we don't need to change outer ones because we don't set outer flags

> here (PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM).



> As explained above, there is no need to set the PKT_TX_VXLAN_CKSUM if you

> only want to set the inner L3/L4 checksum.

> This was already working like this

> before your patches, as long as l2_len and l3_len are set properly in the mbuf

> (l2_len should include the outer headers).


Does VXLAN TX checksum offload or ordinary L2 packet TX checksum offload work?
Have you ever tested it on a NIC that supports VXLAN. 
  
The PKT_TX_VXLAN_CKSUM flag meaning just tell driver this is encapsulation packet, so driver should set TX checksum offload for the packet using outer l2/l3 len, inner l2/l3 len and tunneling header length.

If you don't like this flag name, I can change it for  PKT_TX_TUNNEL_CKSUM, which have more generic meaning.

> Moreover, PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, ... are not "outer flags".

> They are hardware checksum flags, and before your vxland patch, they concerned

> the headers referenced by m->l2_len and m->l3_len.


Actually, the  key point of debate is that you still think the l2_len filed and the l3_len filed  in mbuf are inner part in the case of tunneling, right?  If yes, let me explain what I thought.

As you know, NIC itself is not responsible for packet decapsulation / encapsulation at all. It sends and receives the whole packet, not only for inner part in the case of tunneling. The translation from receive descriptor to mbuf structure is also for the whole packet. And these fields defined in mbuf structure are also for the whole packet, no matter it is tunneling or non-tunneling.

1) We assume that a NIC can't  recognize VXLAN packet, when a packet  with the format  outer IP / outer UDP / VxLAN / Ether / inner IP / inner UDP / data is received, 
 do you think whether l2  header and l3 header length of this packet is outer or inner,  according to my understanding, I think it is outer, and m->l2_len and m->l3_len is also outer. Do you agree?

2) We also assume that a NIC can  recognize VXLAN packet,  but there is no difference between 1)  and 2) on data in mbuf before patching my VXLAN patch, so I also think  m->l2_len and m->l3_len is outer.  Do you agree?
After patching my VXLAN, the inner_l2_len and inner_l3_len were used to stand for inner header part.


> With your vxlan patch, it changed without beeing documented. These flags use

> either (m->l2_len, m->l3_len) or (m->inner_l2_len,

> m->inner_l3_len), which is not a good idea in my opinion.


The PKT_RX_IPV4_HDR  definition is listed below,
#define PKT_RX_IPV4_HDR      (1ULL << 5)  /**< RX packet with IPv4 header. */
I don't think it just stand for inner IP TX checksum offload, I just extend its usage scope in the case of tunneling.  

> > 3. set outer L3/L4 TX checksum and inner L3&L4 TX checksum tx_checksum

> > set 0x31(0x33) 0 in this case, the PKT_TX_VXLAN_CKSUM flag is set,

> > driver think it is VXLAN packet, and we need to change outer and inner as both

> outer and inner flags are set.

> 

> Here you are talking about test pmd flags. You do not describe the mbuf API:

> PKT_TX_* flags and lengths values that need to be set (l2_len, l3_len, ...) and to

> what they refer to.

> 

> I think if you want to explain the vxlan checksum offload mbuf API, you should not

> talk about the testpmd flags as:

> - they don't match the mbuf flags

> - they have undocumented (or uncoherent) behavior in the csumonly

>   forward engine

> 

> After several exchanges about this vxlan API.

> Unfortunately, it is still vague and obscure to me.


As to tunneling packet TX checksum offload, please don't think it is only an inner or outer part.
You should regard it as whole part.

> So here is a proposition of API documentation that looks understandable. Maybe

> it is easier to change the code to match this API:

> 

Ok, thanks.

> 

> PKT_TX_IP_CKSUM flag enables hardware computation of IP cksum. To use it:

> - fill l2_len and l3_len in mbuf

> - set the flag PKT_TX_IP_CKSUM

> - set the ip checksum to 0 in IP header

> See (1) and (2).

> 

> One value among PKT_TX_L4_NO_CKSUM, PKT_TX_UDP_CKSUM,

> PKT_TX_TCP_CKSUM and PKT_TX_SCTP_CKSUM can be assigned to the bits of

> PKT_TX_L4_MASK. These flags are used to offload the L4 checksum in hardware.

> The user requires to:

> - fill l2_len and l3_len in mbuf

> - set the flags PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM or

>   PKT_TX_UDP_CKSUM

> - calculate the pseudo header checksum and set it in the L4

>   header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and

>   rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.

> See (1) and (2).

> 

> The PKT_TX_TCP_SEG flag can be set to enable TCP segmentation offload for a

> packet to be transmitted on hardware supporting

> TSO:

> - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag

>   implies PKT_TX_TCP_CKSUM)

> - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP

>   checksum to 0 in the packet

> - fill the mbuf offload information: l2_len, l3_len, l4_len,

>   tso_segsz

> - calculate the pseudo header checksum without taking ip_len in

>   accound, and set it in the TCP header. Refer to

>   rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() that can be

>   used as helpers.

> See (1) and (2).

> 

> (1) In case the packet is an encapsulated packet, the m->l2_len

>     field can include all the outer tunnel headers. These headers

>     will remain unmodified by the hardware.

> 

> (2) If outer_l2_len and outer_l3_len are not 0, the beginning of

>     the inner headers is relative to outer_l2_len + outer_l3_len.

> 

> 

> [To replace the PKT_TX_VXLAN_CKSUM, we introduce 2 new flags]

> 

> PKT_TX_OUTER_IP_CKSUM flag enables hardware computation of IP cksum in

> outer headers. To use it:

> - fill outer_l2_len and outer_l3_len in mbuf

> - set the flag PKT_TX_OUTER_IP_CKSUM

> - set the ip checksum to 0 in outer IP header

> 

> One value among PKT_TX_OUTER_L4_NO_CKSUM,

> PKT_TX_OUTER_UDP_CKSUM, PKT_TX_OUTER_TCP_CKSUM and

> PKT_TX_OUTER_SCTP_CKSUM can be assigned to the bits of PKT_TX_L4_MASK.

> These flags are used to offload the outer L4 checksum in hardware.

> The user requires to:

> - fill outer_l2_len and outer_l3_len in mbuf

> - set the flags PKT_TX_OUTER_TCP_CKSUM, PKT_TX_OUTER_SCTP_CKSUM or

>   PKT_TX_OUTER_UDP_CKSUM

> - calculate the pseudo header checksum and set it in the outer L4

>   header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and

>   rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.


Good. You provide a common approach.

Actually, I have another common approach,
1. Change PKT_TX_VXLAN_CKSUM to PKT_TX_TUNNEL_CKSUM
2. Add field "uint8_t tun_header_len "(tunneling header length, for example, GRE header )into mbuf structure.
After above change, the API can supports other tunnels.

> 

> This proposition has several advantages:

> - it is documented :)

> - the API is straightforward: inner and outer work in the same

>   manner.

> - the API already supports other tunnels (IPIP, GRE, STT, ...)

> - adding m->outer_* fields allows to keep the same semantic for

>   the existing flags. Indeed, it does not map linux skb, but this

>   is not an argument. Moreover, linux does not seem to support

>   hardware tx checksum of outer+inner headers.


Just as I have mentioned in the previous email,  Linux have already supported hardware tx checksum of outer+inner headers for i40e.

> Regards,

> Olivier
  
Olivier Matz Nov. 20, 2014, 4:36 p.m. UTC | #31
Hi Jijiang,

On 11/20/2014 08:28 AM, Liu, Jijiang wrote:
>> The original behavior (without your vxlan patches), which still works today, is to
>> select inner or outer using the m->l2_len field:
>>
>>    - checksum outer IP + UDP
>>      m->l2_len=14 m->l3_len=20
>>      flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM
>>
>>    - checksum inner IP + UDP
>>      m->l2_len=64 m->l3_len=20
>>      flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM
>>      of course, the packet is valid only if the outer IP checksum is
>>      already correct and outer UDP checksum is 0
>>
>> If i40e does not act like this, it does not follow the previous API.
>
> No,  i40e follows this.

OK. This is assumption (A):
To calculate the inner IP + UDP checksum, you don't need VXLAN flag.
You just acked it.

>>> 2. only set inner L3/L4 header TX checksum
>>>      tx_checksum set 0x30 0
>>>    In this case, the PKT_TX_VXLAN_CKSUM flag is set, so driver think it is VXLAN
>> packet, and we don't need to change outer ones because we don't set outer flags
>> here (PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM).

Assumption (B):
To calculate the inner IP + UDP checksum (this is what you wrote "only
set inner L3/L4 header TX checksum"), you say you set the VXLAN flag.
This is the opposite of (A).

>> As explained above, there is no need to set the PKT_TX_VXLAN_CKSUM if you
>> only want to set the inner L3/L4 checksum.
>> This was already working like this
>> before your patches, as long as l2_len and l3_len are set properly in the mbuf
>> (l2_len should include the outer headers).
>
> Does VXLAN TX checksum offload or ordinary L2 packet TX checksum offload work?
> Have you ever tested it on a NIC that supports VXLAN.

You don't answer the question: which between (A) or (B) is correct.

I'm sorry I don't understand your question above.

I have done no test on i40e, because I don't have access to
this hardware.

> The PKT_TX_VXLAN_CKSUM flag meaning just tell driver this is encapsulation packet, so driver should set TX checksum offload for the packet using outer l2/l3 len, inner l2/l3 len and tunneling header length.
>
> If you don't like this flag name, I can change it for  PKT_TX_TUNNEL_CKSUM, which have more generic meaning.

The problem is not only the name. After tens of mails, I'm still not
able to understand the VxLAN checksum API.

I wanted to rework the csum forward engine code, because it is not
understable today. I wanted to clarify the API. But sorry I think
I'll give up now.

>> Moreover, PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, ... are not "outer flags".
>> They are hardware checksum flags, and before your vxland patch, they concerned
>> the headers referenced by m->l2_len and m->l3_len.
>
> Actually, the  key point of debate is that you still think the l2_len filed and the l3_len filed  in mbuf are inner part in the case of tunneling, right?  If yes, let me explain what I thought.

This is not the only key point of debate. The very first key point is
that the VxLAN checksum offload API is not documented and I'm not
able to rework the csum code to use it.

> As you know, NIC itself is not responsible for packet decapsulation / encapsulation at all. It sends and receives the whole packet, not only for inner part in the case of tunneling. The translation from receive descriptor to mbuf structure is also for the whole packet. And these fields defined in mbuf structure are also for the whole packet, no matter it is tunneling or non-tunneling.
>
> 1) We assume that a NIC can't  recognize VXLAN packet, when a packet  with the format  outer IP / outer UDP / VxLAN / Ether / inner IP / inner UDP / data is received,
>   do you think whether l2  header and l3 header length of this packet is outer or inner,  according to my understanding, I think it is outer, and m->l2_len and m->l3_len is also outer. Do you agree?

The l2_len and l3_len are never set up by any driver on rx side. Your
example does not apply.

These fields are set by the application (a network stack for instance)
to indicate to the driver and hardware where to find the l3 and l4
headers whose checksum need to be calculated.

The l2_len and l3_len does not refer to inner or outer header. It refers
to the header that has to be checksum'd in hardware when the flag is
set. It can be inner or outer. At least, it was the case before the
adding of VxLAN offload feature.


> 2) We also assume that a NIC can  recognize VXLAN packet,  but there is no difference between 1)  and 2) on data in mbuf before patching my VXLAN patch, so I also think  m->l2_len and m->l3_len is outer.  Do you agree?
> After patching my VXLAN, the inner_l2_len and inner_l3_len were used to stand for inner header part.

Your argumentation would make sense if l2_len and l3_len were filled by
a NIC in RX functions. But that's not the case. Today, these fields are
only used in TX when a checksum flag is also set. And I think that a
flag should always refer to the same length fields.

But I'm not the one who decides this, I'm just trying to help to define
an API that makes sense.


>> With your vxlan patch, it changed without beeing documented. These flags use
>> either (m->l2_len, m->l3_len) or (m->inner_l2_len,
>> m->inner_l3_len), which is not a good idea in my opinion.
>
> The PKT_RX_IPV4_HDR  definition is listed below,
> #define PKT_RX_IPV4_HDR      (1ULL << 5)  /**< RX packet with IPv4 header. */
> I don't think it just stand for inner IP TX checksum offload, I just extend its usage scope in the case of tunneling.

If you reread my mail, I was not talking about PKT_RX_IPV4_HDR but
about PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, (etc...) which are TX flags.
I think my previous mail was clear enough:

   They are hardware checksum flags, and before your vxlan patch, they
   concerned the headers referenced by m->l2_len and m->l3_len.

   With your vxlan patch, it changed without beeing documented. These
   flags use either (m->l2_len, m->l3_len) or (m->inner_l2_len,
   m->inner_l3_len), which is not a good idea in my opinion.


>>> 3. set outer L3/L4 TX checksum and inner L3&L4 TX checksum tx_checksum
>>> set 0x31(0x33) 0 in this case, the PKT_TX_VXLAN_CKSUM flag is set,
>>> driver think it is VXLAN packet, and we need to change outer and inner as both
>> outer and inner flags are set.
>>
>> Here you are talking about test pmd flags. You do not describe the mbuf API:
>> PKT_TX_* flags and lengths values that need to be set (l2_len, l3_len, ...) and to
>> what they refer to.
>>
>> I think if you want to explain the vxlan checksum offload mbuf API, you should not
>> talk about the testpmd flags as:
>> - they don't match the mbuf flags
>> - they have undocumented (or uncoherent) behavior in the csumonly
>>    forward engine
>>
>> After several exchanges about this vxlan API.
>> Unfortunately, it is still vague and obscure to me.
>
> As to tunneling packet TX checksum offload, please don't think it is only an inner or outer part.
> You should regard it as whole part.

So what?

>> So here is a proposition of API documentation that looks understandable. Maybe
>> it is easier to change the code to match this API:
>>
> Ok, thanks.
>
>>
>> PKT_TX_IP_CKSUM flag enables hardware computation of IP cksum. To use it:
>> - fill l2_len and l3_len in mbuf
>> - set the flag PKT_TX_IP_CKSUM
>> - set the ip checksum to 0 in IP header
>> See (1) and (2).
>>
>> One value among PKT_TX_L4_NO_CKSUM, PKT_TX_UDP_CKSUM,
>> PKT_TX_TCP_CKSUM and PKT_TX_SCTP_CKSUM can be assigned to the bits of
>> PKT_TX_L4_MASK. These flags are used to offload the L4 checksum in hardware.
>> The user requires to:
>> - fill l2_len and l3_len in mbuf
>> - set the flags PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM or
>>    PKT_TX_UDP_CKSUM
>> - calculate the pseudo header checksum and set it in the L4
>>    header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and
>>    rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.
>> See (1) and (2).
>>
>> The PKT_TX_TCP_SEG flag can be set to enable TCP segmentation offload for a
>> packet to be transmitted on hardware supporting
>> TSO:
>> - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag
>>    implies PKT_TX_TCP_CKSUM)
>> - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP
>>    checksum to 0 in the packet
>> - fill the mbuf offload information: l2_len, l3_len, l4_len,
>>    tso_segsz
>> - calculate the pseudo header checksum without taking ip_len in
>>    accound, and set it in the TCP header. Refer to
>>    rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() that can be
>>    used as helpers.
>> See (1) and (2).
>>
>> (1) In case the packet is an encapsulated packet, the m->l2_len
>>      field can include all the outer tunnel headers. These headers
>>      will remain unmodified by the hardware.
>>
>> (2) If outer_l2_len and outer_l3_len are not 0, the beginning of
>>      the inner headers is relative to outer_l2_len + outer_l3_len.
>>
>>
>> [To replace the PKT_TX_VXLAN_CKSUM, we introduce 2 new flags]
>>
>> PKT_TX_OUTER_IP_CKSUM flag enables hardware computation of IP cksum in
>> outer headers. To use it:
>> - fill outer_l2_len and outer_l3_len in mbuf
>> - set the flag PKT_TX_OUTER_IP_CKSUM
>> - set the ip checksum to 0 in outer IP header
>>
>> One value among PKT_TX_OUTER_L4_NO_CKSUM,
>> PKT_TX_OUTER_UDP_CKSUM, PKT_TX_OUTER_TCP_CKSUM and
>> PKT_TX_OUTER_SCTP_CKSUM can be assigned to the bits of PKT_TX_L4_MASK.
>> These flags are used to offload the outer L4 checksum in hardware.
>> The user requires to:
>> - fill outer_l2_len and outer_l3_len in mbuf
>> - set the flags PKT_TX_OUTER_TCP_CKSUM, PKT_TX_OUTER_SCTP_CKSUM or
>>    PKT_TX_OUTER_UDP_CKSUM
>> - calculate the pseudo header checksum and set it in the outer L4
>>    header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and
>>    rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.
>
> Good. You provide a common approach.
>
> Actually, I have another common approach,
> 1. Change PKT_TX_VXLAN_CKSUM to PKT_TX_TUNNEL_CKSUM
> 2. Add field "uint8_t tun_header_len "(tunneling header length, for example, GRE header )into mbuf structure.
> After above change, the API can supports other tunnels.

No. I don't want that you explain me what should be modified in
the current API, as I don't understand it. I need (the community
needs?) a full definition of the API, like I just did in my previous
mail.

I think my description was clear. Please, do the same effort to
describe the vxlan API from the beginning to the end, and how it
changes (or not) the legacy checksum API.

>> This proposition has several advantages:
>> - it is documented :)
>> - the API is straightforward: inner and outer work in the same
>>    manner.
>> - the API already supports other tunnels (IPIP, GRE, STT, ...)
>> - adding m->outer_* fields allows to keep the same semantic for
>>    the existing flags. Indeed, it does not map linux skb, but this
>>    is not an argument. Moreover, linux does not seem to support
>>    hardware tx checksum of outer+inner headers.
>
> Just as I have mentioned in the previous email,  Linux have already supported hardware tx checksum of outer+inner headers for i40e.

Yes you are right. But I think that's not the point here.

Regards,
Olivier
  
Jijiang Liu Nov. 21, 2014, 5:40 a.m. UTC | #32
> -----Original Message-----

> From: Olivier MATZ [mailto:olivier.matz@6wind.com]

> Sent: Friday, November 21, 2014 12:36 AM

> To: Liu, Jijiang

> Cc: Thomas Monjalon; dev

> Subject: Re: [dpdk-dev] [PATCH v8 10/10] app/testpmd:test VxLAN Tx checksum

> offload

> 

> Hi Jijiang,

> 

> On 11/20/2014 08:28 AM, Liu, Jijiang wrote:

> >> The original behavior (without your vxlan patches), which still works

> >> today, is to select inner or outer using the m->l2_len field:

> >>

> >>    - checksum outer IP + UDP

> >>      m->l2_len=14 m->l3_len=20

> >>      flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM

> >>

> >>    - checksum inner IP + UDP

> >>      m->l2_len=64 m->l3_len=20

> >>      flags=PKT_TX_IP_CKSUM PKT_TX_UDP_CKSUM

> >>      of course, the packet is valid only if the outer IP checksum is

> >>      already correct and outer UDP checksum is 0

> >>

> >> If i40e does not act like this, it does not follow the previous API.

> >

> > No,  i40e follows this.

In this case, I meant that if the packet with format  outer IP / outer UDP / VxLAN / Ether / inner IP / inner UDP / data   is not recognized VXLAN packet, it also can work on 40G If it can work on 10G today.
 
> OK. This is assumption (A):

> To calculate the inner IP + UDP checksum, you don't need VXLAN flag.

> You just acked it.


The VXLAN packet can be recognized after  VXLAN UDP destination port is configured on i40e , 
TX checksum offload must not work if you still set  m->l2_len=64 without  the PKT_TX_VXLAN_CKSUM flag.
Because we need this PKT_TX_VXLAN/TUNNEL_CKSUM to tell driver to set some related tunneling  registers.
 Do you hope that we can use L2 length to check if the packet is tunneling? If yes,  I don't think it makes sense.

As to tunneling parameters, for example, 
L4TUNT L4 Tunneling Type parameter

9:10   L4TUNT L4 Tunneling Type (Teredo / GRE header / VXLAN header) indication:
    00b - No UDP / GRE tunneling (field must be set to zero if EIPT equals to zero)
    01b - UDP tunneling header (Any UDP tunneling, VxLAN and Geneve)
    10b - GRE tunneling header
   Else - reserved

L4 Tunneling length
 
12:18 L4TUNLEN L4 Tunneling Length (Teredo / GRE header / VXLAN header) defined in Words (field must be set to zero
if L4TUNT equals to zero).
• For standard Teredo headers with no additional header payload it should be set to 4 which equals to
8 bytes. If the tunneling header includes proprietary content it should be included as well.
• For IP in GRE it should be set to the length of the GRE header.
• For MAC in GRE or MAC in UDP it should be set to the length of the GRE or UDP headers plus the inner MAC up to including its last Ethertype.
If the L4TUNT is cleared, this field is ignored and must be set to zero.

Olivier, Thomas
I don't know  if you got intel 40G datasheet,  we all known you are focusing on generic concept and programming, I also think it is very important.
But  I think if you can read intel 40G data sheet, you probably understand easily what these 40G patches are for and what we are talking about. This is my personal opinion.


> >>> 2. only set inner L3/L4 header TX checksum

> >>>      tx_checksum set 0x30 0

> >>>    In this case, the PKT_TX_VXLAN_CKSUM flag is set, so driver think

> >>> it is VXLAN

> >> packet, and we don't need to change outer ones because we don't set

> >> outer flags here (PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM).

> 

> Assumption (B):

> To calculate the inner IP + UDP checksum (this is what you wrote "only set inner

> L3/L4 header TX checksum"), you say you set the VXLAN flag.

> This is the opposite of (A).

> 

> >> As explained above, there is no need to set the PKT_TX_VXLAN_CKSUM if

> >> you only want to set the inner L3/L4 checksum.

> >> This was already working like this

> >> before your patches, as long as l2_len and l3_len are set properly in

> >> the mbuf (l2_len should include the outer headers).

> >

> > Does VXLAN TX checksum offload or ordinary L2 packet TX checksum offload

> work?

> > Have you ever tested it on a NIC that supports VXLAN.

> 

> You don't answer the question: which between (A) or (B) is correct.

> 

> I'm sorry I don't understand your question above.

> 

> I have done no test on i40e, because I don't have access to this hardware.

> 

> > The PKT_TX_VXLAN_CKSUM flag meaning just tell driver this is encapsulation

> packet, so driver should set TX checksum offload for the packet using outer l2/l3

> len, inner l2/l3 len and tunneling header length.

> >

> > If you don't like this flag name, I can change it for  PKT_TX_TUNNEL_CKSUM,

> which have more generic meaning.

> 

> The problem is not only the name. After tens of mails, I'm still not able to

> understand the VxLAN checksum API.

> 

> I wanted to rework the csum forward engine code, because it is not understable

> today. I wanted to clarify the API. But sorry I think I'll give up now.

> 

> >> Moreover, PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, ... are not "outer

> flags".

> >> They are hardware checksum flags, and before your vxland patch, they

> >> concerned the headers referenced by m->l2_len and m->l3_len.

> >

> > Actually, the  key point of debate is that you still think the l2_len filed and the

> l3_len filed  in mbuf are inner part in the case of tunneling, right?  If yes, let me

> explain what I thought.

> 

> This is not the only key point of debate. The very first key point is that the VxLAN

> checksum offload API is not documented and I'm not able to rework the csum

> code to use it.

> 

> > As you know, NIC itself is not responsible for packet decapsulation /

> encapsulation at all. It sends and receives the whole packet, not only for inner

> part in the case of tunneling. The translation from receive descriptor to mbuf

> structure is also for the whole packet. And these fields defined in mbuf structure

> are also for the whole packet, no matter it is tunneling or non-tunneling.

> >

> > 1) We assume that a NIC can't  recognize VXLAN packet, when a packet  with

> the format  outer IP / outer UDP / VxLAN / Ether / inner IP / inner UDP / data is

> received,

> >   do you think whether l2  header and l3 header length of this packet is outer or

> inner,  according to my understanding, I think it is outer, and m->l2_len and m-

> >l3_len is also outer. Do you agree?

> 

> The l2_len and l3_len are never set up by any driver on rx side. Your example does

> not apply.

> 

> These fields are set by the application (a network stack for instance) to indicate to

> the driver and hardware where to find the l3 and l4 headers whose checksum

> need to be calculated.

> 

> The l2_len and l3_len does not refer to inner or outer header. It refers to the

> header that has to be checksum'd in hardware when the flag is set. It can be

> inner or outer. At least, it was the case before the adding of VxLAN offload

> feature.

> 

> 

> > 2) We also assume that a NIC can  recognize VXLAN packet,  but there is no

> difference between 1)  and 2) on data in mbuf before patching my VXLAN patch,

> so I also think  m->l2_len and m->l3_len is outer.  Do you agree?

> > After patching my VXLAN, the inner_l2_len and inner_l3_len were used to stand

> for inner header part.

> 

> Your argumentation would make sense if l2_len and l3_len were filled by a NIC in

> RX functions. But that's not the case. Today, these fields are only used in TX when

> a checksum flag is also set. And I think that a flag should always refer to the same

> length fields.

> 

> But I'm not the one who decides this, I'm just trying to help to define an API that

> makes sense.

> 

> 

> >> With your vxlan patch, it changed without beeing documented. These

> >> flags use either (m->l2_len, m->l3_len) or (m->inner_l2_len,

> >> m->inner_l3_len), which is not a good idea in my opinion.

> >

> > The PKT_RX_IPV4_HDR  definition is listed below,

> > #define PKT_RX_IPV4_HDR      (1ULL << 5)  /**< RX packet with IPv4 header. */

> > I don't think it just stand for inner IP TX checksum offload, I just extend its usage

> scope in the case of tunneling.

> 

> If you reread my mail, I was not talking about PKT_RX_IPV4_HDR but about

> PKT_TX_IPV4_CSUM, PKT_TX_UDP_CKSUM, (etc...) which are TX flags.

> I think my previous mail was clear enough:

> 

>    They are hardware checksum flags, and before your vxlan patch, they

>    concerned the headers referenced by m->l2_len and m->l3_len.

> 

>    With your vxlan patch, it changed without beeing documented. These

>    flags use either (m->l2_len, m->l3_len) or (m->inner_l2_len,

>    m->inner_l3_len), which is not a good idea in my opinion.

> 

> 

> >>> 3. set outer L3/L4 TX checksum and inner L3&L4 TX checksum

> >>> tx_checksum set 0x31(0x33) 0 in this case, the PKT_TX_VXLAN_CKSUM

> >>> flag is set, driver think it is VXLAN packet, and we need to change

> >>> outer and inner as both

> >> outer and inner flags are set.

> >>

> >> Here you are talking about test pmd flags. You do not describe the mbuf API:

> >> PKT_TX_* flags and lengths values that need to be set (l2_len,

> >> l3_len, ...) and to what they refer to.

> >>

> >> I think if you want to explain the vxlan checksum offload mbuf API,

> >> you should not talk about the testpmd flags as:

> >> - they don't match the mbuf flags

> >> - they have undocumented (or uncoherent) behavior in the csumonly

> >>    forward engine

> >>

> >> After several exchanges about this vxlan API.

> >> Unfortunately, it is still vague and obscure to me.

> >

> > As to tunneling packet TX checksum offload, please don't think it is only an inner

> or outer part.

> > You should regard it as whole part.

> 

> So what?

> 

> >> So here is a proposition of API documentation that looks

> >> understandable. Maybe it is easier to change the code to match this API:

> >>

> > Ok, thanks.

> >

> >>

> >> PKT_TX_IP_CKSUM flag enables hardware computation of IP cksum. To use it:

> >> - fill l2_len and l3_len in mbuf

> >> - set the flag PKT_TX_IP_CKSUM

> >> - set the ip checksum to 0 in IP header See (1) and (2).

> >>

> >> One value among PKT_TX_L4_NO_CKSUM, PKT_TX_UDP_CKSUM,

> >> PKT_TX_TCP_CKSUM and PKT_TX_SCTP_CKSUM can be assigned to the bits of

> >> PKT_TX_L4_MASK. These flags are used to offload the L4 checksum in

> hardware.

> >> The user requires to:

> >> - fill l2_len and l3_len in mbuf

> >> - set the flags PKT_TX_TCP_CKSUM, PKT_TX_SCTP_CKSUM or

> >>    PKT_TX_UDP_CKSUM

> >> - calculate the pseudo header checksum and set it in the L4

> >>    header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and

> >>    rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.

> >> See (1) and (2).

> >>

> >> The PKT_TX_TCP_SEG flag can be set to enable TCP segmentation offload

> >> for a packet to be transmitted on hardware supporting

> >> TSO:

> >> - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag

> >>    implies PKT_TX_TCP_CKSUM)

> >> - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP

> >>    checksum to 0 in the packet

> >> - fill the mbuf offload information: l2_len, l3_len, l4_len,

> >>    tso_segsz

> >> - calculate the pseudo header checksum without taking ip_len in

> >>    accound, and set it in the TCP header. Refer to

> >>    rte_ipv4_phdr_cksum() and rte_ipv6_phdr_cksum() that can be

> >>    used as helpers.

> >> See (1) and (2).

> >>

> >> (1) In case the packet is an encapsulated packet, the m->l2_len

> >>      field can include all the outer tunnel headers. These headers

> >>      will remain unmodified by the hardware.

> >>

> >> (2) If outer_l2_len and outer_l3_len are not 0, the beginning of

> >>      the inner headers is relative to outer_l2_len + outer_l3_len.

> >>

> >>

> >> [To replace the PKT_TX_VXLAN_CKSUM, we introduce 2 new flags]

> >>

> >> PKT_TX_OUTER_IP_CKSUM flag enables hardware computation of IP cksum

> >> in outer headers. To use it:

> >> - fill outer_l2_len and outer_l3_len in mbuf

> >> - set the flag PKT_TX_OUTER_IP_CKSUM

> >> - set the ip checksum to 0 in outer IP header

> >>

> >> One value among PKT_TX_OUTER_L4_NO_CKSUM,

> PKT_TX_OUTER_UDP_CKSUM,

> >> PKT_TX_OUTER_TCP_CKSUM and PKT_TX_OUTER_SCTP_CKSUM can be

> assigned to

> >> the bits of PKT_TX_L4_MASK.

> >> These flags are used to offload the outer L4 checksum in hardware.

> >> The user requires to:

> >> - fill outer_l2_len and outer_l3_len in mbuf

> >> - set the flags PKT_TX_OUTER_TCP_CKSUM, PKT_TX_OUTER_SCTP_CKSUM or

> >>    PKT_TX_OUTER_UDP_CKSUM

> >> - calculate the pseudo header checksum and set it in the outer L4

> >>    header (only for TCP or UDP). See rte_ipv4_phdr_cksum() and

> >>    rte_ipv6_phdr_cksum().  For SCTP, set the crc field to 0.

> >

> > Good. You provide a common approach.

> >

> > Actually, I have another common approach, 1. Change PKT_TX_VXLAN_CKSUM

> > to PKT_TX_TUNNEL_CKSUM 2. Add field "uint8_t tun_header_len

> > "(tunneling header length, for example, GRE header )into mbuf structure.

> > After above change, the API can supports other tunnels.

> 

> No. I don't want that you explain me what should be modified in the current API,

> as I don't understand it. I need (the community

> needs?) a full definition of the API, like I just did in my previous mail.

> 

> I think my description was clear. Please, do the same effort to describe the vxlan

> API from the beginning to the end, and how it changes (or not) the legacy

> checksum API.

> 

> >> This proposition has several advantages:

> >> - it is documented :)

> >> - the API is straightforward: inner and outer work in the same

> >>    manner.

> >> - the API already supports other tunnels (IPIP, GRE, STT, ...)

> >> - adding m->outer_* fields allows to keep the same semantic for

> >>    the existing flags. Indeed, it does not map linux skb, but this

> >>    is not an argument. Moreover, linux does not seem to support

> >>    hardware tx checksum of outer+inner headers.

> >

> > Just as I have mentioned in the previous email,  Linux have already supported

> hardware tx checksum of outer+inner headers for i40e.

> 

> Yes you are right. But I think that's not the point here.

> 

> Regards,

> Olivier
  

Patch

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index da5d272..757c399 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -310,13 +310,17 @@  static void cmd_help_long_parsed(void *parsed_result,
 			"    Disable hardware insertion of a VLAN header in"
 			" packets sent on a port.\n\n"
 
-			"tx_checksum set mask (port_id)\n"
+			"tx_checksum set (mask) (port_id)\n"
 			"    Enable hardware insertion of checksum offload with"
-			" the 4-bit mask, 0~0xf, in packets sent on a port.\n"
+			" the 8-bit mask, 0~0xff, in packets sent on a port.\n"
 			"        bit 0 - insert ip   checksum offload if set\n"
 			"        bit 1 - insert udp  checksum offload if set\n"
 			"        bit 2 - insert tcp  checksum offload if set\n"
 			"        bit 3 - insert sctp checksum offload if set\n"
+			"        bit 4 - insert inner ip  checksum offload if set\n"
+			"        bit 5 - insert inner udp checksum offload if set\n"
+			"        bit 6 - insert inner tcp checksum offload if set\n"
+			"        bit 7 - insert inner sctp checksum offload if set\n"
 			"    Please check the NIC datasheet for HW limits.\n\n"
 
 			"set fwd (%s)\n"
@@ -2763,8 +2767,9 @@  cmdline_parse_inst_t cmd_tx_cksum_set = {
 	.f = cmd_tx_cksum_set_parsed,
 	.data = NULL,
 	.help_str = "enable hardware insertion of L3/L4checksum with a given "
-	"mask in packets sent on a port, the bit mapping is given as, Bit 0 for ip"
-	"Bit 1 for UDP, Bit 2 for TCP, Bit 3 for SCTP",
+	"mask in packets sent on a port, the bit mapping is given as, Bit 0 for ip, "
+	"Bit 1 for UDP, Bit 2 for TCP, Bit 3 for SCTP, Bit 4 for inner ip, "
+	"Bit 5 for inner UDP, Bit 6 for inner TCP, Bit 7 for inner SCTP",
 	.tokens = {
 		(void *)&cmd_tx_cksum_set_tx_cksum,
 		(void *)&cmd_tx_cksum_set_set,
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 2a1b93f..9bc08f4 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1753,9 +1753,9 @@  tx_cksum_set(portid_t port_id, uint64_t ol_flags)
 	uint64_t tx_ol_flags;
 	if (port_id_is_invalid(port_id))
 		return;
-	/* Clear last 4 bits and then set L3/4 checksum mask again */
-	tx_ol_flags = ports[port_id].tx_ol_flags & (~0x0Full);
-	ports[port_id].tx_ol_flags = ((ol_flags & 0xf) | tx_ol_flags);
+	/* Clear last 8 bits and then set L3/4 checksum mask again */
+	tx_ol_flags = ports[port_id].tx_ol_flags & (~0x0FFull);
+	ports[port_id].tx_ol_flags = ((ol_flags & 0xff) | tx_ol_flags);
 }
 
 void
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index fcc4876..3967476 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -209,10 +209,16 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 	struct rte_mbuf  *mb;
 	struct ether_hdr *eth_hdr;
 	struct ipv4_hdr  *ipv4_hdr;
+	struct ether_hdr *inner_eth_hdr;
+	struct ipv4_hdr  *inner_ipv4_hdr = NULL;
 	struct ipv6_hdr  *ipv6_hdr;
+	struct ipv6_hdr  *inner_ipv6_hdr = NULL;
 	struct udp_hdr   *udp_hdr;
+	struct udp_hdr   *inner_udp_hdr;
 	struct tcp_hdr   *tcp_hdr;
+	struct tcp_hdr   *inner_tcp_hdr;
 	struct sctp_hdr  *sctp_hdr;
+	struct sctp_hdr  *inner_sctp_hdr;
 
 	uint16_t nb_rx;
 	uint16_t nb_tx;
@@ -221,12 +227,17 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint64_t pkt_ol_flags;
 	uint64_t tx_ol_flags;
 	uint16_t l4_proto;
+	uint16_t inner_l4_proto = 0;
 	uint16_t eth_type;
 	uint8_t  l2_len;
 	uint8_t  l3_len;
+	uint8_t  inner_l2_len = 0;
+	uint8_t  inner_l3_len = 0;
 
 	uint32_t rx_bad_ip_csum;
 	uint32_t rx_bad_l4_csum;
+	uint8_t  ipv4_tunnel;
+	uint8_t  ipv6_tunnel;
 
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
 	uint64_t start_tsc;
@@ -262,7 +273,10 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 		l2_len  = sizeof(struct ether_hdr);
 		pkt_ol_flags = mb->ol_flags;
 		ol_flags = (pkt_ol_flags & (~PKT_TX_L4_MASK));
-
+		ipv4_tunnel = (pkt_ol_flags & PKT_RX_TUNNEL_IPV4_HDR) ?
+				1 : 0;
+		ipv6_tunnel = (pkt_ol_flags & PKT_RX_TUNNEL_IPV6_HDR) ?
+				1 : 0;
 		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 		if (eth_type == ETHER_TYPE_VLAN) {
@@ -295,7 +309,7 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 		 *      + ipv4 or ipv6
 		 *      + udp or tcp or sctp or others
 		 */
-		if (pkt_ol_flags & PKT_RX_IPV4_HDR) {
+		if (pkt_ol_flags & (PKT_RX_IPV4_HDR | PKT_RX_TUNNEL_IPV4_HDR)) {
 
 			/* Do not support ipv4 option field */
 			l3_len = sizeof(struct ipv4_hdr) ;
@@ -325,17 +339,92 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 				if (tx_ol_flags & 0x2) {
 					/* HW Offload */
 					ol_flags |= PKT_TX_UDP_CKSUM;
-					/* Pseudo header sum need be set properly */
-					udp_hdr->dgram_cksum = get_ipv4_psd_sum(ipv4_hdr);
+					if (ipv4_tunnel)
+						udp_hdr->dgram_cksum = 0;
+					else
+						/* Pseudo header sum need be set properly */
+						udp_hdr->dgram_cksum =
+							get_ipv4_psd_sum(ipv4_hdr);
 				}
 				else {
 					/* SW Implementation, clear checksum field first */
 					udp_hdr->dgram_cksum = 0;
 					udp_hdr->dgram_cksum = get_ipv4_udptcp_checksum(ipv4_hdr,
-							(uint16_t*)udp_hdr);
+									(uint16_t *)udp_hdr);
 				}
-			}
-			else if (l4_proto == IPPROTO_TCP){
+
+				if (ipv4_tunnel) {
+
+					uint16_t len;
+
+					/* Check if inner L3/L4 checkum flag is set */
+					if (tx_ol_flags & 0xF0)
+						ol_flags |= PKT_TX_VXLAN_CKSUM;
+
+					inner_l2_len  = sizeof(struct ether_hdr);
+					inner_eth_hdr = (struct ether_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + l2_len + l3_len
+								 + ETHER_VXLAN_HLEN);
+
+					eth_type = rte_be_to_cpu_16(inner_eth_hdr->ether_type);
+					if (eth_type == ETHER_TYPE_VLAN) {
+						inner_l2_len += sizeof(struct vlan_hdr);
+						eth_type = rte_be_to_cpu_16(*(uint16_t *)
+							((uintptr_t)&eth_hdr->ether_type +
+								sizeof(struct vlan_hdr)));
+					}
+
+					len = l2_len + l3_len + ETHER_VXLAN_HLEN + inner_l2_len;
+					if (eth_type == ETHER_TYPE_IPv4) {
+						inner_l3_len = sizeof(struct ipv4_hdr);
+						inner_ipv4_hdr = (struct ipv4_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len);
+						inner_l4_proto = inner_ipv4_hdr->next_proto_id;
+
+						if (tx_ol_flags & 0x10) {
+
+							/* Do not delete, this is required by HW*/
+							inner_ipv4_hdr->hdr_checksum = 0;
+							ol_flags |= PKT_TX_IPV4_CSUM;
+						}
+
+					} else if (eth_type == ETHER_TYPE_IPv6) {
+						inner_l3_len = sizeof(struct ipv6_hdr);
+						inner_ipv6_hdr = (struct ipv6_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len);
+						inner_l4_proto = inner_ipv6_hdr->proto;
+					}
+					if ((inner_l4_proto == IPPROTO_UDP) && (tx_ol_flags & 0x20)) {
+
+						/* HW Offload */
+						ol_flags |= PKT_TX_UDP_CKSUM;
+						inner_udp_hdr = (struct udp_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len + inner_l3_len);
+						if (eth_type == ETHER_TYPE_IPv4)
+							inner_udp_hdr->dgram_cksum = get_ipv4_psd_sum(inner_ipv4_hdr);
+						else if (eth_type == ETHER_TYPE_IPv6)
+							inner_udp_hdr->dgram_cksum = get_ipv6_psd_sum(inner_ipv6_hdr);
+
+					} else if ((inner_l4_proto == IPPROTO_TCP) && (tx_ol_flags & 0x40)) {
+						/* HW Offload */
+						ol_flags |= PKT_TX_TCP_CKSUM;
+						inner_tcp_hdr = (struct tcp_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len + inner_l3_len);
+						if (eth_type == ETHER_TYPE_IPv4)
+							inner_tcp_hdr->cksum = get_ipv4_psd_sum(inner_ipv4_hdr);
+						else if (eth_type == ETHER_TYPE_IPv6)
+							inner_tcp_hdr->cksum = get_ipv6_psd_sum(inner_ipv6_hdr);
+					} else if ((inner_l4_proto == IPPROTO_SCTP) && (tx_ol_flags & 0x80)) {
+						/* HW Offload */
+						ol_flags |= PKT_TX_SCTP_CKSUM;
+						inner_sctp_hdr = (struct sctp_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len + inner_l3_len);
+						inner_sctp_hdr->cksum = 0;
+					}
+
+				}
+
+			} else if (l4_proto == IPPROTO_TCP) {
 				tcp_hdr = (struct tcp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
 				if (tx_ol_flags & 0x4) {
@@ -347,8 +436,7 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 					tcp_hdr->cksum = get_ipv4_udptcp_checksum(ipv4_hdr,
 							(uint16_t*)tcp_hdr);
 				}
-			}
-			else if (l4_proto == IPPROTO_SCTP) {
+			} else if (l4_proto == IPPROTO_SCTP) {
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
 
@@ -367,9 +455,7 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 				}
 			}
 			/* End of L4 Handling*/
-		}
-		else if (pkt_ol_flags & PKT_RX_IPV6_HDR) {
-
+		} else if (pkt_ol_flags & (PKT_RX_IPV6_HDR | PKT_RX_TUNNEL_IPV6_HDR)) {
 			ipv6_hdr = (struct ipv6_hdr *) (rte_pktmbuf_mtod(mb,
 					unsigned char *) + l2_len);
 			l3_len = sizeof(struct ipv6_hdr) ;
@@ -382,15 +468,93 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 				if (tx_ol_flags & 0x2) {
 					/* HW Offload */
 					ol_flags |= PKT_TX_UDP_CKSUM;
-					udp_hdr->dgram_cksum = get_ipv6_psd_sum(ipv6_hdr);
+					if (ipv6_tunnel)
+						udp_hdr->dgram_cksum = 0;
+					else
+						udp_hdr->dgram_cksum =
+							get_ipv6_psd_sum(ipv6_hdr);
 				}
 				else {
 					/* SW Implementation */
 					/* checksum field need be clear first */
 					udp_hdr->dgram_cksum = 0;
 					udp_hdr->dgram_cksum = get_ipv6_udptcp_checksum(ipv6_hdr,
-							(uint16_t*)udp_hdr);
+								(uint16_t *)udp_hdr);
 				}
+
+				if (ipv6_tunnel) {
+
+					uint16_t len;
+
+					/* Check if inner L3/L4 checksum flag is set */
+					if (tx_ol_flags & 0xF0)
+						ol_flags |= PKT_TX_VXLAN_CKSUM;
+
+					inner_l2_len  = sizeof(struct ether_hdr);
+					inner_eth_hdr = (struct ether_hdr *) (rte_pktmbuf_mtod(mb,
+						unsigned char *) + l2_len + l3_len + ETHER_VXLAN_HLEN);
+					eth_type = rte_be_to_cpu_16(inner_eth_hdr->ether_type);
+
+					if (eth_type == ETHER_TYPE_VLAN) {
+						inner_l2_len += sizeof(struct vlan_hdr);
+						eth_type = rte_be_to_cpu_16(*(uint16_t *)
+							((uintptr_t)&eth_hdr->ether_type +
+							sizeof(struct vlan_hdr)));
+					}
+
+					len = l2_len + l3_len + ETHER_VXLAN_HLEN + inner_l2_len;
+
+					if (eth_type == ETHER_TYPE_IPv4) {
+						inner_l3_len = sizeof(struct ipv4_hdr);
+						inner_ipv4_hdr = (struct ipv4_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len);
+						inner_l4_proto = inner_ipv4_hdr->next_proto_id;
+
+						/* HW offload */
+						if (tx_ol_flags & 0x10) {
+
+							/* Do not delete, this is required by HW*/
+							inner_ipv4_hdr->hdr_checksum = 0;
+							ol_flags |= PKT_TX_IPV4_CSUM;
+						}
+					} else if (eth_type == ETHER_TYPE_IPv6) {
+						inner_l3_len = sizeof(struct ipv6_hdr);
+						inner_ipv6_hdr = (struct ipv6_hdr *) (rte_pktmbuf_mtod(mb,
+							unsigned char *) + len);
+						inner_l4_proto = inner_ipv6_hdr->proto;
+					}
+
+					if ((inner_l4_proto == IPPROTO_UDP) && (tx_ol_flags & 0x20)) {
+						inner_udp_hdr = (struct udp_hdr *) (rte_pktmbuf_mtod(mb,
+							unsigned char *) + len + inner_l3_len);
+						/* HW offload */
+						ol_flags |= PKT_TX_UDP_CKSUM;
+						inner_udp_hdr->dgram_cksum = 0;
+						if (eth_type == ETHER_TYPE_IPv4)
+							inner_udp_hdr->dgram_cksum = get_ipv4_psd_sum(inner_ipv4_hdr);
+						else if (eth_type == ETHER_TYPE_IPv6)
+							inner_udp_hdr->dgram_cksum = get_ipv6_psd_sum(inner_ipv6_hdr);
+					} else if ((inner_l4_proto == IPPROTO_TCP) && (tx_ol_flags & 0x40)) {
+						/* HW offload */
+						ol_flags |= PKT_TX_TCP_CKSUM;
+						inner_tcp_hdr = (struct tcp_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len + inner_l3_len);
+
+						if (eth_type == ETHER_TYPE_IPv4)
+							inner_tcp_hdr->cksum = get_ipv4_psd_sum(inner_ipv4_hdr);
+						else if (eth_type == ETHER_TYPE_IPv6)
+							inner_tcp_hdr->cksum = get_ipv6_psd_sum(inner_ipv6_hdr);
+
+					} else if ((inner_l4_proto == IPPROTO_SCTP) && (tx_ol_flags & 0x80)) {
+						/* HW offload */
+						ol_flags |= PKT_TX_SCTP_CKSUM;
+						inner_sctp_hdr = (struct sctp_hdr *) (rte_pktmbuf_mtod(mb,
+								unsigned char *) + len + inner_l3_len);
+						inner_sctp_hdr->cksum = 0;
+					}
+
+				}
+
 			}
 			else if (l4_proto == IPPROTO_TCP) {
 				tcp_hdr = (struct tcp_hdr*) (rte_pktmbuf_mtod(mb,
@@ -434,6 +598,8 @@  pkt_burst_checksum_forward(struct fwd_stream *fs)
 		/* Combine the packet header write. VLAN is not consider here */
 		mb->l2_len = l2_len;
 		mb->l3_len = l3_len;
+		mb->inner_l2_len = inner_l2_len;
+		mb->inner_l3_len = inner_l3_len;
 		mb->ol_flags = ol_flags;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);