Message ID | 1409759378-10113-12-git-send-email-bruce.richardson@intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers |
From: Bruce Richardson <bruce.richardson@intel.com> To: dev@dpdk.org Date: Wed, 3 Sep 2014 16:49:36 +0100 Message-Id: <1409759378-10113-12-git-send-email-bruce.richardson@intel.com> In-Reply-To: <1409759378-10113-1-git-send-email-bruce.richardson@intel.com> References: <1409759378-10113-1-git-send-email-bruce.richardson@intel.com> Subject: [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line List-Id: patches and discussions about DPDK <dev.dpdk.org> |
Commit Message
Bruce Richardson
Sept. 3, 2014, 3:49 p.m. UTC
The l2_len and l3_len fields are used for TX offloads and so should be
put on the second cache line, along with the other fields only used on
TX.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
lib/librte_mbuf/rte_mbuf.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
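To see why the placement matters, the layout effect the commit message describes can be sketched with a toy structure (this is not the real rte_mbuf; the struct and field names here are illustrative): in a 64-byte-aligned struct, any field whose offset is 64 or more lands on the second cache line, so reading it on the RX fast path would cost an extra cache miss.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Toy illustration of the mbuf split: fields placed after the first
 * 64 bytes of a cache-aligned struct live on the second cache line. */
struct toy_mbuf {
	uint8_t rx_fields[CACHE_LINE_SIZE]; /* stand-in for RX fast-path fields */
	/* second cache line starts here: TX-only fields, as in this patch */
	uint16_t l3_len;
	uint16_t l2_len;
} __attribute__((aligned(CACHE_LINE_SIZE)));
```

A quick offsetof() check confirms that both length fields fall on cache line 1 in this layout, while everything in rx_fields stays on line 0.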
Comments
Hello Bruce,

I'm a little bit concerned about the performance issues that would arise if these fields went to the 2nd cache line.

For example, the l2_len and l3_len fields are used by librte_ip_frag to find the L3 and L4 header positions inside the mbuf data. These values therefore have to be calculated, by NIC offload or by the user, on the RX leg.

Secondly (I wouldn't say on behalf of everyone, but) we use these fields in our libraries as well, for classification needs. For instance, you may want to support other ethertypes which are not supported by NIC offload (MPLS, IPX etc.), but you still need to point out the L3 and L4 headers.

If my concerns are valid, what would the possible suggestions be?

On 03.09.2014 21:49, Bruce Richardson wrote:
> The l2_len and l3_len fields are used for TX offloads and so should be
> put on the second cache line, along with the other fields only used on
> TX.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
On Thu, Sep 04, 2014 at 11:08:57AM +0600, Yerden Zhumabekov wrote:
> Hello Bruce,
>
> I'm a little bit concerned about performance issues that would arise if
> these fields went to the 2nd cache line.
>
> For example, the l2_len and l3_len fields are used by librte_ip_frag to
> find the L3 and L4 header positions inside the mbuf data. These values
> therefore have to be calculated, by NIC offload or by the user, on the
> RX leg.
>
> Secondly (I wouldn't say on behalf of everyone, but) we use these
> fields in our libraries as well, for classification needs. For
> instance, you may want to support other ethertypes which are not
> supported by NIC offload (MPLS, IPX etc.), but you still need to point
> out the L3 and L4 headers.
>
> If my concerns are valid, what would the possible suggestions be?

Hi Yerden,

I understand your concerns and it's good to have this discussion.

There are a number of reasons why I've moved these particular fields to the second cache line. Firstly, the main reason is that, obviously enough, not all fields will fit in cache line 0, and we need to prioritize what does get stored there. The guiding principle I've chosen for this patch set is to move fields that are not used on the receive path (or, more specifically, the fast-path receive path, so that we can also move fields only used by jumbo frames that span mbufs) to the second cache line. From a search through the existing codebase, there are no drivers which set the l2/l3 length fields on RX; they are only used by the reassembly libraries/apps and by the drivers on TX.

The other reason for moving them to the second cache line is that they logically belong with all the other length fields that we need to add to enable tunneling support. [To get an idea of the extra fields that I propose adding to the mbuf, please see the RFC patchset I sent out previously as "[RFC PATCH 00/14] Extend the mbuf structure".] While we could probably fit the 16 bits needed for the l2/l3 lengths on mbuf cache line 0, there is not enough room for all the lengths, so we would end up splitting them up, with other fields in between.

So, in terms of what to do about this particular issue: I would hope that for applications which use these fields the impact should be small and/or possible to work around, e.g. by prefetching the second cache line on RX in the driver. If not, then I'm happy to look at withdrawing this particular change and seeing if we can keep the l2/l3 lengths on cache line zero, with the other length fields on cache line 1.

Question: would you consider the IP fragmentation and reassembly example apps in the Intel DPDK releases good examples for testing the impact of this change, or is there some other test you would prefer that I look to do? Can you perhaps test the mbuf patch sets I've upstreamed so far and let me know what regressions, if any, you see in your use-case scenarios?

Regards,
/Bruce

> On 03.09.2014 21:49, Bruce Richardson wrote:
> > The l2_len and l3_len fields are used for TX offloads and so should be
> > put on the second cache line, along with the other fields only used on
> > TX.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> --
> Sincerely,
>
> Yerden Zhumabekov
> STS, ACI
> Astana, KZ
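The workaround Bruce suggests, prefetching the second cache line on RX in the driver, might look roughly like this. It is only a sketch: it uses the plain GCC/Clang builtin rather than DPDK's rte_prefetch0(), and the function name and the assumption of a 64-byte cache line are illustrative, not code from the patch set.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* Prefetch the second cache line of a cache-aligned object early on the
 * RX path, so that by the time l2_len/l3_len are read the line is
 * (hopefully) already in L1. The address is returned purely so the
 * pointer arithmetic is testable; a real driver helper would return void. */
static inline const void *rx_prefetch_tx_fields(const void *mbuf)
{
	const uint8_t *second_line = (const uint8_t *)mbuf + CACHE_LINE_SIZE;
	__builtin_prefetch(second_line, 0 /* read */, 3 /* high temporal locality */);
	return second_line;
}
```

The idea is that the prefetch is issued while the driver still has other per-packet work to do, hiding the extra miss rather than eliminating it.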
I get your point. I've also read through the code of various PMDs and have found no indication of the l2_len/l3_len fields being set on RX either.

As for testing, we'd be happy to test the patchset, but we are currently in the process of building our testing facilities, so we are not ready to provide enough workload for the hardware/software. I was also wondering if anyone has run some tests and can provide some numbers on the matter.

Personally, I don't think the frag/reassembly app is a perfect example for evaluating the 2nd cache line performance penalty. The offsets to the L3 and L4 headers need to be calculated for all TCP/IP traffic, and fragmented traffic is not representative in this case. Maybe it would be better to write an app which calculates these offsets for different sets of mbufs and provides some stats, for example l2fwd/l3fwd plus the additional l2_len and l3_len calculation.

And I'm also figuring out how to rewrite our apps/libs (prefetch etc.) to reflect the future changes in the mbuf, hence my concerns :)

On 04.09.2014 16:27, Bruce Richardson wrote:
> Hi Yerden,
>
> I understand your concerns and it's good to have this discussion.
>
> There are a number of reasons why I've moved these particular fields
> to the second cache line. Firstly, the main reason is that, obviously
> enough, not all fields will fit in cache line 0, and we need to
> prioritize what does get stored there. The guiding principle I've
> chosen for this patch set is to move fields that are not used on the
> receive path (or, more specifically, the fast-path receive path, so
> that we can also move fields only used by jumbo frames that span
> mbufs) to the second cache line. From a search through the existing
> codebase, there are no drivers which set the l2/l3 length fields on
> RX; they are only used by the reassembly libraries/apps and by the
> drivers on TX.
>
> The other reason for moving them to the second cache line is that they
> logically belong with all the other length fields that we need to add
> to enable tunneling support. [To get an idea of the extra fields that
> I propose adding to the mbuf, please see the RFC patchset I sent out
> previously as "[RFC PATCH 00/14] Extend the mbuf structure".] While we
> could probably fit the 16 bits needed for the l2/l3 lengths on mbuf
> cache line 0, there is not enough room for all the lengths, so we
> would end up splitting them up, with other fields in between.
>
> So, in terms of what to do about this particular issue: I would hope
> that for applications which use these fields the impact should be
> small and/or possible to work around, e.g. by prefetching the second
> cache line on RX in the driver. If not, then I'm happy to look at
> withdrawing this particular change and seeing if we can keep the l2/l3
> lengths on cache line zero, with the other length fields on cache
> line 1.
>
> Question: would you consider the IP fragmentation and reassembly
> example apps in the Intel DPDK releases good examples for testing the
> impact of this change, or is there some other test you would prefer
> that I look to do? Can you perhaps test the mbuf patch sets I've
> upstreamed so far and let me know what regressions, if any, you see in
> your use-case scenarios?
>
> Regards,
> /Bruce
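The per-packet work Yerden's proposed benchmark would measure, computing l2_len/l3_len in software when NIC offload does not recognize the protocol, hinges on parsing like the following sketch. The function name, the IPv4-only scope, and the single-VLAN handling are illustrative assumptions, not code from either poster.

```c
#include <stdint.h>
#include <stddef.h>

#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_VLAN 0x8100

/* Compute L2 and L3 header lengths from raw frame bytes, as an app must
 * when the NIC cannot. Handles untagged Ethernet and one 802.1Q tag;
 * IPv4 only. Returns 0 on success, -1 on unsupported/truncated frames. */
static int parse_lengths(const uint8_t *frame, size_t len,
                         uint16_t *l2_len, uint16_t *l3_len)
{
	size_t l2 = 14; /* untagged Ethernet header */
	uint16_t ethertype;

	if (len < l2)
		return -1;
	ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	if (ethertype == ETHERTYPE_VLAN) { /* skip one VLAN tag */
		if (len < l2 + 4)
			return -1;
		ethertype = (uint16_t)((frame[16] << 8) | frame[17]);
		l2 += 4;
	}
	if (ethertype != ETHERTYPE_IPV4 || len < l2 + 20)
		return -1;
	*l2_len = (uint16_t)l2;
	*l3_len = (uint16_t)((frame[l2] & 0x0F) * 4); /* IHL, in 32-bit words */
	return 0;
}
```

Running this for every mbuf in an l2fwd/l3fwd-style loop, with the results stored in fields on cache line 0 versus cache line 1, would isolate the cost being debated here.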
On Thu, Sep 04, 2014 at 05:00:12PM +0600, Yerden Zhumabekov wrote:
> I get your point. I've also read through the code of various PMDs and
> have found no indication of the l2_len/l3_len fields being set on RX
> either.
>
> As for testing, we'd be happy to test the patchset, but we are
> currently in the process of building our testing facilities, so we are
> not ready to provide enough workload for the hardware/software. I was
> also wondering if anyone has run some tests and can provide some
> numbers on the matter.
>
> Personally, I don't think the frag/reassembly app is a perfect example
> for evaluating the 2nd cache line performance penalty. The offsets to
> the L3 and L4 headers need to be calculated for all TCP/IP traffic,
> and fragmented traffic is not representative in this case. Maybe it
> would be better to write an app which calculates these offsets for
> different sets of mbufs and provides some stats, for example
> l2fwd/l3fwd plus the additional l2_len and l3_len calculation.
>
> And I'm also figuring out how to rewrite our apps/libs (prefetch etc.)
> to reflect the future changes in the mbuf, hence my concerns :)

Just a final point on this. Note that the second cache line is always read by the TX leg of the code in order to free mbufs back to their mbuf pool post-transmit. The overall fast-path RX+TX benchmarks show no performance degradation due to that access.

As for sample apps, you make a good point indeed about the existing apps not being very useful, as they work on larger packets. I'll see what I can throw together here to make a more realistic test.

/Bruce

> On 04.09.2014 16:27, Bruce Richardson wrote:
> > Hi Yerden,
> >
> > I understand your concerns and it's good to have this discussion.
> >
> > There are a number of reasons why I've moved these particular fields
> > to the second cache line. Firstly, the main reason is that, obviously
> > enough, not all fields will fit in cache line 0, and we need to
> > prioritize what does get stored there. The guiding principle I've
> > chosen for this patch set is to move fields that are not used on the
> > receive path (or, more specifically, the fast-path receive path, so
> > that we can also move fields only used by jumbo frames that span
> > mbufs) to the second cache line. From a search through the existing
> > codebase, there are no drivers which set the l2/l3 length fields on
> > RX; they are only used by the reassembly libraries/apps and by the
> > drivers on TX.
> >
> > The other reason for moving them to the second cache line is that
> > they logically belong with all the other length fields that we need
> > to add to enable tunneling support. [To get an idea of the extra
> > fields that I propose adding to the mbuf, please see the RFC patchset
> > I sent out previously as "[RFC PATCH 00/14] Extend the mbuf
> > structure".] While we could probably fit the 16 bits needed for the
> > l2/l3 lengths on mbuf cache line 0, there is not enough room for all
> > the lengths, so we would end up splitting them up, with other fields
> > in between.
> >
> > So, in terms of what to do about this particular issue: I would hope
> > that for applications which use these fields the impact should be
> > small and/or possible to work around, e.g. by prefetching the second
> > cache line on RX in the driver. If not, then I'm happy to look at
> > withdrawing this particular change and seeing if we can keep the
> > l2/l3 lengths on cache line zero, with the other length fields on
> > cache line 1.
> >
> > Question: would you consider the IP fragmentation and reassembly
> > example apps in the Intel DPDK releases good examples for testing the
> > impact of this change, or is there some other test you would prefer
> > that I look to do? Can you perhaps test the mbuf patch sets I've
> > upstreamed so far and let me know what regressions, if any, you see
> > in your use-case scenarios?
> >
> > Regards,
> > /Bruce
>
> --
> Sincerely,
>
> Yerden Zhumabekov
> STS, ACI
> Astana, KZ
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index db079ac..d3c1745 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -159,8 +159,7 @@ struct rte_mbuf {
 	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
-	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
-	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
+	uint16_t reserved;
 	uint16_t vlan_tci;      /**< VLAN Tag Control Identifier (CPU order). */
 	union {
 		uint32_t rss;   /**< RSS hash result if RSS enabled */
@@ -176,6 +175,9 @@ struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */

+	/* fields to support TX offloads */
+	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
+	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
 } __rte_cache_aligned;

 #define RTE_MBUF_METADATA_UINT8(mbuf, offset) \
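The diff packs both lengths into a single 16-bit slot using bitfields: 9 bits allow an L3 header of up to 511 bytes and 7 bits an L2 header of up to 127 bytes. A standalone sketch of that packing (the struct name here is illustrative, not from the patch):

```c
#include <stdint.h>

/* Same bitfield split as the patch: both lengths share one uint16_t,
 * which is why moving them moves only two bytes between cache lines. */
struct tx_lens {
	uint16_t l3_len:9; /* L3 (IP) header length, 0..511 */
	uint16_t l2_len:7; /* L2 (MAC) header length, 0..127 */
};
```

On typical compilers the two bitfields occupy exactly two bytes, so keeping or moving them costs the same 16 bits wherever they live.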