[dpdk-dev,v2,1/6] mbuf: add buffer offset field for flexible indirection

Message ID 20180402185008.13073-2-yskoh@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Yongseok Koh April 2, 2018, 6:50 p.m. UTC
  When attaching a mbuf, indirect mbuf has to point to start of buffer of
direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
flexible. Indirect mbuf can point to any part of direct mbuf by calling
rte_pktmbuf_attach_at().

Possible use-cases could be:
- If a packet has multiple layers of encapsulation, multiple indirect
  buffers can reference different layers of the encapsulated packet.
- A large direct mbuf can even contain multiple packets in series and
  each packet can be referenced by multiple mbuf indirections.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 158 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 157 insertions(+), 1 deletion(-)
  

Comments

Olivier Matz April 3, 2018, 8:26 a.m. UTC | #1
Hi,

On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> When attaching a mbuf, indirect mbuf has to point to start of buffer of
> direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> flexible. Indirect mbuf can point to any part of direct mbuf by calling
> rte_pktmbuf_attach_at().
> 
> Possible use-cases could be:
> - If a packet has multiple layers of encapsulation, multiple indirect
>   buffers can reference different layers of the encapsulated packet.
> - A large direct mbuf can even contain multiple packets in series and
>   each packet can be referenced by multiple mbuf indirections.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

I think the current API is already able to do what you want.

1/ Here is a mbuf m with its data

               off
               <-->
                      len
          +----+   <---------->
          |    |
        +-|----v----------------------+
        | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
        |      -----------------------|
        +-----------------------------+


2/ clone m:

  c = rte_pktmbuf_alloc(pool);
  rte_pktmbuf_attach(c, m);

  Note that c has its own offset and length fields.


               off
               <-->
                      len
          +----+   <---------->
          |    |
        +-|----v----------------------+
        | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
        |      -----------------------|
        +------^----------------------+
               |
          +----+
indirect  |
        +-|---------------------------+
        | |    -----------------------|
c       | buf  |                     ||
        |      -----------------------|
        +-----------------------------+

                off    len
                <--><---------->


3/ remove some data from c without changing m

   rte_pktmbuf_adj(c, 10)   // at head
   rte_pktmbuf_trim(c, 10)  // at tail


Please let me know if it fits your needs.

Regards,
Olivier
  
Yongseok Koh April 4, 2018, 12:12 a.m. UTC | #2
On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> Hi,
> 
> On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > rte_pktmbuf_attach_at().
> > 
> > Possible use-cases could be:
> > - If a packet has multiple layers of encapsulation, multiple indirect
> >   buffers can reference different layers of the encapsulated packet.
> > - A large direct mbuf can even contain multiple packets in series and
> >   each packet can be referenced by multiple mbuf indirections.
> > 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> 
> I think the current API is already able to do what you want.
> 
> 1/ Here is a mbuf m with its data
> 
>                off
>                <-->
>                       len
>           +----+   <---------->
>           |    |
>         +-|----v----------------------+
>         | |    -----------------------|
> m       | buf  |    XXXXXXXXXXX      ||
>         |      -----------------------|
>         +-----------------------------+
> 
> 
> 2/ clone m:
> 
>   c = rte_pktmbuf_alloc(pool);
>   rte_pktmbuf_attach(c, m);
> 
>   Note that c has its own offset and length fields.
> 
> 
>                off
>                <-->
>                       len
>           +----+   <---------->
>           |    |
>         +-|----v----------------------+
>         | |    -----------------------|
> m       | buf  |    XXXXXXXXXXX      ||
>         |      -----------------------|
>         +------^----------------------+
>                |
>           +----+
> indirect  |
>         +-|---------------------------+
>         | |    -----------------------|
> c       | buf  |                     ||
>         |      -----------------------|
>         +-----------------------------+
> 
>                 off    len
>                 <--><---------->
> 
> 
> 3/ remove some data from c without changing m
> 
>    rte_pktmbuf_adj(c, 10)   // at head
>    rte_pktmbuf_trim(c, 10)  // at tail
> 
> 
> Please let me know if it fits your needs.

No, it doesn't.

Trimming head and tail with the current APIs removes data and make the space
available. Adjusting packet head means giving more headroom, not shifting the
buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
difference offsets in m,

rte_pktmbuf_adj(c1, 10);
rte_pktmbuf_adj(c2, 20);

then the owner of c2 regard the first (off+20)B as available headroom. If it
wants to attach outer header, it will overwrite the headroom even though the
owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
header should be linked by h1->next = c2.

If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
which actually shrink the headroom, this case can be properly handled.

And another use-case (this is my actual use-case) is to make a large mbuf have
multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
because it transfers multiple packets to a single large buffer to reduce PCIe
overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
indirect referencing.

Does this make sense?


Thanks,
Yongseok
  
Olivier Matz April 9, 2018, 4:04 p.m. UTC | #3
Hi Yongseok,

On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > Hi,
> > 
> > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > rte_pktmbuf_attach_at().
> > > 
> > > Possible use-cases could be:
> > > - If a packet has multiple layers of encapsulation, multiple indirect
> > >   buffers can reference different layers of the encapsulated packet.
> > > - A large direct mbuf can even contain multiple packets in series and
> > >   each packet can be referenced by multiple mbuf indirections.
> > > 
> > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > 
> > I think the current API is already able to do what you want.
> > 
> > 1/ Here is a mbuf m with its data
> > 
> >                off
> >                <-->
> >                       len
> >           +----+   <---------->
> >           |    |
> >         +-|----v----------------------+
> >         | |    -----------------------|
> > m       | buf  |    XXXXXXXXXXX      ||
> >         |      -----------------------|
> >         +-----------------------------+
> > 
> > 
> > 2/ clone m:
> > 
> >   c = rte_pktmbuf_alloc(pool);
> >   rte_pktmbuf_attach(c, m);
> > 
> >   Note that c has its own offset and length fields.
> > 
> > 
> >                off
> >                <-->
> >                       len
> >           +----+   <---------->
> >           |    |
> >         +-|----v----------------------+
> >         | |    -----------------------|
> > m       | buf  |    XXXXXXXXXXX      ||
> >         |      -----------------------|
> >         +------^----------------------+
> >                |
> >           +----+
> > indirect  |
> >         +-|---------------------------+
> >         | |    -----------------------|
> > c       | buf  |                     ||
> >         |      -----------------------|
> >         +-----------------------------+
> > 
> >                 off    len
> >                 <--><---------->
> > 
> > 
> > 3/ remove some data from c without changing m
> > 
> >    rte_pktmbuf_adj(c, 10)   // at head
> >    rte_pktmbuf_trim(c, 10)  // at tail
> > 
> > 
> > Please let me know if it fits your needs.
> 
> No, it doesn't.
> 
> Trimming head and tail with the current APIs removes data and make the space
> available. Adjusting packet head means giving more headroom, not shifting the
> buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> difference offsets in m,
> 
> rte_pktmbuf_adj(c1, 10);
> rte_pktmbuf_adj(c2, 20);
> 
> then the owner of c2 regard the first (off+20)B as available headroom. If it
> wants to attach outer header, it will overwrite the headroom even though the
> owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> header should be linked by h1->next = c2.

Yes, after these operations c1, c2 and m should become read-only. So, to
prepend headers, another mbuf has to be inserted before as you suggest. It
is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
length) that will:
  - alloc and attach indirect mbuf for each segment of m that is
    in the range [offset : length+offset].
  - prepend an empty and writable mbuf for the headers

> If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> which actually shrink the headroom, this case can be properly handled.

What do you mean by properly handled?

Yes, prepending data or adding data in the indirect mbuf won't override
the direct mbuf. But prepending data or adding data in the direct mbuf m
won't be protected.

From an application point of view, indirect mbufs, or direct mbufs that
have refcnt != 1, should be both considered as read-only because they
may share their data. How an application can know if the data is shared
or not?

Maybe we need a flag to differentiate mbufs that are read-only
(something like SHARED_DATA, or simply READONLY). In your case, if my
understanding is correct, you want to have indirect mbufs with RW data.


> And another use-case (this is my actual use-case) is to make a large mbuf have
> multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> because it transfers multiple packets to a single large buffer to reduce PCIe
> overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> indirect referencing.
> 
> Does this make sense?

I understand the need.

Another option would be to make the mbuf->buffer point to an external
buffer (not inside the direct mbuf). This would require to add a
mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
a quick overview.

[1] https://dpdksummit.com/Archive/pdf/2016Userspace/Day01-Session05-OlivierMatz-Userspace2016.pdf

The advantage is that it does not require the large data to be inside a
mbuf (requiring a mbuf structure before the buffer, and requiring to be
allocated from a mempool). On the other hand, it is maybe more complex
to implement compared to your solution.


Regards,
Olivier
  
Yongseok Koh April 10, 2018, 1:59 a.m. UTC | #4
On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> Hi Yongseok,
> 
> On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > Hi,
> > > 
> > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > rte_pktmbuf_attach_at().
> > > > 
> > > > Possible use-cases could be:
> > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > >   buffers can reference different layers of the encapsulated packet.
> > > > - A large direct mbuf can even contain multiple packets in series and
> > > >   each packet can be referenced by multiple mbuf indirections.
> > > > 
> > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > 
> > > I think the current API is already able to do what you want.
> > > 
> > > 1/ Here is a mbuf m with its data
> > > 
> > >                off
> > >                <-->
> > >                       len
> > >           +----+   <---------->
> > >           |    |
> > >         +-|----v----------------------+
> > >         | |    -----------------------|
> > > m       | buf  |    XXXXXXXXXXX      ||
> > >         |      -----------------------|
> > >         +-----------------------------+
> > > 
> > > 
> > > 2/ clone m:
> > > 
> > >   c = rte_pktmbuf_alloc(pool);
> > >   rte_pktmbuf_attach(c, m);
> > > 
> > >   Note that c has its own offset and length fields.
> > > 
> > > 
> > >                off
> > >                <-->
> > >                       len
> > >           +----+   <---------->
> > >           |    |
> > >         +-|----v----------------------+
> > >         | |    -----------------------|
> > > m       | buf  |    XXXXXXXXXXX      ||
> > >         |      -----------------------|
> > >         +------^----------------------+
> > >                |
> > >           +----+
> > > indirect  |
> > >         +-|---------------------------+
> > >         | |    -----------------------|
> > > c       | buf  |                     ||
> > >         |      -----------------------|
> > >         +-----------------------------+
> > > 
> > >                 off    len
> > >                 <--><---------->
> > > 
> > > 
> > > 3/ remove some data from c without changing m
> > > 
> > >    rte_pktmbuf_adj(c, 10)   // at head
> > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > 
> > > 
> > > Please let me know if it fits your needs.
> > 
> > No, it doesn't.
> > 
> > Trimming head and tail with the current APIs removes data and make the space
> > available. Adjusting packet head means giving more headroom, not shifting the
> > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > difference offsets in m,
> > 
> > rte_pktmbuf_adj(c1, 10);
> > rte_pktmbuf_adj(c2, 20);
> > 
> > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > wants to attach outer header, it will overwrite the headroom even though the
> > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > header should be linked by h1->next = c2.
> 
> Yes, after these operations c1, c2 and m should become read-only. So, to
> prepend headers, another mbuf has to be inserted before as you suggest. It
> is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> length) that will:
>   - alloc and attach indirect mbuf for each segment of m that is
>     in the range [offset : length+offset].
>   - prepend an empty and writable mbuf for the headers
> 
> > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > which actually shrink the headroom, this case can be properly handled.
> 
> What do you mean by properly handled?
> 
> Yes, prepending data or adding data in the indirect mbuf won't override
> the direct mbuf. But prepending data or adding data in the direct mbuf m
> won't be protected.
> 
> From an application point of view, indirect mbufs, or direct mbufs that
> have refcnt != 1, should be both considered as read-only because they
> may share their data. How an application can know if the data is shared
> or not?
> 
> Maybe we need a flag to differentiate mbufs that are read-only
> (something like SHARED_DATA, or simply READONLY). In your case, if my
> understanding is correct, you want to have indirect mbufs with RW data.

Agree that indirect mbuf must be treated as read-only, Then the current code is
enough to handle that use-case.

> > And another use-case (this is my actual use-case) is to make a large mbuf have
> > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > because it transfers multiple packets to a single large buffer to reduce PCIe
> > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > indirect referencing.
> > 
> > Does this make sense?
> 
> I understand the need.
> 
> Another option would be to make the mbuf->buffer point to an external
> buffer (not inside the direct mbuf). This would require to add a
> mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> a quick overview.
> 
> [1] https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01-Session05-OlivierMatz-Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> 
> The advantage is that it does not require the large data to be inside a
> mbuf (requiring a mbuf structure before the buffer, and requiring to be
> allocated from a mempool). On the other hand, it is maybe more complex
> to implement compared to your solution.

I knew that you presented the slides and frankly, I had considered that option
at first. But even with that option, metadata to store refcnt should also be
allocated and managed anyway. Kernel also maintains the skb_shared_info at the
end of the data segment. Even though it could have smaller metadata structure,
I just wanted to make full use of the existing framework because it is less
complex as you mentioned. Given that you presented the idea of external data
buffer in 2016 and there hasn't been many follow-up discussions/activities so
far, I thought the demand isn't so big yet thus I wanted to make this patch
simpler.  I personally think that we can take the idea of external data seg when
more demands come from users in the future as it would be a huge change and may
break current ABI/API. When the day comes, I'll gladly participate in the
discussions and write codes for it if I can be helpful.

Do you think this patch is okay for now?


Thanks for your comments,
Yongseok
  
Ananyev, Konstantin April 11, 2018, 12:25 a.m. UTC | #5
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yongseok Koh
> Sent: Tuesday, April 10, 2018 2:59 AM
> To: Olivier Matz <olivier.matz@6wind.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; adrien.mazarguil@6wind.com;
> nelio.laranjeiro@6wind.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
> 
> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > Hi Yongseok,
> >
> > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > Hi,
> > > >
> > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > rte_pktmbuf_attach_at().
> > > > >
> > > > > Possible use-cases could be:
> > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > >
> > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > >
> > > > I think the current API is already able to do what you want.
> > > >
> > > > 1/ Here is a mbuf m with its data
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >
> > > > 2/ clone m:
> > > >
> > > >   c = rte_pktmbuf_alloc(pool);
> > > >   rte_pktmbuf_attach(c, m);
> > > >
> > > >   Note that c has its own offset and length fields.
> > > >
> > > >
> > > >                off
> > > >                <-->
> > > >                       len
> > > >           +----+   <---------->
> > > >           |    |
> > > >         +-|----v----------------------+
> > > >         | |    -----------------------|
> > > > m       | buf  |    XXXXXXXXXXX      ||
> > > >         |      -----------------------|
> > > >         +------^----------------------+
> > > >                |
> > > >           +----+
> > > > indirect  |
> > > >         +-|---------------------------+
> > > >         | |    -----------------------|
> > > > c       | buf  |                     ||
> > > >         |      -----------------------|
> > > >         +-----------------------------+
> > > >
> > > >                 off    len
> > > >                 <--><---------->
> > > >
> > > >
> > > > 3/ remove some data from c without changing m
> > > >
> > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > >
> > > >
> > > > Please let me know if it fits your needs.
> > >
> > > No, it doesn't.
> > >
> > > Trimming head and tail with the current APIs removes data and make the space
> > > available. Adjusting packet head means giving more headroom, not shifting the
> > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > difference offsets in m,
> > >
> > > rte_pktmbuf_adj(c1, 10);
> > > rte_pktmbuf_adj(c2, 20);
> > >
> > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > wants to attach outer header, it will overwrite the headroom even though the
> > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > header should be linked by h1->next = c2.
> >
> > Yes, after these operations c1, c2 and m should become read-only. So, to
> > prepend headers, another mbuf has to be inserted before as you suggest. It
> > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > length) that will:
> >   - alloc and attach indirect mbuf for each segment of m that is
> >     in the range [offset : length+offset].
> >   - prepend an empty and writable mbuf for the headers
> >
> > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > which actually shrink the headroom, this case can be properly handled.
> >
> > What do you mean by properly handled?
> >
> > Yes, prepending data or adding data in the indirect mbuf won't override
> > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > won't be protected.
> >
> > From an application point of view, indirect mbufs, or direct mbufs that
> > have refcnt != 1, should be both considered as read-only because they
> > may share their data. How an application can know if the data is shared
> > or not?
> >
> > Maybe we need a flag to differentiate mbufs that are read-only
> > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > understanding is correct, you want to have indirect mbufs with RW data.
> 
> Agree that indirect mbuf must be treated as read-only, Then the current code is
> enough to handle that use-case.
> 
> > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > indirect referencing.

But just to make HW to RX multiple packets into one mbuf,
data_off inside indirect mbuf should be enough, correct?
As I understand, what you'd like to achieve with this new field -
ability to manipulate packet boundaries after RX, probably at upper layer.
As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
indirect mbufs trying to modify same direct buffer.
Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
Fields for indirect mbufs, straight after attach()?
Konstantin

> > >
> > > Does this make sense?
> >
> > I understand the need.
> >
> > Another option would be to make the mbuf->buffer point to an external
> > buffer (not inside the direct mbuf). This would require to add a
> > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > a quick overview.
> >
> > [1]
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> -Session05-OlivierMatz-
> Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> >
> > The advantage is that it does not require the large data to be inside a
> > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > allocated from a mempool). On the other hand, it is maybe more complex
> > to implement compared to your solution.
> 
> I knew that you presented the slides and frankly, I had considered that option
> at first. But even with that option, metadata to store refcnt should also be
> allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> end of the data segment. Even though it could have smaller metadata structure,
> I just wanted to make full use of the existing framework because it is less
> complex as you mentioned. Given that you presented the idea of external data
> buffer in 2016 and there hasn't been many follow-up discussions/activities so
> far, I thought the demand isn't so big yet thus I wanted to make this patch
> simpler.  I personally think that we can take the idea of external data seg when
> more demands come from users in the future as it would be a huge change and may
> break current ABI/API. When the day comes, I'll gladly participate in the
> discussions and write codes for it if I can be helpful.
> 
> Do you think this patch is okay for now?
> 
> 
> Thanks for your comments,
> Yongseok
  
Yongseok Koh April 11, 2018, 5:33 a.m. UTC | #6
On Tue, Apr 10, 2018 at 05:25:31PM -0700, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yongseok Koh
> > Sent: Tuesday, April 10, 2018 2:59 AM
> > To: Olivier Matz <olivier.matz@6wind.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; adrien.mazarguil@6wind.com;
> > nelio.laranjeiro@6wind.com; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection
> > 
> > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > Hi Yongseok,
> > >
> > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > Hi,
> > > > >
> > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > rte_pktmbuf_attach_at().
> > > > > >
> > > > > > Possible use-cases could be:
> > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > >
> > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > >
> > > > > I think the current API is already able to do what you want.
> > > > >
> > > > > 1/ Here is a mbuf m with its data
> > > > >
> > > > >                off
> > > > >                <-->
> > > > >                       len
> > > > >           +----+   <---------->
> > > > >           |    |
> > > > >         +-|----v----------------------+
> > > > >         | |    -----------------------|
> > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > >         |      -----------------------|
> > > > >         +-----------------------------+
> > > > >
> > > > >
> > > > > 2/ clone m:
> > > > >
> > > > >   c = rte_pktmbuf_alloc(pool);
> > > > >   rte_pktmbuf_attach(c, m);
> > > > >
> > > > >   Note that c has its own offset and length fields.
> > > > >
> > > > >
> > > > >                off
> > > > >                <-->
> > > > >                       len
> > > > >           +----+   <---------->
> > > > >           |    |
> > > > >         +-|----v----------------------+
> > > > >         | |    -----------------------|
> > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > >         |      -----------------------|
> > > > >         +------^----------------------+
> > > > >                |
> > > > >           +----+
> > > > > indirect  |
> > > > >         +-|---------------------------+
> > > > >         | |    -----------------------|
> > > > > c       | buf  |                     ||
> > > > >         |      -----------------------|
> > > > >         +-----------------------------+
> > > > >
> > > > >                 off    len
> > > > >                 <--><---------->
> > > > >
> > > > >
> > > > > 3/ remove some data from c without changing m
> > > > >
> > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > >
> > > > >
> > > > > Please let me know if it fits your needs.
> > > >
> > > > No, it doesn't.
> > > >
> > > > Trimming head and tail with the current APIs removes data and make the space
> > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > difference offsets in m,
> > > >
> > > > rte_pktmbuf_adj(c1, 10);
> > > > rte_pktmbuf_adj(c2, 20);
> > > >
> > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > header should be linked by h1->next = c2.
> > >
> > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > length) that will:
> > >   - alloc and attach indirect mbuf for each segment of m that is
> > >     in the range [offset : length+offset].
> > >   - prepend an empty and writable mbuf for the headers
> > >
> > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > which actually shrink the headroom, this case can be properly handled.
> > >
> > > What do you mean by properly handled?
> > >
> > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > won't be protected.
> > >
> > > From an application point of view, indirect mbufs, or direct mbufs that
> > > have refcnt != 1, should be both considered as read-only because they
> > > may share their data. How an application can know if the data is shared
> > > or not?
> > >
> > > Maybe we need a flag to differentiate mbufs that are read-only
> > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > understanding is correct, you want to have indirect mbufs with RW data.
> > 
> > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > enough to handle that use-case.
> > 
> > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > indirect referencing.
> 
> But just to make HW to RX multiple packets into one mbuf,
> data_off inside indirect mbuf should be enough, correct?
Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
to saturate the network link.

> As I understand, what you'd like to achieve with this new field -
> ability to manipulate packet boundaries after RX, probably at upper layer.
> As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> indirect mbufs trying to modify same direct buffer.

I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
is read-only. What that means, all the entities which own such mbufs have to be
aware of that and keep the principle as DPDK can't enforce the rule and there
can't be such sanity check. In this sense, HW doesn't violate it because the
direct mbuf is injected to HW before indirection. When packets are written by
HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
application layer with freeing the original direct mbuf (decrement refcnt by 1).
So, HW doesn't touch the direct buffer once it reaches to upper layer. The
direct buffer will be freed and get available for reuse when all the attached
indirect mbufs are freed.

> Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> Fields for indirect mbufs, straight after attach()?

Good point.
Actually that was my draft (Mellanox internal) version of this patch :-) But I
had to consider a case where priv_size is really given by user. Even though it
is less likely, but if original priv_size is quite big, it can't cover entire
buf_len. For this, I had to increase priv_size to 32-bit but adding another
16bit field (buf_off) looked more plausible.

Thanks for good comments,
Yongseok

> > > >
> > > > Does this make sense?
> > >
> > > I understand the need.
> > >
> > > Another option would be to make the mbuf->buffer point to an external
> > > buffer (not inside the direct mbuf). This would require to add a
> > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > a quick overview.
> > >
> > > [1]
> > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > -Session05-OlivierMatz-
> > Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > >
> > > The advantage is that it does not require the large data to be inside a
> > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > allocated from a mempool). On the other hand, it is maybe more complex
> > > to implement compared to your solution.
> > 
> > I knew that you presented the slides and frankly, I had considered that option
> > at first. But even with that option, metadata to store refcnt should also be
> > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > end of the data segment. Even though it could have smaller metadata structure,
> > I just wanted to make full use of the existing framework because it is less
> > complex as you mentioned. Given that you presented the idea of external data
> > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > simpler.  I personally think that we can take the idea of external data seg when
> > more demands come from users in the future as it would be a huge change and may
> > break current ABI/API. When the day comes, I'll gladly participate in the
> > discussions and write codes for it if I can be helpful.
> > 
> > Do you think this patch is okay for now?
> > 
> > 
> > Thanks for your comments,
> > Yongseok
  
Ananyev, Konstantin April 11, 2018, 11:39 a.m. UTC | #7
Hi Yongseok,

> > >
> > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > Hi Yongseok,
> > > >
> > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > rte_pktmbuf_attach_at().
> > > > > > >
> > > > > > > Possible use-cases could be:
> > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > >
> > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > >
> > > > > > I think the current API is already able to do what you want.
> > > > > >
> > > > > > 1/ Here is a mbuf m with its data
> > > > > >
> > > > > >                off
> > > > > >                <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >
> > > > > > 2/ clone m:
> > > > > >
> > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > >   rte_pktmbuf_attach(c, m);
> > > > > >
> > > > > >   Note that c has its own offset and length fields.
> > > > > >
> > > > > >
> > > > > >                off
> > > > > >                <-->
> > > > > >                       len
> > > > > >           +----+   <---------->
> > > > > >           |    |
> > > > > >         +-|----v----------------------+
> > > > > >         | |    -----------------------|
> > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > >         |      -----------------------|
> > > > > >         +------^----------------------+
> > > > > >                |
> > > > > >           +----+
> > > > > > indirect  |
> > > > > >         +-|---------------------------+
> > > > > >         | |    -----------------------|
> > > > > > c       | buf  |                     ||
> > > > > >         |      -----------------------|
> > > > > >         +-----------------------------+
> > > > > >
> > > > > >                 off    len
> > > > > >                 <--><---------->
> > > > > >
> > > > > >
> > > > > > 3/ remove some data from c without changing m
> > > > > >
> > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > >
> > > > > >
> > > > > > Please let me know if it fits your needs.
> > > > >
> > > > > No, it doesn't.
> > > > >
> > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > difference offsets in m,
> > > > >
> > > > > rte_pktmbuf_adj(c1, 10);
> > > > > rte_pktmbuf_adj(c2, 20);
> > > > >
> > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > header should be linked by h1->next = c2.
> > > >
> > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > length) that will:
> > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > >     in the range [offset : length+offset].
> > > >   - prepend an empty and writable mbuf for the headers
> > > >
> > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > which actually shrink the headroom, this case can be properly handled.
> > > >
> > > > What do you mean by properly handled?
> > > >
> > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > won't be protected.
> > > >
> > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > have refcnt != 1, should be both considered as read-only because they
> > > > may share their data. How an application can know if the data is shared
> > > > or not?
> > > >
> > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > understanding is correct, you want to have indirect mbufs with RW data.
> > >
> > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > enough to handle that use-case.
> > >
> > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > indirect referencing.
> >
> > But just to make HW to RX multiple packets into one mbuf,
> > data_off inside indirect mbuf should be enough, correct?
> Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> to saturate the network link.

There were few complains that 64KB max is a limitation for some use-cases.
I am not against increasing it, but I don't think we have free space on first cache-line for that
without another big rework of mbuf layout. 
Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too. 

> 
> > As I understand, what you'd like to achieve with this new field -
> > ability to manipulate packet boundaries after RX, probably at upper layer.
> > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > indirect mbufs trying to modify same direct buffer.
> 
> I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> is read-only. What that means, all the entities which own such mbufs have to be
> aware of that and keep the principle as DPDK can't enforce the rule and there
> can't be such sanity check. In this sense, HW doesn't violate it because the
> direct mbuf is injected to HW before indirection. When packets are written by
> HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> application layer with freeing the original direct mbuf (decrement refcnt by 1).
> So, HW doesn't touch the direct buffer once it reaches to upper layer.

Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
which implies that it should be possible by some entity (upper layer?) to manipulate these
indirect mbufs.
And we don't know how exactly it will be done.

> The direct buffer will be freed and get available for reuse when all the attached
> indirect mbufs are freed.
> 
> > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > Fields for indirect mbufs, straight after attach()?
> 
> Good point.
> Actually that was my draft (Mellanox internal) version of this patch :-) But I
> had to consider a case where priv_size is really given by user. Even though it
> is less likely, but if original priv_size is quite big, it can't cover entire
> buf_len. For this, I had to increase priv_size to 32-bit but adding another
> 16bit field (buf_off) looked more plausible.

As I remember, we can't have mbufs bigger then 64K,
so priv_size + buf_len should be always less than 64K, correct?
Konstantin  

> 
> Thanks for good comments,
> Yongseok
> 
> > > > >
> > > > > Does this make sense?
> > > >
> > > > I understand the need.
> > > >
> > > > Another option would be to make the mbuf->buffer point to an external
> > > > buffer (not inside the direct mbuf). This would require to add a
> > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > a quick overview.
> > > >
> > > > [1]
> > >
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > > -Session05-OlivierMatz-
> > >
> Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > > >
> > > > The advantage is that it does not require the large data to be inside a
> > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > to implement compared to your solution.
> > >
> > > I knew that you presented the slides and frankly, I had considered that option
> > > at first. But even with that option, metadata to store refcnt should also be
> > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > end of the data segment. Even though it could have smaller metadata structure,
> > > I just wanted to make full use of the existing framework because it is less
> > > complex as you mentioned. Given that you presented the idea of external data
> > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > simpler.  I personally think that we can take the idea of external data seg when
> > > more demands come from users in the future as it would be a huge change and may
> > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > discussions and write codes for it if I can be helpful.
> > >
> > > Do you think this patch is okay for now?
> > >
> > >
> > > Thanks for your comments,
> > > Yongseok
  
Andrew Rybchenko April 11, 2018, 2:02 p.m. UTC | #8
On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:
> Hi Yongseok,
>
>>>> On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
>>>>> Hi Yongseok,
>>>>>
>>>>> On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
>>>>>> On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
>>>>>>>> When attaching a mbuf, indirect mbuf has to point to start of buffer of
>>>>>>>> direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
>>>>>>>> flexible. Indirect mbuf can point to any part of direct mbuf by calling
>>>>>>>> rte_pktmbuf_attach_at().
>>>>>>>>
>>>>>>>> Possible use-cases could be:
>>>>>>>> - If a packet has multiple layers of encapsulation, multiple indirect
>>>>>>>>    buffers can reference different layers of the encapsulated packet.
>>>>>>>> - A large direct mbuf can even contain multiple packets in series and
>>>>>>>>    each packet can be referenced by multiple mbuf indirections.
>>>>>>>>
>>>>>>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>>>>>> I think the current API is already able to do what you want.
>>>>>>>
>>>>>>> 1/ Here is a mbuf m with its data
>>>>>>>
>>>>>>>                 off
>>>>>>>                 <-->
>>>>>>>                        len
>>>>>>>            +----+   <---------->
>>>>>>>            |    |
>>>>>>>          +-|----v----------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> m       | buf  |    XXXXXXXXXXX      ||
>>>>>>>          |      -----------------------|
>>>>>>>          +-----------------------------+
>>>>>>>
>>>>>>>
>>>>>>> 2/ clone m:
>>>>>>>
>>>>>>>    c = rte_pktmbuf_alloc(pool);
>>>>>>>    rte_pktmbuf_attach(c, m);
>>>>>>>
>>>>>>>    Note that c has its own offset and length fields.
>>>>>>>
>>>>>>>
>>>>>>>                 off
>>>>>>>                 <-->
>>>>>>>                        len
>>>>>>>            +----+   <---------->
>>>>>>>            |    |
>>>>>>>          +-|----v----------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> m       | buf  |    XXXXXXXXXXX      ||
>>>>>>>          |      -----------------------|
>>>>>>>          +------^----------------------+
>>>>>>>                 |
>>>>>>>            +----+
>>>>>>> indirect  |
>>>>>>>          +-|---------------------------+
>>>>>>>          | |    -----------------------|
>>>>>>> c       | buf  |                     ||
>>>>>>>          |      -----------------------|
>>>>>>>          +-----------------------------+
>>>>>>>
>>>>>>>                  off    len
>>>>>>>                  <--><---------->
>>>>>>>
>>>>>>>
>>>>>>> 3/ remove some data from c without changing m
>>>>>>>
>>>>>>>     rte_pktmbuf_adj(c, 10)   // at head
>>>>>>>     rte_pktmbuf_trim(c, 10)  // at tail
>>>>>>>
>>>>>>>
>>>>>>> Please let me know if it fits your needs.
>>>>>> No, it doesn't.
>>>>>>
>>>>>> Trimming head and tail with the current APIs removes data and make the space
>>>>>> available. Adjusting packet head means giving more headroom, not shifting the
>>>>>> buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
>>>>>> difference offsets in m,
>>>>>>
>>>>>> rte_pktmbuf_adj(c1, 10);
>>>>>> rte_pktmbuf_adj(c2, 20);
>>>>>>
>>>>>> then the owner of c2 regard the first (off+20)B as available headroom. If it
>>>>>> wants to attach outer header, it will overwrite the headroom even though the
>>>>>> owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
>>>>>> header should be linked by h1->next = c2.
>>>>> Yes, after these operations c1, c2 and m should become read-only. So, to
>>>>> prepend headers, another mbuf has to be inserted before as you suggest. It
>>>>> is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
>>>>> length) that will:
>>>>>    - alloc and attach indirect mbuf for each segment of m that is
>>>>>      in the range [offset : length+offset].
>>>>>    - prepend an empty and writable mbuf for the headers
>>>>>
>>>>>> If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
>>>>>> which actually shrink the headroom, this case can be properly handled.
>>>>> What do you mean by properly handled?
>>>>>
>>>>> Yes, prepending data or adding data in the indirect mbuf won't override
>>>>> the direct mbuf. But prepending data or adding data in the direct mbuf m
>>>>> won't be protected.
>>>>>
>>>>>  From an application point of view, indirect mbufs, or direct mbufs that
>>>>> have refcnt != 1, should be both considered as read-only because they
>>>>> may share their data. How an application can know if the data is shared
>>>>> or not?
>>>>>
>>>>> Maybe we need a flag to differentiate mbufs that are read-only
>>>>> (something like SHARED_DATA, or simply READONLY). In your case, if my
>>>>> understanding is correct, you want to have indirect mbufs with RW data.
>>>> Agree that indirect mbuf must be treated as read-only, Then the current code is
>>>> enough to handle that use-case.
>>>>
>>>>>> And another use-case (this is my actual use-case) is to make a large mbuf have
>>>>>> multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
>>>>>> because it transfers multiple packets to a single large buffer to reduce PCIe
>>>>>> overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
>>>>>> Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
>>>>>> indirect referencing.
>>> But just to make HW to RX multiple packets into one mbuf,
>>> data_off inside indirect mbuf should be enough, correct?
>> Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
>> to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
>> a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
>> better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
>> to saturate the network link.
> There were few complains that 64KB max is a limitation for some use-cases.
> I am not against increasing it, but I don't think we have free space on first cache-line for that
> without another big rework of mbuf layout.
> Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
>
>>> As I understand, what you'd like to achieve with this new field -
>>> ability to manipulate packet boundaries after RX, probably at upper layer.
>>> As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
>>> indirect mbufs trying to modify same direct buffer.
>> I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
>> is read-only. What that means, all the entities which own such mbufs have to be
>> aware of that and keep the principle as DPDK can't enforce the rule and there
>> can't be such sanity check. In this sense, HW doesn't violate it because the
>> direct mbuf is injected to HW before indirection. When packets are written by
>> HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
>> application layer with freeing the original direct mbuf (decrement refcnt by 1).
>> So, HW doesn't touch the direct buffer once it reaches to upper layer.
> Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> which implies that it should be possible by some entity (upper layer?) to manipulate these
> indirect mbufs.
> And we don't know how exactly it will be done.
>
>> The direct buffer will be freed and get available for reuse when all the attached
>> indirect mbufs are freed.
>>
>>> Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
>>> Fields for indirect mbufs, straight after attach()?
>> Good point.
>> Actually that was my draft (Mellanox internal) version of this patch :-) But I
>> had to consider a case where priv_size is really given by user. Even though it
>> is less likely, but if original priv_size is quite big, it can't cover entire
>> buf_len. For this, I had to increase priv_size to 32-bit but adding another
>> 16bit field (buf_off) looked more plausible.
> As I remember, we can't have mbufs bigger then 64K,
> so priv_size + buf_len should be always less than 64K, correct?

It sounds like it is suggested to use/customize priv_size to limit 
indirect mbuf
range in the direct one. It does not work from the box since priv_size is
used to find out direct mbuf by indirect (see rte_mbuf_from_indirect()).

Andrew.
  
Yongseok Koh April 11, 2018, 5:08 p.m. UTC | #9
On Wed, Apr 11, 2018 at 11:39:47AM +0000, Ananyev, Konstantin wrote:
> 
> Hi Yongseok,
> 
> > > >
> > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > Hi Yongseok,
> > > > >
> > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > >
> > > > > > > > Possible use-cases could be:
> > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > >
> > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > >
> > > > > > > I think the current API is already able to do what you want.
> > > > > > >
> > > > > > > 1/ Here is a mbuf m with its data
> > > > > > >
> > > > > > >                off
> > > > > > >                <-->
> > > > > > >                       len
> > > > > > >           +----+   <---------->
> > > > > > >           |    |
> > > > > > >         +-|----v----------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > >         |      -----------------------|
> > > > > > >         +-----------------------------+
> > > > > > >
> > > > > > >
> > > > > > > 2/ clone m:
> > > > > > >
> > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > >
> > > > > > >   Note that c has its own offset and length fields.
> > > > > > >
> > > > > > >
> > > > > > >                off
> > > > > > >                <-->
> > > > > > >                       len
> > > > > > >           +----+   <---------->
> > > > > > >           |    |
> > > > > > >         +-|----v----------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > >         |      -----------------------|
> > > > > > >         +------^----------------------+
> > > > > > >                |
> > > > > > >           +----+
> > > > > > > indirect  |
> > > > > > >         +-|---------------------------+
> > > > > > >         | |    -----------------------|
> > > > > > > c       | buf  |                     ||
> > > > > > >         |      -----------------------|
> > > > > > >         +-----------------------------+
> > > > > > >
> > > > > > >                 off    len
> > > > > > >                 <--><---------->
> > > > > > >
> > > > > > >
> > > > > > > 3/ remove some data from c without changing m
> > > > > > >
> > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > >
> > > > > > >
> > > > > > > Please let me know if it fits your needs.
> > > > > >
> > > > > > No, it doesn't.
> > > > > >
> > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > difference offsets in m,
> > > > > >
> > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > >
> > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > header should be linked by h1->next = c2.
> > > > >
> > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > length) that will:
> > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > >     in the range [offset : length+offset].
> > > > >   - prepend an empty and writable mbuf for the headers
> > > > >
> > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > >
> > > > > What do you mean by properly handled?
> > > > >
> > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > won't be protected.
> > > > >
> > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > may share their data. How an application can know if the data is shared
> > > > > or not?
> > > > >
> > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > >
> > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > enough to handle that use-case.
> > > >
> > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > indirect referencing.
> > >
> > > But just to make HW to RX multiple packets into one mbuf,
> > > data_off inside indirect mbuf should be enough, correct?
> > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > to saturate the network link.
> 
> There were few complains that 64KB max is a limitation for some use-cases.
> I am not against increasing it, but I don't think we have free space on first cache-line for that
> without another big rework of mbuf layout. 
> Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too. 
> 
> > 
> > > As I understand, what you'd like to achieve with this new field -
> > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > indirect mbufs trying to modify same direct buffer.
> > 
> > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > is read-only. What that means, all the entities which own such mbufs have to be
> > aware of that and keep the principle as DPDK can't enforce the rule and there
> > can't be such sanity check. In this sense, HW doesn't violate it because the
> > direct mbuf is injected to HW before indirection. When packets are written by
> > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> 
> Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> which implies that it should be possible by some entity (upper layer?) to manipulate these
> indirect mbufs.
> And we don't know how exactly it will be done.

That's a valid concern. I can make it private by merging into the _attach_to()
func, or I just can add a comment in the API doc. However, if users are aware
that a mbuf is read-only and we expect them to keep it intact by their own
judgement, they would/should not use those APIs. We can't stop them modifying
content or the buffer itself anyway. Will add more comments of this discussion
regarding read-only mode.

> > The direct buffer will be freed and get available for reuse when all the attached
> > indirect mbufs are freed.
> > 
> > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > Fields for indirect mbufs, straight after attach()?
> > 
> > Good point.
> > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > had to consider a case where priv_size is really given by user. Even though it
> > is less likely, but if original priv_size is quite big, it can't cover entire
> > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > 16bit field (buf_off) looked more plausible.
> 
> As I remember, we can't have mbufs bigger then 64K,
> so priv_size + buf_len should be always less than 64K, correct?

Can you let me know where I can find the constraint? I checked
rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to not make any mistake
but there's no such limitation.

	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
		(unsigned)data_room_size;

The max of data_room_size is 64kB, so is priv_size. m->buf_addr starts from 'm +
sizeof(*m) + priv_size' and m->buf_len can't be larger than UINT16_MAX. So,
priv_size couldn't be used for this purpose.

Yongseok

> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > I understand the need.
> > > > >
> > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > a quick overview.
> > > > >
> > > > > [1]
> > > >
> > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > > > -Session05-OlivierMatz-
> > > >
> > Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > > > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > > > >
> > > > > The advantage is that it does not require the large data to be inside a
> > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > to implement compared to your solution.
> > > >
> > > > I knew that you presented the slides and frankly, I had considered that option
> > > > at first. But even with that option, metadata to store refcnt should also be
> > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > I just wanted to make full use of the existing framework because it is less
> > > > complex as you mentioned. Given that you presented the idea of external data
> > > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > more demands come from users in the future as it would be a huge change and may
> > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > discussions and write codes for it if I can be helpful.
> > > >
> > > > Do you think this patch is okay for now?
> > > >
> > > >
> > > > Thanks for your comments,
> > > > Yongseok
  
Yongseok Koh April 11, 2018, 5:18 p.m. UTC | #10
On Wed, Apr 11, 2018 at 05:02:50PM +0300, Andrew Rybchenko wrote:
> On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:
> > Hi Yongseok,
> > 
> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > > 
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > > 
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >    buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >    each packet can be referenced by multiple mbuf indirections.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > > 
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > > 
> > > > > > > >                 off
> > > > > > > >                 <-->
> > > > > > > >                        len
> > > > > > > >            +----+   <---------->
> > > > > > > >            |    |
> > > > > > > >          +-|----v----------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +-----------------------------+
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 2/ clone m:
> > > > > > > > 
> > > > > > > >    c = rte_pktmbuf_alloc(pool);
> > > > > > > >    rte_pktmbuf_attach(c, m);
> > > > > > > > 
> > > > > > > >    Note that c has its own offset and length fields.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >                 off
> > > > > > > >                 <-->
> > > > > > > >                        len
> > > > > > > >            +----+   <---------->
> > > > > > > >            |    |
> > > > > > > >          +-|----v----------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +------^----------------------+
> > > > > > > >                 |
> > > > > > > >            +----+
> > > > > > > > indirect  |
> > > > > > > >          +-|---------------------------+
> > > > > > > >          | |    -----------------------|
> > > > > > > > c       | buf  |                     ||
> > > > > > > >          |      -----------------------|
> > > > > > > >          +-----------------------------+
> > > > > > > > 
> > > > > > > >                  off    len
> > > > > > > >                  <--><---------->
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > > 
> > > > > > > >     rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >     rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Please let me know if it fits your needs.
> > > > > > > No, it doesn't.
> > > > > > > 
> > > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > difference offsets in m,
> > > > > > > 
> > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > > 
> > > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > header should be linked by h1->next = c2.
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > length) that will:
> > > > > >    - alloc and attach indirect mbuf for each segment of m that is
> > > > > >      in the range [offset : length+offset].
> > > > > >    - prepend an empty and writable mbuf for the headers
> > > > > > 
> > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > > What do you mean by properly handled?
> > > > > > 
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > > 
> > > > > >  From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > may share their data. How an application can know if the data is shared
> > > > > > or not?
> > > > > > 
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > > enough to handle that use-case.
> > > > > 
> > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > indirect referencing.
> > > > But just to make HW to RX multiple packets into one mbuf,
> > > > data_off inside indirect mbuf should be enough, correct?
> > > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > to saturate the network link.
> > There were few complains that 64KB max is a limitation for some use-cases.
> > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > without another big rework of mbuf layout.
> > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> > 
> > > > As I understand, what you'd like to achieve with this new field -
> > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > > indirect mbufs trying to modify same direct buffer.
> > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > is read-only. What that means, all the entities which own such mbufs have to be
> > > aware of that and keep the principle as DPDK can't enforce the rule and there
> > > can't be such sanity check. In this sense, HW doesn't violate it because the
> > > direct mbuf is injected to HW before indirection. When packets are written by
> > > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > indirect mbufs.
> > And we don't know how exactly it will be done.
> > 
> > > The direct buffer will be freed and get available for reuse when all the attached
> > > indirect mbufs are freed.
> > > 
> > > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > > Fields for indirect mbufs, straight after attach()?
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > had to consider a case where priv_size is really given by user. Even though it
> > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > 16bit field (buf_off) looked more plausible.
> > As I remember, we can't have mbufs bigger then 64K,
> > so priv_size + buf_len should be always less than 64K, correct?
> 
> It sounds like it is suggested to use/customize priv_size to limit indirect
> mbuf range in the direct one. It does not work from the box since priv_size is
> used to find out direct mbuf by indirect (see rte_mbuf_from_indirect()).

?? That's exactly why he suggested to use priv_size...

Thanks,
Yongseok
  
Ananyev, Konstantin April 12, 2018, 4:34 p.m. UTC | #11
> >
> > > > >
> > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > Hi Yongseok,
> > > > > >
> > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > >
> > > > > > > > > Possible use-cases could be:
> > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > >
> > > > > > > > I think the current API is already able to do what you want.
> > > > > > > >
> > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >           +----+   <---------->
> > > > > > > >           |    |
> > > > > > > >         +-|----v----------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +-----------------------------+
> > > > > > > >
> > > > > > > >
> > > > > > > > 2/ clone m:
> > > > > > > >
> > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > >
> > > > > > > >   Note that c has its own offset and length fields.
> > > > > > > >
> > > > > > > >
> > > > > > > >                off
> > > > > > > >                <-->
> > > > > > > >                       len
> > > > > > > >           +----+   <---------->
> > > > > > > >           |    |
> > > > > > > >         +-|----v----------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +------^----------------------+
> > > > > > > >                |
> > > > > > > >           +----+
> > > > > > > > indirect  |
> > > > > > > >         +-|---------------------------+
> > > > > > > >         | |    -----------------------|
> > > > > > > > c       | buf  |                     ||
> > > > > > > >         |      -----------------------|
> > > > > > > >         +-----------------------------+
> > > > > > > >
> > > > > > > >                 off    len
> > > > > > > >                 <--><---------->
> > > > > > > >
> > > > > > > >
> > > > > > > > 3/ remove some data from c without changing m
> > > > > > > >
> > > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > >
> > > > > > > >
> > > > > > > > Please let me know if it fits your needs.
> > > > > > >
> > > > > > > No, it doesn't.
> > > > > > >
> > > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > difference offsets in m,
> > > > > > >
> > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > >
> > > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > header should be linked by h1->next = c2.
> > > > > >
> > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > length) that will:
> > > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > > >     in the range [offset : length+offset].
> > > > > >   - prepend an empty and writable mbuf for the headers
> > > > > >
> > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > >
> > > > > > What do you mean by properly handled?
> > > > > >
> > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > won't be protected.
> > > > > >
> > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > may share their data. How an application can know if the data is shared
> > > > > > or not?
> > > > > >
> > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > >
> > > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > > enough to handle that use-case.
> > > > >
> > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > indirect referencing.
> > > >
> > > > But just to make HW to RX multiple packets into one mbuf,
> > > > data_off inside indirect mbuf should be enough, correct?
> > > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > to saturate the network link.
> >
> > There were few complains that 64KB max is a limitation for some use-cases.
> > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > without another big rework of mbuf layout.
> > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> >
> > >
> > > > As I understand, what you'd like to achieve with this new field -
> > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > > indirect mbufs trying to modify same direct buffer.
> > >
> > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > is read-only. What that means, all the entities which own such mbufs have to be
> > > aware of that and keep the principle as DPDK can't enforce the rule and there
> > > can't be such sanity check. In this sense, HW doesn't violate it because the
> > > direct mbuf is injected to HW before indirection. When packets are written by
> > > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> >
> > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > indirect mbufs.
> > And we don't know how exactly it will be done.
> 
> That's a valid concern. I can make it private by merging into the _attach_to()
> func, or I just can add a comment in the API doc. However, if users are aware
> that a mbuf is read-only and we expect them to keep it intact by their own
> judgement, they would/should not use those APIs. We can't stop them modifying
> content or the buffer itself anyway. Will add more comments of this discussion
> regarding read-only mode.

Ok, so these functions are intended to be used only by PMD level?
But in that case do you need them at all?
Isn't it possible implement same thing with just data_off?
I mean your PMD knows in advance what is the buf_len of mbuf and at startup
time it can decide it going to slice it how to slice it into multiple packets.
So each offset is known in advance and you don't need to worry that you'll overwrite
neighbor packet's data. 

> 
> > > The direct buffer will be freed and get available for reuse when all the attached
> > > indirect mbufs are freed.
> > >
> > > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > > Fields for indirect mbufs, straight after attach()?
> > >
> > > Good point.
> > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > had to consider a case where priv_size is really given by user. Even though it
> > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > 16bit field (buf_off) looked more plausible.
> >
> > As I remember, we can't have mbufs bigger then 64K,
> > so priv_size + buf_len should be always less than 64K, correct?
> 
> Can you let me know where I can find the constraint? I checked
> rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to not make any mistake
> but there's no such limitation.
> 
> 	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
> 		(unsigned)data_room_size;


Ok I scanned through librte_mbuf and didn't find any limitations.
Seems like a false impression from my side.
Anyway that seems like a corner case to have priv_szie + buf_len >64KB.
Do you really need to support it?

Konstantin

> 
> The max of data_room_size is 64kB, so is priv_size. m->buf_addr starts from 'm +
> sizeof(*m) + priv_size' and m->buf_len can't be larger than UINT16_MAX. So,
> priv_size couldn't be used for this purpose.
> 
> Yongseok
> 
> > > > > > >
> > > > > > > Does this make sense?
> > > > > >
> > > > > > I understand the need.
> > > > > >
> > > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > > a quick overview.
> > > > > >
> > > > > > [1]
> > > > >
> > >
> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > > > > -Session05-OlivierMatz-
> > > > >
> > >
> Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > > > > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > > > > >
> > > > > > The advantage is that it does not require the large data to be inside a
> > > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > > to implement compared to your solution.
> > > > >
> > > > > I knew that you presented the slides and frankly, I had considered that option
> > > > > at first. But even with that option, metadata to store refcnt should also be
> > > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > > I just wanted to make full use of the existing framework because it is less
> > > > > complex as you mentioned. Given that you presented the idea of external data
> > > > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > > more demands come from users in the future as it would be a huge change and may
> > > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > > discussions and write codes for it if I can be helpful.
> > > > >
> > > > > Do you think this patch is okay for now?
> > > > >
> > > > >
> > > > > Thanks for your comments,
> > > > > Yongseok
  
Yongseok Koh April 12, 2018, 6:58 p.m. UTC | #12
On Thu, Apr 12, 2018 at 04:34:56PM +0000, Ananyev, Konstantin wrote:
> > >
> > > > > >
> > > > > > On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:
> > > > > > > Hi Yongseok,
> > > > > > >
> > > > > > > On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:
> > > > > > > > On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:
> > > > > > > > > > When attaching a mbuf, indirect mbuf has to point to start of buffer of
> > > > > > > > > > direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
> > > > > > > > > > flexible. Indirect mbuf can point to any part of direct mbuf by calling
> > > > > > > > > > rte_pktmbuf_attach_at().
> > > > > > > > > >
> > > > > > > > > > Possible use-cases could be:
> > > > > > > > > > - If a packet has multiple layers of encapsulation, multiple indirect
> > > > > > > > > >   buffers can reference different layers of the encapsulated packet.
> > > > > > > > > > - A large direct mbuf can even contain multiple packets in series and
> > > > > > > > > >   each packet can be referenced by multiple mbuf indirections.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > > > > > > > >
> > > > > > > > > I think the current API is already able to do what you want.
> > > > > > > > >
> > > > > > > > > 1/ Here is a mbuf m with its data
> > > > > > > > >
> > > > > > > > >                off
> > > > > > > > >                <-->
> > > > > > > > >                       len
> > > > > > > > >           +----+   <---------->
> > > > > > > > >           |    |
> > > > > > > > >         +-|----v----------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +-----------------------------+
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2/ clone m:
> > > > > > > > >
> > > > > > > > >   c = rte_pktmbuf_alloc(pool);
> > > > > > > > >   rte_pktmbuf_attach(c, m);
> > > > > > > > >
> > > > > > > > >   Note that c has its own offset and length fields.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >                off
> > > > > > > > >                <-->
> > > > > > > > >                       len
> > > > > > > > >           +----+   <---------->
> > > > > > > > >           |    |
> > > > > > > > >         +-|----v----------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > m       | buf  |    XXXXXXXXXXX      ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +------^----------------------+
> > > > > > > > >                |
> > > > > > > > >           +----+
> > > > > > > > > indirect  |
> > > > > > > > >         +-|---------------------------+
> > > > > > > > >         | |    -----------------------|
> > > > > > > > > c       | buf  |                     ||
> > > > > > > > >         |      -----------------------|
> > > > > > > > >         +-----------------------------+
> > > > > > > > >
> > > > > > > > >                 off    len
> > > > > > > > >                 <--><---------->
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 3/ remove some data from c without changing m
> > > > > > > > >
> > > > > > > > >    rte_pktmbuf_adj(c, 10)   // at head
> > > > > > > > >    rte_pktmbuf_trim(c, 10)  // at tail
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Please let me know if it fits your needs.
> > > > > > > >
> > > > > > > > No, it doesn't.
> > > > > > > >
> > > > > > > > Trimming head and tail with the current APIs removes data and make the space
> > > > > > > > available. Adjusting packet head means giving more headroom, not shifting the
> > > > > > > > buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
> > > > > > > > difference offsets in m,
> > > > > > > >
> > > > > > > > rte_pktmbuf_adj(c1, 10);
> > > > > > > > rte_pktmbuf_adj(c2, 20);
> > > > > > > >
> > > > > > > > then the owner of c2 regard the first (off+20)B as available headroom. If it
> > > > > > > > wants to attach outer header, it will overwrite the headroom even though the
> > > > > > > > owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
> > > > > > > > header should be linked by h1->next = c2.
> > > > > > >
> > > > > > > Yes, after these operations c1, c2 and m should become read-only. So, to
> > > > > > > prepend headers, another mbuf has to be inserted before as you suggest. It
> > > > > > > is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
> > > > > > > length) that will:
> > > > > > >   - alloc and attach indirect mbuf for each segment of m that is
> > > > > > >     in the range [offset : length+offset].
> > > > > > >   - prepend an empty and writable mbuf for the headers
> > > > > > >
> > > > > > > > If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
> > > > > > > > which actually shrink the headroom, this case can be properly handled.
> > > > > > >
> > > > > > > What do you mean by properly handled?
> > > > > > >
> > > > > > > Yes, prepending data or adding data in the indirect mbuf won't override
> > > > > > > the direct mbuf. But prepending data or adding data in the direct mbuf m
> > > > > > > won't be protected.
> > > > > > >
> > > > > > > From an application point of view, indirect mbufs, or direct mbufs that
> > > > > > > have refcnt != 1, should be both considered as read-only because they
> > > > > > > may share their data. How an application can know if the data is shared
> > > > > > > or not?
> > > > > > >
> > > > > > > Maybe we need a flag to differentiate mbufs that are read-only
> > > > > > > (something like SHARED_DATA, or simply READONLY). In your case, if my
> > > > > > > understanding is correct, you want to have indirect mbufs with RW data.
> > > > > >
> > > > > > Agree that indirect mbuf must be treated as read-only, Then the current code is
> > > > > > enough to handle that use-case.
> > > > > >
> > > > > > > > And another use-case (this is my actual use-case) is to make a large mbuf have
> > > > > > > > multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
> > > > > > > > because it transfers multiple packets to a single large buffer to reduce PCIe
> > > > > > > > overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
> > > > > > > > Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
> > > > > > > > indirect referencing.
> > > > >
> > > > > But just to make HW to RX multiple packets into one mbuf,
> > > > > data_off inside indirect mbuf should be enough, correct?
> > > > Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
> > > > to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
> > > > a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
> > > > better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
> > > > to saturate the network link.
> > >
> > > There were few complains that 64KB max is a limitation for some use-cases.
> > > I am not against increasing it, but I don't think we have free space on first cache-line for that
> > > without another big rework of mbuf layout.
> > > Considering that we need to increase size for buf_len, data_off, data_len, and probably priv_size too.
> > >
> > > >
> > > > > As I understand, what you'd like to achieve with this new field -
> > > > > ability to manipulate packet boundaries after RX, probably at upper layer.
> > > > > As Olivier pointed above, that doesn't sound as safe approach - as you have multiple
> > > > > indirect mbufs trying to modify same direct buffer.
> > > >
> > > > I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
> > > > is read-only. What that means, all the entities which own such mbufs have to be
> > > > aware of that and keep the principle as DPDK can't enforce the rule and there
> > > > can't be such sanity check. In this sense, HW doesn't violate it because the
> > > > direct mbuf is injected to HW before indirection. When packets are written by
> > > > HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
> > > > application layer with freeing the original direct mbuf (decrement refcnt by 1).
> > > > So, HW doesn't touch the direct buffer once it reaches to upper layer.
> > >
> > > Yes, I understand that. But as I can see you introduced functions to adjust head and tail,
> > > which implies that it should be possible by some entity (upper layer?) to manipulate these
> > > indirect mbufs.
> > > And we don't know how exactly it will be done.
> > 
> > That's a valid concern. I can make it private by merging into the _attach_to()
> > func, or I just can add a comment in the API doc. However, if users are aware
> > that a mbuf is read-only and we expect them to keep it intact by their own
> > judgement, they would/should not use those APIs. We can't stop them modifying
> > content or the buffer itself anyway. Will add more comments of this discussion
> > regarding read-only mode.
> 
> Ok, so these functions are intended to be used only by PMD level?
> But in that case do you need them at all?
> Isn't it possible implement same thing with just data_off?
> I mean your PMD knows in advance what is the buf_len of mbuf and at startup
> time it can decide it going to slice it how to slice it into multiple packets.
> So each offset is known in advance and you don't need to worry that you'll overwrite
> neighbor packet's data. 

Since Olivier's last comment, I've been thinking about the approach all over
again. It looks like I'm trapped in self-contradiction. The reason why I didn't
want to use data_off was to provide valid headroom for each Rx packet and let
users freely write the headroom. But, given that indirect mbuf should be
considered read-only, this isn't a right approach. Instead of slicing a buffer
with mbuf indirection and manipulating boundaries, the idea of external data (as
Olivier suggested) would fit better. Even though it is more complex, it is
doable. I summarized ideas yesterday and will come up with a new patch soon.

Briefly, I think reserved bit 61 of ol_flags can be used to indicate externally
attached mbuf. The following is my initial thought.

#define EXT_ATTACHED_MBUF    (1ULL << 61)

struct rte_pktmbuf_ext_shared_info {
	refcnt;
	*free_cb();
	*opaque /* arg for free_cb() */
}

rte_pktmbuf_get_ext_shinfo() {
	/* Put shared info at the end of external buffer */
	return (struct rte_pktmbuf_ext_shared_info *)(m->buf_addr + m->buf_len);
}

rte_pktmbuf_attach_ext_buf(m, buf_addr, buf_len, free_cb, opaque) {
	struct rte_pktmbuf_ext_shared_info *shinfo;

	m->buf_addr = buf_addr;
	m->buf_iova = rte_mempool_virt2iova(buf_addr);
	/* Have to add some calculation for alignment */
	m->buf_len = buf_len - sizeof (*shinfo);
	shinfo = m->buf_addr + m->buf_len;
	...
	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
	m->ol_flags |= EXT_ATTACHED_MBUF;
	atomic set shinfo->refcnt = 1;

	shinfo->free_cb = free_cb;
	shinfo->opaque = opaque;

	...
}
rte_pktmbuf_detach_ext_buf(m)

#define RTE_MBUF_EXT(mb)   ((mb)->ol_flags & EXT_ATTACHED_MBUF)

In rte_pktmbuf_prefree_seg(),

		if (RTE_MBUF_INDIRECT(m))
			rte_pktmbuf_detach(m);
		else if (RTE_MBUF_EXT(m))
			rte_pktmbuf_detach_ext_buf(m);

And in rte_pktmbuf_attach(), if the mbuf attaching to is externally attached,
then just increase refcnt in shinfo so that multiple mbufs can refer to the same
external buffer.

Please feel free to share any concern/idea.

> > > > The direct buffer will be freed and get available for reuse when all the attached
> > > > indirect mbufs are freed.
> > > >
> > > > > Though if you really need to do that, why it can be achieved by updating buf_len and priv_size
> > > > > Fields for indirect mbufs, straight after attach()?
> > > >
> > > > Good point.
> > > > Actually that was my draft (Mellanox internal) version of this patch :-) But I
> > > > had to consider a case where priv_size is really given by user. Even though it
> > > > is less likely, but if original priv_size is quite big, it can't cover entire
> > > > buf_len. For this, I had to increase priv_size to 32-bit but adding another
> > > > 16bit field (buf_off) looked more plausible.
> > >
> > > As I remember, we can't have mbufs bigger then 64K,
> > > so priv_size + buf_len should be always less than 64K, correct?
> > 
> > Can you let me know where I can find the constraint? I checked
> > rte_pktmbuf_pool_create() and rte_pktmbuf_init() again to not make any mistake
> > but there's no such limitation.
> > 
> > 	elt_size = sizeof(struct rte_mbuf) + (unsigned)priv_size +
> > 		(unsigned)data_room_size;
> 
> 
> Ok I scanned through librte_mbuf and didn't find any limitations.
> Seems like a false impression from my side.
> Anyway that seems like a corner case to have priv_szie + buf_len >64KB.
> Do you really need to support it?

If a user must have 64kB buffer (it's valid, no violation) and the priv_size is
just a few bytes. Then, does library have to force the user to sacrifice a few
bytes for priv_size? Do you think it's a corner case? Still using priv_size
doesn't seem to be a good idea.

Yongseok

> > The max of data_room_size is 64kB, so is priv_size. m->buf_addr starts from 'm +
> > sizeof(*m) + priv_size' and m->buf_len can't be larger than UINT16_MAX. So,
> > priv_size couldn't be used for this purpose.
> > 
> > Yongseok
> > 
> > > > > > > >
> > > > > > > > Does this make sense?
> > > > > > >
> > > > > > > I understand the need.
> > > > > > >
> > > > > > > Another option would be to make the mbuf->buffer point to an external
> > > > > > > buffer (not inside the direct mbuf). This would require to add a
> > > > > > > mbuf->free_cb. See "Mbuf with external data buffer" (page 19) in [1] for
> > > > > > > a quick overview.
> > > > > > >
> > > > > > > [1]
> > > > > >
> > > >
> > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdpdksummit.com%2FArchive%2Fpdf%2F2016Userspace%2FDay01
> > > > > > -Session05-OlivierMatz-
> > > > > >
> > > >
> > Userspace2016.pdf&data=02%7C01%7Cyskoh%40mellanox.com%7Ca5405edb36e445e6540808d59e339a38%7Ca652971c7d2e4d9ba6a4d
> > > > > > 149256f461b%7C0%7C0%7C636588866861082855&sdata=llw%2BwiY5cC56naOUhBbIg8TKtfFN6VZcIRY5PV7VqZs%3D&reserved=0
> > > > > > >
> > > > > > > The advantage is that it does not require the large data to be inside a
> > > > > > > mbuf (requiring a mbuf structure before the buffer, and requiring to be
> > > > > > > allocated from a mempool). On the other hand, it is maybe more complex
> > > > > > > to implement compared to your solution.
> > > > > >
> > > > > > I knew that you presented the slides and frankly, I had considered that option
> > > > > > at first. But even with that option, metadata to store refcnt should also be
> > > > > > allocated and managed anyway. Kernel also maintains the skb_shared_info at the
> > > > > > end of the data segment. Even though it could have smaller metadata structure,
> > > > > > I just wanted to make full use of the existing framework because it is less
> > > > > > complex as you mentioned. Given that you presented the idea of external data
> > > > > > buffer in 2016 and there hasn't been many follow-up discussions/activities so
> > > > > > far, I thought the demand isn't so big yet thus I wanted to make this patch
> > > > > > simpler.  I personally think that we can take the idea of external data seg when
> > > > > > more demands come from users in the future as it would be a huge change and may
> > > > > > break current ABI/API. When the day comes, I'll gladly participate in the
> > > > > > discussions and write codes for it if I can be helpful.
> > > > > >
> > > > > > Do you think this patch is okay for now?
> > > > > >
> > > > > >
> > > > > > Thanks for your comments,
> > > > > > Yongseok
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 62740254d..053db32d0 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -559,6 +559,11 @@  struct rte_mbuf {
 		};
 	};
 
+	/** Buffer offset of direct mbuf if attached. Indirect mbuf can point to
+	 * any part of direct mbuf.
+	 */
+	uint16_t buf_off;
+
 	/** Size of the application private data. In case of an indirect
 	 * mbuf, it stores the direct mbuf private data size. */
 	uint16_t priv_size;
@@ -671,7 +676,9 @@  rte_mbuf_data_dma_addr_default(const struct rte_mbuf *mb)
 static inline struct rte_mbuf *
 rte_mbuf_from_indirect(struct rte_mbuf *mi)
 {
-	return (struct rte_mbuf *)RTE_PTR_SUB(mi->buf_addr, sizeof(*mi) + mi->priv_size);
+	return (struct rte_mbuf *)
+		RTE_PTR_SUB(mi->buf_addr,
+				sizeof(*mi) + mi->priv_size + mi->buf_off);
 }
 
 /**
@@ -1281,6 +1288,98 @@  static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Adjust tailroom of indirect mbuf. If offset is positive, enlarge the
+ * tailroom of the mbuf. If negative, shrink the tailroom.
+ *
+ * If length is out of range, then the function will fail and return -1,
+ * without modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param len
+ *   The amount of length to adjust (in bytes).
+ * @return
+ *   - 0: On success.
+ *   - -1: On error.
+ */
+static inline int rte_pktmbuf_adj_indirect_tail(struct rte_mbuf *mi, int len)
+{
+	struct rte_mbuf *md;
+	uint16_t tailroom;
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	md = rte_mbuf_from_indirect(mi);
+	if (unlikely(mi->buf_len + len <= 0 ||
+			mi->buf_off + mi->buf_len + len >= md->buf_len))
+		return -1;
+
+	mi->buf_len += len;
+
+	tailroom = mi->buf_len - mi->data_off - mi->data_len;
+	delta = tailroom + len;
+	if (delta > 0) {
+		/* Adjust tailroom */
+		delta = 0;
+	} else if (delta + mi->data_len < 0) {
+		/* No data */
+		mi->data_off += delta + mi->data_len;
+		delta = mi->data_len;
+	}
+	mi->data_len += delta;
+	mi->pkt_len += delta;
+	return 0;
+}
+
+/**
+ * Shift buffer reference of indirect mbuf. If offset is positive, push
+ * the offset of the mbuf. If negative, pull the offset.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * is out of range, then the function will fail and return NULL, without
+ * modifying the indirect mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param off
+ *   The amount of offset to adjust (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_adj_indirect_head(struct rte_mbuf *mi, int off)
+{
+	int delta;
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(mi));
+
+	if (unlikely(off >= mi->buf_len || mi->buf_off + off < 0))
+		return NULL;
+
+	mi->buf_iova += off;
+	mi->buf_addr = (char *)mi->buf_addr + off;
+	mi->buf_len -= off;
+	mi->buf_off += off;
+
+	delta = off - mi->data_off;
+	if (delta < 0) {
+		/* Adjust headroom */
+		mi->data_off -= off;
+		delta = 0;
+	} else if (delta < mi->data_len) {
+		/* No headroom */
+		mi->data_off = 0;
+	} else {
+		/* No data */
+		mi->data_off = 0;
+		delta = mi->data_len;
+	}
+	mi->data_len -= delta;
+	mi->pkt_len -= delta;
+	return (char *)mi->buf_addr + mi->data_off;
+}
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
  * After attachment we refer the mbuf we attached as 'indirect',
@@ -1315,6 +1414,7 @@  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
+	mi->buf_off = 0;
 
 	mi->data_off = m->data_off;
 	mi->data_len = m->data_len;
@@ -1336,6 +1436,62 @@  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
+ * Attach packet mbuf to another packet mbuf pointing by given offset.
+ *
+ * After attachment we refer the mbuf we attached as 'indirect',
+ * while mbuf we attached to as 'direct'.
+ *
+ * The indirect mbuf can reference to anywhere in the buffer of the direct
+ * mbuf by the given offset. And the indirect mbuf is also be trimmed by
+ * the given buffer length.
+ *
+ * As a result, if a direct mbuf has multiple layers of encapsulation,
+ * multiple indirect buffers can reference different layers of the packet.
+ * Or, a large direct mbuf can even contain multiple packets in series and
+ * each packet can be referenced by multiple mbuf indirections.
+ *
+ * Returns a pointer to the start address of the new data area. If offset
+ * or buffer length is out of range, then the function will fail and return
+ * NULL, without attaching the mbuf.
+ *
+ * @param mi
+ *   The indirect packet mbuf.
+ * @param m
+ *   The packet mbuf we're attaching to.
+ * @param off
+ *   The amount of offset to push (in bytes).
+ * @param buf_len
+ *   The buffer length of the indirect mbuf (in bytes).
+ * @return
+ *   A pointer to the new start of the data.
+ */
+static inline char *rte_pktmbuf_attach_at(struct rte_mbuf *mi,
+	struct rte_mbuf *m, uint16_t off, uint16_t buf_len)
+{
+	struct rte_mbuf *md;
+	char *ret;
+
+	if (RTE_MBUF_DIRECT(m))
+		md = m;
+	else
+		md = rte_mbuf_from_indirect(m);
+
+	if (off + buf_len > md->buf_len)
+		return NULL;
+
+	rte_pktmbuf_attach(mi, m);
+
+	/* Push reference of indirect mbuf */
+	ret = rte_pktmbuf_adj_indirect_head(mi, off);
+	RTE_ASSERT(ret != NULL);
+
+	/* Trim reference of indirect mbuf */
+	rte_pktmbuf_adj_indirect_tail(mi, off + buf_len - md->buf_len);
+
+	return ret;
+}
+
+/**
  * Detach an indirect packet mbuf.
  *
  *  - restore original mbuf address and length values.