
[RFC] ethdev: introduce DMA memory mapping for external memory

Message ID aae4d2d1d6ceabe661e22ae8a7591193cea62104.1541335203.git.shahafs@mellanox.com (mailing list archive)
State Superseded, archived
Series [RFC] ethdev: introduce DMA memory mapping for external memory

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Shahaf Shuler Nov. 4, 2018, 12:41 p.m. UTC
Request for comments on the high-level changes present in this patch.

The need to use external memory (memory belonging to the application and
not part of the DPDK hugepages) already exists.
It starts with storage apps which prefer to manage their own memory blocks
for efficient use of the storage device, continues with GPU-based
applications which strive to achieve zero copy while processing the packet
payload on the GPU core, and ends with vSwitch/vRouter applications which
simply prefer to have full control over the memory in use (e.g. VPP).

Recent work[1] in DPDK enabled the use of external memory, however
it mostly focuses on VFIO as the only way to map memory.
While VFIO is common, other vendors use different ways
to map memory (e.g. Mellanox and NXP[2]).

The work in this patch moves the DMA mapping to vendor-agnostic APIs
located under ethdev. ethdev was chosen because a memory mapping should
be associated with a specific port (or ports). Otherwise the memory is
mapped multiple times to different frameworks, and memory ends up being
wasted on redundant translation tables in the host or in the device.

For example, consider a host with both Mellanox and Intel devices. Mapping
memory without specifying a port will end up with both IOMMU
registration and Verbs (Mellanox DMA map) registration.
Another example is two Mellanox devices on the same host: the memory
will be mapped for both, even though the application will use a mempool
per device.

To use the suggested APIs, the application will allocate a memory block
and call rte_eth_dma_map for every port that needs DMA access to this
memory.
Later on, the application can use this memory to populate a mempool or
to attach an mbuf with an external buffer.
When the memory should no longer be used by the device, the application
will call rte_eth_dma_unmap on every port it registered it with.

The drivers will implement the DMA map/unmap, and they will very likely
use the help of the existing VFIO mapping.

Support for device hotplug/unplug is out of scope for this patch,
however it can be implemented in the same way it is done for VFIO.

Cc: pawelx.wodkowski@intel.com
Cc: anatoly.burakov@intel.com
Cc: gowrishankar.m@linux.vnet.ibm.com
Cc: ferruh.yigit@intel.com
Cc: thomas@monjalon.net
Cc: arybchenko@solarflare.com
Cc: shreyansh.jain@nxp.com

Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>

[1]
commit 73a639085938 ("vfio: allow to map other memory regions")
[2]
http://mails.dpdk.org/archives/dev/2018-September/111978.html
---
 lib/librte_eal/bsdapp/eal/eal.c          | 14 ----------
 lib/librte_eal/common/include/rte_vfio.h | 44 --------------------------------
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 22 ----------------
 lib/librte_ethdev/rte_ethdev.c           | 38 +++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h           | 40 +++++++++++++++++++++++++++++
 lib/librte_ethdev/rte_ethdev_core.h      | 14 ++++++++++
 6 files changed, 92 insertions(+), 80 deletions(-)

Comments

Burakov, Anatoly Nov. 14, 2018, 11:19 a.m. UTC | #1
Hi Shahaf,

Great to see such effort! Few comments below.

Note: halfway through writing my comments I realized that I am starting 
with an assumption that this API is a replacement for the current VFIO DMA 
mapping APIs. So, if my comments seem out of left field, this is 
probably why :)

On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
> Request for comment on the high level changes present on this patch.
> 
> The need to use external memory (memory belong to application and not
> part of the DPDK hugepages) is allready present.
> Starting from storage apps which prefer to manage their own memory blocks
> for efficient use of the storage device. Continue with GPU based
> application which strives to achieve zero copy while processing the packet
> payload on the GPU core. And finally by vSwitch/vRouter application who
> just prefer to have a full control over the memory in use (e.g. VPP).
> 
> Recent work[1] in the DPDK enabled the use of external memory, however
> it mostly focus on VFIO as the only way to map memory.
> While VFIO is common, there are other vendors which use different ways
> to map memory (e.g. Mellanox and NXP[2]).
> 
> The work in this patch moves the DMA mapping to vendor agnostic APIs
> located under ethdev. The choice in ethdev was because memory map should
> be associated with a specific port(s). Otherwise the memory is being
> mapped multiple times to different frameworks and ends up with memory
> being wasted on redundant translation table in the host or in the device.

So, anything other than ethdev (e.g. cryptodev) will not be able to map 
memory for DMA?

I have thought about this for some length of time, and I think DMA 
mapping belongs in EAL (more specifically, somewhere at the bus layer), 
rather than at device level. Placing this functionality at device level 
comes with more work to support different device types and puts a burden 
on device driver developers to implement their own mapping functions.

However, I have no familiarity with how MLX/NXP devices do their DMA 
mapping, so maybe the device-centric approach would be better. We could 
provide "standard" mapping functions at the bus level (such as VFIO 
mapping functions for the PCI bus), so that this code would not have to be 
reimplemented in the devices.

Moreover, I'm not sure how this is going to work for VFIO. If this is to 
be called for each NIC that needs access to the memory, then we'll end 
up with double mappings for any NIC that uses VFIO, unless you want each 
NIC to be in a separate container.

> 
> For example, consider a host with Mellanox and Intel devices. Mapping a
> memory without specifying to which port will end up with IOMMU
> registration and Verbs (Mellanox DMA map) registration.
> Another example can be two Mellanox devices on the same host. The memory
> will be mapped for both, even though application will use mempool per
> device.
> 
> To use the suggested APIs the application will allocate a memory block
> and will call rte_eth_dma_map. It will map it to every port that needs
> DMA access to this memory.

This bit is unclear to me. What do you mean "map it to every port that 
needs DMA access to this memory"? I don't see how this API solves the 
above problem of mapping the same memory to all devices. How does a 
device know which memory it will need? Does the user specifically have 
to call this API for each and every NIC they're using?

For DPDK-managed memory, everything will still get mapped to every 
device automatically, correct? If so, then such a manual approach for 
external memory will be bad for both usability and drop-in replacement 
of internal-to-external memory, because it introduces inconsistency 
between using internal and external memory. From my point of view, 
either we do *everything* manually (i.e. register all memory for DMA 
explicitly) and thereby avoid this problem but keep the consistency, or 
we do *everything* automatically and deal with duplication of mappings 
somehow (say, by MLX/NXP drivers sharing their mappings through bus 
interface).

> Later on the application could use this memory to populate a mempool or
> to attach mbuf with external buffer.
> When the memory should no longer be used by the device the application
> will call rte_eth_dma_unmap from every port it did registration to.
> 
> The Drivers will implement the DMA map/unmap, and it is very likely they
> will use the help of the existing VFIO mapping.
> 
> Support for hotplug/unplug of device is out of the scope for this patch,
> however can be implemented in the same way it is done on VFIO.
> 
> Cc: pawelx.wodkowski@intel.com
> Cc: anatoly.burakov@intel.com
> Cc: gowrishankar.m@linux.vnet.ibm.com
> Cc: ferruh.yigit@intel.com
> Cc: thomas@monjalon.net
> Cc: arybchenko@solarflare.com
> Cc: shreyansh.jain@nxp.com
> 
> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
> 
> [1]
> commit 73a639085938 ("vfio: allow to map other memory regions")
> [2]
> http://mails.dpdk.org/archives/dev/2018-September/111978.html
> ---
Shahaf Shuler Nov. 14, 2018, 2:53 p.m. UTC | #2
Hi Anatoly, 

Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> memory
> 
> Hi Shahaf,
> 
> Great to see such effort! Few comments below.
> 
> Note: halfway through writing my comments i realized that i am starting with
> an assumption that this API is a replacement for current VFIO DMA mapping
> API's. So, if my comments seem out of left field, this is probably why :)
> 
> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
> > Request for comment on the high level changes present on this patch.
> >
> > The need to use external memory (memory belong to application and not
> > part of the DPDK hugepages) is allready present.
> > Starting from storage apps which prefer to manage their own memory
> > blocks for efficient use of the storage device. Continue with GPU
> > based application which strives to achieve zero copy while processing
> > the packet payload on the GPU core. And finally by vSwitch/vRouter
> > application who just prefer to have a full control over the memory in use
> (e.g. VPP).
> >
> > Recent work[1] in the DPDK enabled the use of external memory, however
> > it mostly focus on VFIO as the only way to map memory.
> > While VFIO is common, there are other vendors which use different ways
> > to map memory (e.g. Mellanox and NXP[2]).
> >
> > The work in this patch moves the DMA mapping to vendor agnostic APIs
> > located under ethdev. The choice in ethdev was because memory map
> > should be associated with a specific port(s). Otherwise the memory is
> > being mapped multiple times to different frameworks and ends up with
> > memory being wasted on redundant translation table in the host or in the
> device.
> 
> So, anything other than ethdev (e.g. cryptodev) will not be able to map
> memory for DMA?

That is a fair point.

> 
> I have thought about this for some length of time, and i think DMA mapping
> belongs in EAL (more specifically, somewhere at the bus layer), rather than at
> device level.

I am not sure I agree here. For example, take Intel and Mellanox devices. Both are PCI devices, so how will you distinguish which mapping API to use?
Also, I still think the mapping should be at device granularity and not bus/system granularity, since it is very typical for memory used for DMA to belong to a specific device.

Maybe we can say the DMA mapping is an rte_device attribute, as it is the parent class for all DPDK devices.
We need to see how this works with vport representors (which all have the same rte_device). In that case I believe the rte_device map call can register the memory for all of the representors as well (if needed).

> Placing this functionality at device level comes with more work
> to support different device types and puts a burden on device driver
> developers to implement their own mapping functions.

The mapping function can be shared. For example, we can still maintain the VFIO mapping scheme as part of EAL and have all the related drivers call this function. 
The only overhead will be maintaining the function pointer for the DMA call. 
With this work, instead of the EAL layer guessing which type of DMA mapping the devices in the system need, or alternatively forcing them all to work with VFIO, each driver will select its own function. 
The driver is the only one which knows what type of DMA mapping its device needs. 

> 
> However, i have no familiarity with how MLX/NXP devices do their DMA
> mapping, so maybe the device-centric approach would be better. We could
> provide "standard" mapping functions at the bus level (such as VFIO mapping
> functions for PCI bus), so that this could would not have to be
> reimplemented in the devices.

Yes, like I said above, I wasn't intending to re-implement all the mapping functions again in each driver. Yet, I believe it should be per device. 

> 
> Moreover, i'm not sure how this is going to work for VFIO. If this is to be
> called for each NIC that needs access to the memory, then we'll end up with
> double mappings for any NIC that uses VFIO, unless you want each NIC to be
> in a separate container.

I am not very familiar with VFIO (you are the expert 😊). 
What will happen if we map the same memory twice (under the same container)? Will the translation on the IOMMU be doubled? Will the map return an error saying this memory mapping already exists? 

> 
> >
> > For example, consider a host with Mellanox and Intel devices. Mapping a
> > memory without specifying to which port will end up with IOMMU
> > registration and Verbs (Mellanox DMA map) registration.
> > Another example can be two Mellanox devices on the same host. The
> memory
> > will be mapped for both, even though application will use mempool per
> > device.
> >
> > To use the suggested APIs the application will allocate a memory block
> > and will call rte_eth_dma_map. It will map it to every port that needs
> > DMA access to this memory.
> 
> This bit is unclear to me. What do you mean "map it to every port that
> needs DMA access to this memory"? I don't see how this API solves the
> above problem of mapping the same memory to all devices. How does a
> device know which memory it will need? Does the user specifically have
> to call this API for each and every NIC they're using?

Yes, the user will call this API for every port which needs DMA access to this memory.
Remember, we are speaking here about external memory the application allocated and wants to use for send/receive. The device doesn't guess which memory it will need; the user tells it explicitly. 

> 
> For DPDK-managed memory, everything will still get mapped to every
> device automatically, correct? 

Yes, even though it is not the case today. 

> If so, then such a manual approach for
> external memory will be bad for both usability and drop-in replacement
> of internal-to-external memory, because it introduces inconsistency
> between using internal and external memory. From my point of view,
> either we do *everything* manually (i.e. register all memory for DMA
> explicitly) and thereby avoid this problem but keep the consistency, or
> we do *everything* automatically and deal with duplication of mappings
> somehow (say, by MLX/NXP drivers sharing their mappings through bus
> interface).

I understand your point, however I am not sure external and internal memory *must* be consistent.
The DPDK-managed memory is part of the DPDK subsystems, and the DPDK libs prepare it for optimal use by the underlying devices. The external memory is different: it is proprietary memory the application allocated, and DPDK cannot do anything on it in advance.
Even today there is inconsistency, because if the user wants to use external memory they must map it (rte_vfio_dma_map), while they don't need to do that for DPDK-managed memory. 

I guess we can add a flag to the device mapping which will say MAP_TO_ALL_DEVICES, to ease the application's life in the presence of multiple devices in the host. 

> 
> > Later on the application could use this memory to populate a mempool or
> > to attach mbuf with external buffer.
> > When the memory should no longer be used by the device the application
> > will call rte_eth_dma_unmap from every port it did registration to.
> >
> > The Drivers will implement the DMA map/unmap, and it is very likely they
> > will use the help of the existing VFIO mapping.
> >
> > Support for hotplug/unplug of device is out of the scope for this patch,
> > however can be implemented in the same way it is done on VFIO.
> >
> > Cc: pawelx.wodkowski@intel.com
> > Cc: anatoly.burakov@intel.com
> > Cc: gowrishankar.m@linux.vnet.ibm.com
> > Cc: ferruh.yigit@intel.com
> > Cc: thomas@monjalon.net
> > Cc: arybchenko@solarflare.com
> > Cc: shreyansh.jain@nxp.com
> >
> > Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
> >
> > [1]
> > commit 73a639085938 ("vfio: allow to map other memory regions")
> > [2]
> > http://mails.dpdk.org/archives/dev/2018-September/111978.html
> > ---
> --
> Thanks,
> Anatoly
Burakov, Anatoly Nov. 14, 2018, 5:06 p.m. UTC | #3
On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> Hi Anatoly,
> 
> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>> memory
>>
>> Hi Shahaf,
>>
>> Great to see such effort! Few comments below.
>>
>> Note: halfway through writing my comments i realized that i am starting with
>> an assumption that this API is a replacement for current VFIO DMA mapping
>> API's. So, if my comments seem out of left field, this is probably why :)
>>
>> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
>>> Request for comment on the high level changes present on this patch.
>>>
>>> The need to use external memory (memory belong to application and not
>>> part of the DPDK hugepages) is allready present.
>>> Starting from storage apps which prefer to manage their own memory
>>> blocks for efficient use of the storage device. Continue with GPU
>>> based application which strives to achieve zero copy while processing
>>> the packet payload on the GPU core. And finally by vSwitch/vRouter
>>> application who just prefer to have a full control over the memory in use
>> (e.g. VPP).
>>>
>>> Recent work[1] in the DPDK enabled the use of external memory, however
>>> it mostly focus on VFIO as the only way to map memory.
>>> While VFIO is common, there are other vendors which use different ways
>>> to map memory (e.g. Mellanox and NXP[2]).
>>>
>>> The work in this patch moves the DMA mapping to vendor agnostic APIs
>>> located under ethdev. The choice in ethdev was because memory map
>>> should be associated with a specific port(s). Otherwise the memory is
>>> being mapped multiple times to different frameworks and ends up with
>>> memory being wasted on redundant translation table in the host or in the
>> device.
>>
>> So, anything other than ethdev (e.g. cryptodev) will not be able to map
>> memory for DMA?
> 
> That's is a fair point.
> 
>>
>> I have thought about this for some length of time, and i think DMA mapping
>> belongs in EAL (more specifically, somewhere at the bus layer), rather than at
>> device level.
> 
> I am not sure I agree here. For example take Intel and Mellanox devices. Both are PCI devices, so how will you distinguish which mapping API to use?
> Also I still think the mapping should be in device granularity and not bus/system granularity, since it is very typical for a memory to be used for DMA be a specific device.
> 
> Maybe we can say the DMA mapping is a rte_device attribute. It is the parent class for all the DPDK devices.
> We need to see w/ vport representors (which all has the same rte_device). On that case I believe the rte_device.map call can register the memory to all of the representors as well (if needed).
> 
> Placing this functionality at device level comes with more work
>> to support different device types and puts a burden on device driver
>> developers to implement their own mapping functions.
> 
> The mapping function can be shared. For example we can still maintain the vfio mapping scheme as part of eal and have all the related driver to call this function.
> The only overhead will be to maintain the function pointer for the dma call.
> With this work, instead of the eal layer to guess which type of DMA mapping the devices in the  system needs or alternatively force them all to work w/ VFIO, each driver will select its own function.
> The driver is the only one which knows what type of DMA mapping its device needs.
> 
>>
>> However, i have no familiarity with how MLX/NXP devices do their DMA
>> mapping, so maybe the device-centric approach would be better. We could
>> provide "standard" mapping functions at the bus level (such as VFIO mapping
>> functions for PCI bus), so that this could would not have to be
>> reimplemented in the devices.
> 
> Yes, like I said above, I wasn't intending to re-implement all the mapping function again on each driver. Yet, I believe it should be per device.
> 
>>
>> Moreover, i'm not sure how this is going to work for VFIO. If this is to be
>> called for each NIC that needs access to the memory, then we'll end up with
>> double mappings for any NIC that uses VFIO, unless you want each NIC to be
>> in a separate container.
> 
> I am not much familiar w/ VFIO (you are the expert 😊).
> What will happen if we map the same memory twice (under same container)? The translation on the IOMMU will be doubled? The map will return with error that this memory mapping already exists?

The latter. You can't map the same memory twice in the same container. 
You can't even keep NICs in separate containers because then secondary 
processes won't work without serious rework.

So, all VFIO-mapped things will need to share the mappings.

It's not an insurmountable problem, but if we're going to share mapping 
status for VFIO (i.e. track which area is already mapped), then what's 
stopping us from doing the same for other DMA mapping mechanisms? I.e. 
instead of duplicating the mappings in each driver, provide some kind of 
mechanism for devices to share the DMA mapping caches. Apologies if I'm 
talking nonsense - I'm completely unfamiliar with how DMA mapping works 
for MLX/NXP devices :)

> 
>>
>>>
>>> For example, consider a host with Mellanox and Intel devices. Mapping a
>>> memory without specifying to which port will end up with IOMMU
>>> registration and Verbs (Mellanox DMA map) registration.
>>> Another example can be two Mellanox devices on the same host. The
>> memory
>>> will be mapped for both, even though application will use mempool per
>>> device.
>>>
>>> To use the suggested APIs the application will allocate a memory block
>>> and will call rte_eth_dma_map. It will map it to every port that needs
>>> DMA access to this memory.
>>
>> This bit is unclear to me. What do you mean "map it to every port that
>> needs DMA access to this memory"? I don't see how this API solves the
>> above problem of mapping the same memory to all devices. How does a
>> device know which memory it will need? Does the user specifically have
>> to call this API for each and every NIC they're using?
> 
> Yes, the user will call this API for every port which needs to have DMA access to this memory.
> Remember we are speaking here on external memory the application allocated and wants to use for send/receive.  The device doesn't guess which memory he will need, the user is telling it to him explicitly.
> 
>>
>> For DPDK-managed memory, everything will still get mapped to every
>> device automatically, correct?
> 
> Yes, even though it is not the case today.

What do you mean it is not the case? It is the case today. When an 
external memory chunk is registered at the heap, a mem event callback is 
triggered just like for regular memory, and this chunk does get mapped 
to VFIO as well as to any other subscribed entity. As I recall, NXP NICs 
are currently set up to ignore externally allocated memory, but for the 
general VFIO case, everything is mapped automatically.

> 
> If so, then such a manual approach for
>> external memory will be bad for both usability and drop-in replacement
>> of internal-to-external memory, because it introduces inconsistency
>> between using internal and external memory. From my point of view,
>> either we do *everything* manually (i.e. register all memory for DMA
>> explicitly) and thereby avoid this problem but keep the consistency, or
>> we do *everything* automatically and deal with duplication of mappings
>> somehow (say, by MLX/NXP drivers sharing their mappings through bus
>> interface).
> 
> I understand your point, however I am not sure external and internal memory *must* be consist.
> The DPDK-managed memory is part of the DPDK subsystems and the DPDK libs are preparing it for the optimal use of the underlying devices. The external memory is different, it is a proprietary memory the application allocated and the DPDK cannot do anything in advance on it.

My view for designing external memory support was that it should behave 
like regular DPDK memory for all intents and purposes, and be a drop-in 
replacement, should you choose to use it. I.e. the application should 
not care whether it uses internal or external memory - it all sits in 
the same malloc heaps, it all uses the same socket ID mechanisms, etc. - 
for all intents and purposes, they're one and the same.

> Even today there is inconsistency, because if user wants to use external memory it must map it (rte_vfio_dma_map) while he doesn't need to do that for the DPDK-managed memory.

Well, now I see where your confusion stems from :) You didn't know about 
this:

http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169

This will trigger all mem event callbacks, including VFIO DMA mapping 
callback.

> 
> I guess we can we can add a flag on the device mapping which will say MAP_TO_ALL_DEVICES, to ease the application life in the presence of multiple device in the host.
>
Shahaf Shuler Nov. 15, 2018, 9:46 a.m. UTC | #4
Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> memory
> 
> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> > Hi Anatoly,
> >
> > Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> >> memory
> >>
> >> Hi Shahaf,
> >>
> >> Great to see such effort! Few comments below.
> >>
> >> Note: halfway through writing my comments i realized that i am
> >> starting with an assumption that this API is a replacement for
> >> current VFIO DMA mapping API's. So, if my comments seem out of left
> >> field, this is probably why :)
> >>
> >> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
> >>> Request for comment on the high level changes present on this patch.
> >>>
> >>> The need to use external memory (memory belong to application and
> >>> not part of the DPDK hugepages) is allready present.
> >>> Starting from storage apps which prefer to manage their own memory
> >>> blocks for efficient use of the storage device. Continue with GPU
> >>> based application which strives to achieve zero copy while
> >>> processing the packet payload on the GPU core. And finally by
> >>> vSwitch/vRouter application who just prefer to have a full control
> >>> over the memory in use
> >> (e.g. VPP).
> >>>
> >>> Recent work[1] in the DPDK enabled the use of external memory,
> >>> however it mostly focus on VFIO as the only way to map memory.
> >>> While VFIO is common, there are other vendors which use different
> >>> ways to map memory (e.g. Mellanox and NXP[2]).
> >>>
> >>> The work in this patch moves the DMA mapping to vendor agnostic APIs
> >>> located under ethdev. The choice in ethdev was because memory map
> >>> should be associated with a specific port(s). Otherwise the memory
> >>> is being mapped multiple times to different frameworks and ends up
> >>> with memory being wasted on redundant translation table in the host
> >>> or in the
> >> device.
> >>
> >> So, anything other than ethdev (e.g. cryptodev) will not be able to
> >> map memory for DMA?
> >
> > That's is a fair point.
> >
> >>
> >> I have thought about this for some length of time, and i think DMA
> >> mapping belongs in EAL (more specifically, somewhere at the bus
> >> layer), rather than at device level.
> >
> > I am not sure I agree here. For example take Intel and Mellanox devices.
> Both are PCI devices, so how will you distinguish which mapping API to use?
> > Also I still think the mapping should be in device granularity and not
> bus/system granularity, since it is very typical for a memory to be used for
> DMA be a specific device.
> >
> > Maybe we can say the DMA mapping is a rte_device attribute. It is the
> parent class for all the DPDK devices.
> > We need to see w/ vport representors (which all has the same rte_device).
> On that case I believe the rte_device.map call can register the memory to all
> of the representors as well (if needed).
> >
> > Placing this functionality at device level comes with more work
> >> to support different device types and puts a burden on device driver
> >> developers to implement their own mapping functions.
> >
> > The mapping function can be shared. For example we can still maintain the
> vfio mapping scheme as part of eal and have all the related driver to call this
> function.
> > The only overhead will be to maintain the function pointer for the dma call.
> > With this work, instead of the eal layer to guess which type of DMA
> mapping the devices in the  system needs or alternatively force them all to
> work w/ VFIO, each driver will select its own function.
> > The driver is the only one which knows what type of DMA mapping its
> device needs.
> >
> >>
> >> However, i have no familiarity with how MLX/NXP devices do their DMA
> >> mapping, so maybe the device-centric approach would be better. We
> >> could provide "standard" mapping functions at the bus level (such as
> >> VFIO mapping functions for PCI bus), so that this could would not
> >> have to be reimplemented in the devices.
> >
> > Yes, like I said above, I wasn't intending to re-implement all the mapping
> function again on each driver. Yet, I believe it should be per device.
> >
> >>
> >> Moreover, i'm not sure how this is going to work for VFIO. If this is
> >> to be called for each NIC that needs access to the memory, then we'll
> >> end up with double mappings for any NIC that uses VFIO, unless you
> >> want each NIC to be in a separate container.
> >
> > I am not much familiar w/ VFIO (you are the expert 😊).
> > What will happen if we map the same memory twice (under same
> container)? The translation on the IOMMU will be doubled? The map will
> return with error that this memory mapping already exists?
> 
> The latter. You can't map the same memory twice in the same container.
> You can't even keep NICs in separate containers because then secondary
> processes won't work without serious rework.
> 
> So, all VFIO-mapped things will need to share the mappings.
> 
> It's not an insurmountable problem, but if we're going to share mapping
> status for VFIO (i.e. track which area is already mapped), then what's
> stopping us from doing the same for other DMA mapping mechanisms? I.e.
> instead of duplicating the mappings in each driver, provide some kind of
> mechanism for devices to share the DMA mapping caches. Apologies if i'm
> talking nonsense - i'm completely unfamiliar with how DMA mapping works
> for MLX/NXP devices :)

Unfortunately it cannot be done, at least with Mellanox. 
With Mellanox, the kernel driver is the one which maps the memory. The mapping returns a key which identifies the memory region that was just registered with the device.
There is complete separation between the ports, meaning one port's mapping cannot be used by the other port, even if the key is known. 

The separation is not only between ports, but also between processes (two primary ones; for secondary processes we have a way to share). If two processes work on the same device, they must register the memory independently. 

> 
> >
> >>
> >>>
> >>> For example, consider a host with Mellanox and Intel devices.
> >>> Mapping a memory without specifying to which port will end up with
> >>> IOMMU registration and Verbs (Mellanox DMA map) registration.
> >>> Another example can be two Mellanox devices on the same host. The
> >> memory
> >>> will be mapped for both, even though application will use mempool
> >>> per device.
> >>>
> >>> To use the suggested APIs the application will allocate a memory
> >>> block and will call rte_eth_dma_map. It will map it to every port
> >>> that needs DMA access to this memory.
> >>
> >> This bit is unclear to me. What do you mean "map it to every port
> >> that needs DMA access to this memory"? I don't see how this API
> >> solves the above problem of mapping the same memory to all devices.
> >> How does a device know which memory it will need? Does the user
> >> specifically have to call this API for each and every NIC they're using?
> >
> > Yes, the user will call this API for every port which needs to have DMA
> access to this memory.
> > Remember we are speaking here on external memory the application
> allocated and wants to use for send/receive. The device doesn't guess which
> memory it will need; the user tells it explicitly.
> >
> >>
> >> For DPDK-managed memory, everything will still get mapped to every
> >> device automatically, correct?
> >
> > Yes, even though it is not the case today.
> 
> What do you mean it is not the case? It is the case today. When external
> memory chunk is registered at the heap, a mem event callback is triggered
> just like for regular memory, and this chunk does get mapped to VFIO as well
> as any other subscribed entity. As i recall, NXP NICs currently are set up to
> ignore externally allocated memory, but for the general VFIO case,
> everything is mapped automatically.
> 
> >
> > If so, then such a manual approach for
> >> external memory will be bad for both usability and drop-in
> >> replacement of internal-to-external memory, because it introduces
> >> inconsistency between using internal and external memory. From my
> >> point of view, either we do *everything* manually (i.e. register all
> >> memory for DMA
> >> explicitly) and thereby avoid this problem but keep the consistency,
> >> or we do *everything* automatically and deal with duplication of
> >> mappings somehow (say, by MLX/NXP drivers sharing their mappings
> >> through bus interface).
> >
> > I understand your point, however I am not sure external and internal
> memory *must* be consistent.
> > The DPDK-managed memory is part of the DPDK subsystems and the DPDK
> libs are preparing it for the optimal use of the underlying devices. The
> external memory is different, it is a proprietary memory the application
> allocated and the DPDK cannot do anything in advance on it.
> 
> My view for designing external memory support was that it should behave
> like regular DPDK memory for all intents and purposes, and be a drop-in
> replacement, should you choose to use it. I.e. the application should not care
> whether it uses internal or external memory - it all sits in the same malloc
> heaps, it all uses the same socket ID mechanisms, etc. - for all intents and
> purposes, they're one and the same.
> 
> > Even today there is inconsistency, because if user wants to use external
> memory it must map it (rte_vfio_dma_map) while he doesn't need to do
> that for the DPDK-managed memory.
> 
> Well, now i see where your confusion stems from :) You didn't know about
> this:
> 
> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169
> 
> This will trigger all mem event callbacks, including VFIO DMA mapping
> callback.

I see, so I am indeed confused 😊.
In which cases should the application call the existing rte_vfio_dma_map, if the memory is already mapped and the only way to work with it is through the rte_malloc mechanism?
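For what it's worth, the automatic flow being referenced here, registering external memory with a named malloc heap so the mem event callbacks (VFIO included) fire, looks roughly like this from the application side. The DPDK calls below are stubs with assumed signatures so the sketch compiles standalone; a real application would include rte_malloc.h instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stubs with assumed signatures of the DPDK 18.11 external-memory API;
 * a real application would include <rte_malloc.h> and drop these. */
static int rte_malloc_heap_create(const char *heap_name)
{ (void)heap_name; return 0; }
static int rte_malloc_heap_memory_add(const char *heap_name, void *va_addr,
		uint64_t iova_addrs[], unsigned int n_pages, size_t page_sz)
{ (void)heap_name; (void)va_addr; (void)iova_addrs; (void)n_pages; (void)page_sz; return 0; }
static int rte_malloc_heap_get_socket(const char *heap_name)
{ (void)heap_name; return 256; /* pretend socket id */ }

/* Hand externally allocated pages to a named heap. The memory_add call
 * is the point where mem event callbacks fire and subscribers (e.g.
 * VFIO) map the memory - no explicit DMA-map call is needed. */
int register_external_memory(void *ext_mem, uint64_t iovas[],
			     unsigned int n_pages, size_t pgsz)
{
	if (rte_malloc_heap_create("ext_heap") != 0)
		return -1;
	return rte_malloc_heap_memory_add("ext_heap", ext_mem,
					  iovas, n_pages, pgsz);
}
```

Allocations would then go through rte_malloc_socket() using the socket id returned by rte_malloc_heap_get_socket("ext_heap").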

> 
> >
> > I guess we can we can add a flag on the device mapping which will say
> MAP_TO_ALL_DEVICES, to ease the application life in the presence of
> multiple device in the host.
> >
> 
> 
> --
> Thanks,
> Anatoly
Burakov, Anatoly Nov. 15, 2018, 10:59 a.m. UTC | #5
On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>> memory
>>
>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>> Hi Anatoly,
>>>
>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>>
>>>> Hi Shahaf,
>>>>
>>>> Great to see such effort! Few comments below.
>>>>
>>>> Note: halfway through writing my comments i realized that i am
>>>> starting with an assumption that this API is a replacement for
>>>> current VFIO DMA mapping API's. So, if my comments seem out of left
>>>> field, this is probably why :)
>>>>
>>>> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
>>>>> Request for comment on the high level changes present on this patch.
>>>>>
>>>>> The need to use external memory (memory belonging to the application and
>>>>> not part of the DPDK hugepages) is already present.
>>>>> Starting from storage apps which prefer to manage their own memory
>>>>> blocks for efficient use of the storage device. Continue with GPU
>>>>> based application which strives to achieve zero copy while
>>>>> processing the packet payload on the GPU core. And finally by
>>>>> vSwitch/vRouter application who just prefer to have a full control
>>>>> over the memory in use
>>>> (e.g. VPP).
>>>>>
>>>>> Recent work[1] in the DPDK enabled the use of external memory,
>>>>> however it mostly focus on VFIO as the only way to map memory.
>>>>> While VFIO is common, there are other vendors which use different
>>>>> ways to map memory (e.g. Mellanox and NXP[2]).
>>>>>
>>>>> The work in this patch moves the DMA mapping to vendor agnostic APIs
>>>>> located under ethdev. The choice in ethdev was because memory map
>>>>> should be associated with a specific port(s). Otherwise the memory
>>>>> is being mapped multiple times to different frameworks and ends up
>>>>> with memory being wasted on redundant translation table in the host
>>>>> or in the
>>>> device.
>>>>
>>>> So, anything other than ethdev (e.g. cryptodev) will not be able to
>>>> map memory for DMA?
>>>
>>> That's is a fair point.
>>>
>>>>
>>>> I have thought about this for some length of time, and i think DMA
>>>> mapping belongs in EAL (more specifically, somewhere at the bus
>>>> layer), rather than at device level.
>>>
>>> I am not sure I agree here. For example take Intel and Mellanox devices.
>> Both are PCI devices, so how will you distinguish which mapping API to use?
>>> Also I still think the mapping should be in device granularity and not
>> bus/system granularity, since it is very typical for a memory to be used for
>> DMA be a specific device.
>>>
>>> Maybe we can say the DMA mapping is a rte_device attribute. It is the
>> parent class for all the DPDK devices.
>>> We need to see w/ vport representors (which all has the same rte_device).
>> On that case I believe the rte_device.map call can register the memory to all
>> of the representors as well (if needed).
>>>
>>> Placing this functionality at device level comes with more work
>>>> to support different device types and puts a burden on device driver
>>>> developers to implement their own mapping functions.
>>>
>>> The mapping function can be shared. For example we can still maintain the
>> vfio mapping scheme as part of eal and have all the related driver to call this
>> function.
>>> The only overhead will be to maintain the function pointer for the dma call.
>>> With this work, instead of the eal layer to guess which type of DMA
>> mapping the devices in the  system needs or alternatively force them all to
>> work w/ VFIO, each driver will select its own function.
>>> The driver is the only one which knows what type of DMA mapping its
>> device needs.
>>>
>>>>
>>>> However, i have no familiarity with how MLX/NXP devices do their DMA
>>>> mapping, so maybe the device-centric approach would be better. We
>>>> could provide "standard" mapping functions at the bus level (such as
>>>> VFIO mapping functions for PCI bus), so that this could would not
>>>> have to be reimplemented in the devices.
>>>
>>> Yes, like I said above, I wasn't intending to re-implement all the mapping
>> function again on each driver. Yet, I believe it should be per device.
>>>
>>>>
>>>> Moreover, i'm not sure how this is going to work for VFIO. If this is
>>>> to be called for each NIC that needs access to the memory, then we'll
>>>> end up with double mappings for any NIC that uses VFIO, unless you
>>>> want each NIC to be in a separate container.
>>>
>>> I am not much familiar w/ VFIO (you are the expert 😊).
>>> What will happen if we map the same memory twice (under same
>> container)? The translation on the IOMMU will be doubled? The map will
>> return with error that this memory mapping already exists?
>>
>> The latter. You can't map the same memory twice in the same container.
>> You can't even keep NICs in separate containers because then secondary
>> processes won't work without serious rework.
>>
>> So, all VFIO-mapped things will need to share the mappings.
>>
>> It's not an insurmountable problem, but if we're going to share mapping
>> status for VFIO (i.e. track which area is already mapped), then what's
>> stopping us from doing the same for other DMA mapping mechanisms? I.e.
>> instead of duplicating the mappings in each driver, provide some kind of
>> mechanism for devices to share the DMA mapping caches. Apologies if i'm
>> talking nonsense - i'm completely unfamiliar with how DMA mapping works
>> for MLX/NXP devices :)
> 
> Unfortunately it cannot be done, at least not with Mellanox.
> In Mellanox the kernel driver is the one which maps the memory. The mapping returns a key which identifies a memory region that was just registered to the device.
> There is a complete separation between the ports, meaning one port's mapping cannot be used by the other port, even if the key is known.
> 
> The separation is not only between ports, but also between processes (two primary ones; for secondary processes we have a way to share). If two processes work on the same device, they must register the memory independently.

Ah, OK.

So, we're right back to where we started. Right now, external memory 
expects to behave the same way as all other memory - you don't need to 
perform DMA mapping for it.

That said, part of the reason *why* it was done that way was because 
there is no way to trigger VFIO DMA mapping for NXP (or was it MLX?) 
devices. If you look at initial versions of the patchset, the DMA 
mapping was actually done manually. Then, i became convinced that doing 
this automatically is the way to go, both because it erases the 
usability differences as far as memory types are concerned, and because 
it enables whatever services that are subscribing to memory events to 
receive notifications about external memory as well (i.e. consistency).

Given that it's still an experimental API, we can tinker with it all we 
like, so it's not set in stone. However, i would really like to keep the 
current automagic thing, because DMA mapping may not be the only user of 
memory callbacks - they can be used for debug purposes, or for any other 
things.

>>>>> For example, consider a host with Mellanox and Intel devices.
>>>>> Mapping a memory without specifying to which port will end up with
>>>>> IOMMU registration and Verbs (Mellanox DMA map) registration.
>>>>> Another example can be two Mellanox devices on the same host. The
>>>> memory
>>>>> will be mapped for both, even though application will use mempool
>>>>> per device.
>>>>>
>>>>> To use the suggested APIs the application will allocate a memory
>>>>> block and will call rte_eth_dma_map. It will map it to every port
>>>>> that needs DMA access to this memory.
>>>>
>>>> This bit is unclear to me. What do you mean "map it to every port
>>>> that needs DMA access to this memory"? I don't see how this API
>>>> solves the above problem of mapping the same memory to all devices.
>>>> How does a device know which memory it will need? Does the user
>>>> specifically have to call this API for each and every NIC they're using?
>>>
>>> Yes, the user will call this API for every port which needs to have DMA
>> access to this memory.
>>> Remember we are speaking here on external memory the application
>> allocated and wants to use for send/receive.  The device doesn't guess which
>> memory he will need, the user is telling it to him explicitly.
>>>
>>>>
>>>> For DPDK-managed memory, everything will still get mapped to every
>>>> device automatically, correct?
>>>
>>> Yes, even though it is not the case today.
>>
>> What do you mean it is not the case? It is the case today. When external
>> memory chunk is registered at the heap, a mem event callback is triggered
>> just like for regular memory, and this chunk does get mapped to VFIO as well
>> as any other subscribed entity. As i recall, NXP NICs currently are set up to
>> ignore externally allocated memory, but for the general VFIO case,
>> everything is mapped automatically.
>>
>>>
>>> If so, then such a manual approach for
>>>> external memory will be bad for both usability and drop-in
>>>> replacement of internal-to-external memory, because it introduces
>>>> inconsistency between using internal and external memory. From my
>>>> point of view, either we do *everything* manually (i.e. register all
>>>> memory for DMA
>>>> explicitly) and thereby avoid this problem but keep the consistency,
>>>> or we do *everything* automatically and deal with duplication of
>>>> mappings somehow (say, by MLX/NXP drivers sharing their mappings
>>>> through bus interface).
>>>
>>> I understand your point, however I am not sure external and internal
>> memory *must* be consist.
>>> The DPDK-managed memory is part of the DPDK subsystems and the DPDK
>> libs are preparing it for the optimal use of the underlying devices. The
>> external memory is different, it is a proprietary memory the application
>> allocated and the DPDK cannot do anything in advance on it.
>>
>> My view for designing external memory support was that it should behave
>> like regular DPDK memory for all intents and purposes, and be a drop-in
>> replacement, should you choose to use it. I.e. the application should not care
>> whether it uses internal or external memory - it all sits in the same malloc
>> heaps, it all uses the same socket ID mechanisms, etc. - for all intents and
>> purposes, they're one and the same.
>>
>>> Even today there is inconsistency, because if user wants to use external
>> memory it must map it (rte_vfio_dma_map) while he doesn't need to do
>> that for the DPDK-managed memory.
>>
>> Well, now i see where your confusion stems from :) You didn't know about
>> this:
>>
>> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169
>>
>> This will trigger all mem event callbacks, including VFIO DMA mapping
>> callback.
> 
> I see, so I am indeed confused 😊.
> In which cases should the application call the existing rte_vfio_dma_map, if the memory is already mapped and the only way to work with it is through the rte_malloc mechanism?

I have to be honest, I didn't consider this question before :D I guess 
there could be cases where using rte_malloc might not be suitable 
because it wastes some memory on malloc elements, i.e. if you want to 
use N pages as memory, you'd have to allocate N+1 pages. If memory is at 
a premium, maybe manual management of it would be better in some cases.
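The N vs. N+1 pages point can be made concrete with a little arithmetic: each rte_malloc allocation is preceded by a malloc element header carved out of the same area, so asking for exactly N pages' worth of usable memory spills into one extra page. The header size below is purely illustrative, not the real struct malloc_elem size:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative header size; the real struct malloc_elem differs. */
#define ELEM_HEADER_SZ 128
#define PAGE_SZ (2u * 1024 * 1024) /* 2M hugepage */

/* Pages that must be added to the heap so that 'want' usable bytes can
 * be allocated once the element header is carved out of the area. */
size_t pages_needed(size_t want)
{
	return (want + ELEM_HEADER_SZ + PAGE_SZ - 1) / PAGE_SZ;
}
```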

> 
>>
>>>
>>> I guess we can we can add a flag on the device mapping which will say
>> MAP_TO_ALL_DEVICES, to ease the application life in the presence of
>> multiple device in the host.
>>>
>>
>>
>> --
>> Thanks,
>> Anatoly
Shahaf Shuler Nov. 19, 2018, 11:20 a.m. UTC | #6
Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> memory
> 
> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> > Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> >> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> >> memory
> >>
> >> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> >>> Hi Anatoly,
> >>>
> >>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>> external memory
> >>>>
> >>>> Hi Shahaf,
> >>>>
> >>>> Great to see such effort! Few comments below.
> >>>>
> >>>> Note: halfway through writing my comments i realized that i am
> >>>> starting with an assumption that this API is a replacement for
> >>>> current VFIO DMA mapping API's. So, if my comments seem out of left
> >>>> field, this is probably why :)
> >>>>
> >>>> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
> >>>>> Request for comment on the high level changes present on this patch.
> >>>>>
> >>>>> The need to use external memory (memory belong to application and
> >>>>> not part of the DPDK hugepages) is allready present.
> >>>>> Starting from storage apps which prefer to manage their own memory
> >>>>> blocks for efficient use of the storage device. Continue with GPU
> >>>>> based application which strives to achieve zero copy while
> >>>>> processing the packet payload on the GPU core. And finally by
> >>>>> vSwitch/vRouter application who just prefer to have a full control
> >>>>> over the memory in use
> >>>> (e.g. VPP).
> >>>>>
> >>>>> Recent work[1] in the DPDK enabled the use of external memory,
> >>>>> however it mostly focus on VFIO as the only way to map memory.
> >>>>> While VFIO is common, there are other vendors which use different
> >>>>> ways to map memory (e.g. Mellanox and NXP[2]).
> >>>>>
> >>>>> The work in this patch moves the DMA mapping to vendor agnostic
> >>>>> APIs located under ethdev. The choice in ethdev was because
> memory
> >>>>> map should be associated with a specific port(s). Otherwise the
> >>>>> memory is being mapped multiple times to different frameworks and
> >>>>> ends up with memory being wasted on redundant translation table in
> >>>>> the host or in the
> >>>> device.
> >>>>
> >>>> So, anything other than ethdev (e.g. cryptodev) will not be able to
> >>>> map memory for DMA?
> >>>
> >>> That's is a fair point.
> >>>
> >>>>
> >>>> I have thought about this for some length of time, and i think DMA
> >>>> mapping belongs in EAL (more specifically, somewhere at the bus
> >>>> layer), rather than at device level.
> >>>
> >>> I am not sure I agree here. For example take Intel and Mellanox devices.
> >> Both are PCI devices, so how will you distinguish which mapping API to
> use?
> >>> Also I still think the mapping should be in device granularity and
> >>> not
> >> bus/system granularity, since it is very typical for a memory to be
> >> used for DMA be a specific device.
> >>>
> >>> Maybe we can say the DMA mapping is a rte_device attribute. It is
> >>> the
> >> parent class for all the DPDK devices.
> >>> We need to see w/ vport representors (which all has the same
> rte_device).
> >> On that case I believe the rte_device.map call can register the
> >> memory to all of the representors as well (if needed).
> >>>
> >>> Placing this functionality at device level comes with more work
> >>>> to support different device types and puts a burden on device
> >>>> driver developers to implement their own mapping functions.
> >>>
> >>> The mapping function can be shared. For example we can still
> >>> maintain the
> >> vfio mapping scheme as part of eal and have all the related driver to
> >> call this function.
> >>> The only overhead will be to maintain the function pointer for the dma
> call.
> >>> With this work, instead of the eal layer to guess which type of DMA
> >> mapping the devices in the  system needs or alternatively force them
> >> all to work w/ VFIO, each driver will select its own function.
> >>> The driver is the only one which knows what type of DMA mapping its
> >> device needs.
> >>>
> >>>>
> >>>> However, i have no familiarity with how MLX/NXP devices do their
> >>>> DMA mapping, so maybe the device-centric approach would be better.
> >>>> We could provide "standard" mapping functions at the bus level
> >>>> (such as VFIO mapping functions for PCI bus), so that this could
> >>>> would not have to be reimplemented in the devices.
> >>>
> >>> Yes, like I said above, I wasn't intending to re-implement all the
> >>> mapping
> >> function again on each driver. Yet, I believe it should be per device.
> >>>
> >>>>
> >>>> Moreover, i'm not sure how this is going to work for VFIO. If this
> >>>> is to be called for each NIC that needs access to the memory, then
> >>>> we'll end up with double mappings for any NIC that uses VFIO,
> >>>> unless you want each NIC to be in a separate container.
> >>>
> >>> I am not much familiar w/ VFIO (you are the expert 😊).
> >>> What will happen if we map the same memory twice (under same
> >> container)? The translation on the IOMMU will be doubled? The map
> >> will return with error that this memory mapping already exists?
> >>
> >> The latter. You can't map the same memory twice in the same container.
> >> You can't even keep NICs in separate containers because then
> >> secondary processes won't work without serious rework.
> >>
> >> So, all VFIO-mapped things will need to share the mappings.
> >>
> >> It's not an insurmountable problem, but if we're going to share
> >> mapping status for VFIO (i.e. track which area is already mapped),
> >> then what's stopping us from doing the same for other DMA mapping
> mechanisms? I.e.
> >> instead of duplicating the mappings in each driver, provide some kind
> >> of mechanism for devices to share the DMA mapping caches. Apologies
> >> if i'm talking nonsense - i'm completely unfamiliar with how DMA
> >> mapping works for MLX/NXP devices :)
> >
> > Unfortunately it cannot be done, at least not with Mellanox.
> > In Mellanox the kernel driver is the one which maps the memory. The
> mapping returns a key which identifies a memory region that was just
> registered to the device.
> > There is a complete separation between the ports, meaning one port's
> mapping cannot be used by the other port, even if the key is known.
> >
> > The separation is not only between ports, but also between processes (two
> primary ones; for secondary processes we have a way to share). If two processes
> work on the same device, they must register the memory independently.
> 
> Ah, OK.
> 
> So, we're right back to where we started. Right now, external memory
> expects to behave the same way as all other memory - you don't need to
> perform DMA mapping for it.
> 
> That said, part of the reason *why* it was done that way was because there
> is no way to trigger VFIO DMA mapping for NXP (or was it MLX?) devices. If
> you look at initial versions of the patchset, the DMA mapping was actually
> done manually. Then, i became convinced that doing this automatically is the
> way to go, both because it erases the usability differences as far as memory
> types are concerned, and because it enables whatever services that are
> subscribing to memory events to receive notifications about external
> memory as well (i.e. consistency).
> 
> Given that it's still an experimental API, we can tinker with it all we like, so it's
> not set in stone. However, i would really like to keep the current automagic
> thing, because DMA mapping may not be the only user of memory callbacks -
> they can be used for debug purposes, or for any other things.

Memory callbacks are good to have regardless.
The question we need to answer is whether or not we are going to provide the DMA map abstraction for external memory. See more below. 


> 
> >>>>> For example, consider a host with Mellanox and Intel devices.
> >>>>> Mapping a memory without specifying to which port will end up with
> >>>>> IOMMU registration and Verbs (Mellanox DMA map) registration.
> >>>>> Another example can be two Mellanox devices on the same host. The
> >>>> memory
> >>>>> will be mapped for both, even though application will use mempool
> >>>>> per device.
> >>>>>
> >>>>> To use the suggested APIs the application will allocate a memory
> >>>>> block and will call rte_eth_dma_map. It will map it to every port
> >>>>> that needs DMA access to this memory.
> >>>>
> >>>> This bit is unclear to me. What do you mean "map it to every port
> >>>> that needs DMA access to this memory"? I don't see how this API
> >>>> solves the above problem of mapping the same memory to all devices.
> >>>> How does a device know which memory it will need? Does the user
> >>>> specifically have to call this API for each and every NIC they're using?
> >>>
> >>> Yes, the user will call this API for every port which needs to have
> >>> DMA
> >> access to this memory.
> >>> Remember we are speaking here on external memory the application
> >> allocated and wants to use for send/receive.  The device doesn't
> >> guess which memory he will need, the user is telling it to him explicitly.
> >>>
> >>>>
> >>>> For DPDK-managed memory, everything will still get mapped to every
> >>>> device automatically, correct?
> >>>
> >>> Yes, even though it is not the case today.
> >>
> >> What do you mean it is not the case? It is the case today. When
> >> external memory chunk is registered at the heap, a mem event callback
> >> is triggered just like for regular memory, and this chunk does get
> >> mapped to VFIO as well as any other subscribed entity. As i recall,
> >> NXP NICs currently are set up to ignore externally allocated memory,
> >> but for the general VFIO case, everything is mapped automatically.
> >>
> >>>
> >>> If so, then such a manual approach for
> >>>> external memory will be bad for both usability and drop-in
> >>>> replacement of internal-to-external memory, because it introduces
> >>>> inconsistency between using internal and external memory. From my
> >>>> point of view, either we do *everything* manually (i.e. register
> >>>> all memory for DMA
> >>>> explicitly) and thereby avoid this problem but keep the
> >>>> consistency, or we do *everything* automatically and deal with
> >>>> duplication of mappings somehow (say, by MLX/NXP drivers sharing
> >>>> their mappings through bus interface).
> >>>
> >>> I understand your point, however I am not sure external and internal
> >> memory *must* be consist.
> >>> The DPDK-managed memory is part of the DPDK subsystems and the
> DPDK
> >> libs are preparing it for the optimal use of the underlying devices.
> >> The external memory is different, it is a proprietary memory the
> >> application allocated and the DPDK cannot do anything in advance on it.
> >>
> >> My view for designing external memory support was that it should
> >> behave like regular DPDK memory for all intents and purposes, and be
> >> a drop-in replacement, should you choose to use it. I.e. the
> >> application should not care whether it uses internal or external
> >> memory - it all sits in the same malloc heaps, it all uses the same
> >> socket ID mechanisms, etc. - for all intents and purposes, they're one and
> the same.
> >>
> >>> Even today there is inconsistency, because if user wants to use
> >>> external
> >> memory it must map it (rte_vfio_dma_map) while he doesn't need to do
> >> that for the DPDK-managed memory.
> >>
> >> Well, now i see where your confusion stems from :) You didn't know
> >> about
> >> this:
> >>
> >>
> >> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169
> >>
> >> This will trigger all mem event callbacks, including VFIO DMA mapping
> >> callback.
> >
> > I see, so I am indeed confused 😊.
> > In which cases should the application call the existing rte_vfio_dma_map,
> if the memory is already mapped and the only way to work with it is through
> the rte_malloc mechanism?
> 
> I have to be honest, I didn't consider this question before :D I guess there
> could be cases where using rte_malloc might not be suitable because it
> wastes some memory on malloc elements, i.e. if you want to use N pages as
> memory, you'd have to allocate N+1 pages. If memory is at a premium,
> maybe manual management of it would be better in some cases.

I had similar thoughts, more related to usability from the user's side.
When the application allocates external memory it just wants to use it for DMA, i.e. put it as the mbuf buf_addr or populate a mempool with it.
It is an "overhead" to create a socket for this external memory, to populate it with the memory, and later on to malloc from this socket (or use the socket id for the mempool creation).
Not to mention the fact that maybe the application wants to manage this memory differently than rte_malloc does.

On the other hand, mapping memory to a device before using it for DMA is far more intuitive.
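To make that contrast concrete, the explicit flow this RFC proposes would look something like the sketch below. The rte_eth_dma_map signature here is an assumption based on the RFC text (port, virtual address, IOVA, length), and it is stubbed so the sketch stands alone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed signature of the proposed ethdev API; stubbed for illustration. */
static int rte_eth_dma_map(uint16_t port_id, void *addr, uint64_t iova, size_t len)
{ (void)port_id; (void)addr; (void)iova; (void)len; return 0; }

/* The application allocates memory however it likes, then explicitly
 * maps it to every port that will DMA to/from it - no heap, socket id,
 * or malloc indirection involved. */
int map_to_ports(void *buf, uint64_t iova, size_t len,
		 const uint16_t *ports, unsigned int n_ports)
{
	unsigned int i;

	for (i = 0; i < n_ports; i++)
		if (rte_eth_dma_map(ports[i], buf, iova, len) != 0)
			return -1;
	return 0;
}
```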

> 
> >
> >>
> >>>
> >>> I guess we can we can add a flag on the device mapping which will
> >>> say
> >> MAP_TO_ALL_DEVICES, to ease the application life in the presence of
> >> multiple device in the host.
> >>>
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
Stephen Hemminger Nov. 19, 2018, 5:04 p.m. UTC | #7
On Thu, 15 Nov 2018 10:59:36 +0000
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> > Unfortunately it cannot be done, at least not with Mellanox.
> > In Mellanox the kernel driver is the one which maps the memory. The mapping returns a key which identifies a memory region that was just registered to the device.
> > There is a complete separation between the ports, meaning one port's mapping cannot be used by the other port, even if the key is known.
> > 
> > The separation is not only between ports, but also between processes (two primary ones; for secondary we have a way to share). If two processes work on the same device, they must register the memory independently.
> 
> Ah, OK.
> 
> So, we're right back to where we started. Right now, external memory 
> expects to behave the same way as all other memory - you don't need to 
> perform DMA mapping for it.
> 
> That said, part of the reason *why* it was done that way was because 
> there is no way to trigger VFIO DMA mapping for NXP (or was it MLX?) 
> devices. If you look at initial versions of the patchset, the DMA 
> mapping was actually done manually. Then, i became convinced that doing 
> this automatically is the way to go, both because it erases the 
> usability differences as far as memory types are concerned, and because 
> it enables whatever services that are subscribing to memory events to 
> receive notifications about external memory as well (i.e. consistency).
> 
> Given that it's still an experimental API, we can tinker with it all we 
> like, so it's not set in stone. However, i would really like to keep the 
> current automagic thing, because DMA mapping may not be the only user of 
> memory callbacks - they can be used for debug purposes, or for any other 
> things.

If you look at the netvsc PMD, you will discover that it has the kernel
set up the receive memory area. There is some flexibility about where it
is mmap'd, but the memory comes from pinned kernel memory.
Burakov, Anatoly Nov. 19, 2018, 5:18 p.m. UTC | #8
On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>> memory
>>
>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>>
>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>> Hi Anatoly,
>>>>>
>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>>
>>>>>> Hi Shahaf,
>>>>>>
>>>>>> Great to see such effort! Few comments below.
>>>>>>
>>>>>> Note: halfway through writing my comments i realized that i am
>>>>>> starting with an assumption that this API is a replacement for
>>>>>> current VFIO DMA mapping API's. So, if my comments seem out of left
>>>>>> field, this is probably why :)
>>>>>>
>>>>>> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
>>>>>>> Request for comment on the high level changes present on this patch.
>>>>>>>
>>>>>>> The need to use external memory (memory belong to application and
>>>>>>> not part of the DPDK hugepages) is allready present.
>>>>>>> Starting from storage apps which prefer to manage their own memory
>>>>>>> blocks for efficient use of the storage device. Continue with GPU
>>>>>>> based application which strives to achieve zero copy while
>>>>>>> processing the packet payload on the GPU core. And finally by
>>>>>>> vSwitch/vRouter application who just prefer to have a full control
>>>>>>> over the memory in use
>>>>>> (e.g. VPP).
>>>>>>>
>>>>>>> Recent work[1] in the DPDK enabled the use of external memory,
>>>>>>> however it mostly focus on VFIO as the only way to map memory.
>>>>>>> While VFIO is common, there are other vendors which use different
>>>>>>> ways to map memory (e.g. Mellanox and NXP[2]).
>>>>>>>
>>>>>>> The work in this patch moves the DMA mapping to vendor agnostic
>>>>>>> APIs located under ethdev. The choice in ethdev was because
>> memory
>>>>>>> map should be associated with a specific port(s). Otherwise the
>>>>>>> memory is being mapped multiple times to different frameworks and
>>>>>>> ends up with memory being wasted on redundant translation table in
>>>>>>> the host or in the
>>>>>> device.
>>>>>>
>>>>>> So, anything other than ethdev (e.g. cryptodev) will not be able to
>>>>>> map memory for DMA?
>>>>>
>>>>> That's is a fair point.
>>>>>
>>>>>>
>>>>>> I have thought about this for some length of time, and i think DMA
>>>>>> mapping belongs in EAL (more specifically, somewhere at the bus
>>>>>> layer), rather than at device level.
>>>>>
>>>>> I am not sure I agree here. For example take Intel and Mellanox devices.
>>>> Both are PCI devices, so how will you distinguish which mapping API to
>> use?
>>>>> Also I still think the mapping should be in device granularity and
>>>>> not
>>>> bus/system granularity, since it is very typical for a memory to be
>>>> used for DMA be a specific device.
>>>>>
>>>>> Maybe we can say the DMA mapping is a rte_device attribute. It is
>>>>> the
>>>> parent class for all the DPDK devices.
>>>>> We need to see w/ vport representors (which all has the same
>> rte_device).
>>>> On that case I believe the rte_device.map call can register the
>>>> memory to all of the representors as well (if needed).
>>>>>
>>>>> Placing this functionality at device level comes with more work
>>>>>> to support different device types and puts a burden on device
>>>>>> driver developers to implement their own mapping functions.
>>>>>
>>>>> The mapping function can be shared. For example we can still
>>>>> maintain the
>>>> vfio mapping scheme as part of eal and have all the related driver to
>>>> call this function.
>>>>> The only overhead will be to maintain the function pointer for the dma
>> call.
>>>>> With this work, instead of the eal layer to guess which type of DMA
>>>> mapping the devices in the  system needs or alternatively force them
>>>> all to work w/ VFIO, each driver will select its own function.
>>>>> The driver is the only one which knows what type of DMA mapping its
>>>> device needs.
>>>>>
>>>>>>
>>>>>> However, i have no familiarity with how MLX/NXP devices do their
>>>>>> DMA mapping, so maybe the device-centric approach would be better.
>>>>>> We could provide "standard" mapping functions at the bus level
>>>>>> (such as VFIO mapping functions for PCI bus), so that this could
>>>>>> would not have to be reimplemented in the devices.
>>>>>
>>>>> Yes, like I said above, I wasn't intending to re-implement all the
>>>>> mapping
>>>> function again on each driver. Yet, I believe it should be per device.
>>>>>
>>>>>>
>>>>>> Moreover, i'm not sure how this is going to work for VFIO. If this
>>>>>> is to be called for each NIC that needs access to the memory, then
>>>>>> we'll end up with double mappings for any NIC that uses VFIO,
>>>>>> unless you want each NIC to be in a separate container.
>>>>>
>>>>> I am not very familiar w/ VFIO (you are the expert 😊).
>>>>> What will happen if we map the same memory twice (under same
>>>> container)? The translation on the IOMMU will be doubled? The map
>>>> will return with error that this memory mapping already exists?
>>>>
>>>> The latter. You can't map the same memory twice in the same container.
>>>> You can't even keep NICs in separate containers because then
>>>> secondary processes won't work without serious rework.
>>>>
>>>> So, all VFIO-mapped things will need to share the mappings.
>>>>
>>>> It's not an insurmountable problem, but if we're going to share
>>>> mapping status for VFIO (i.e. track which area is already mapped),
>>>> then what's stopping us from doing the same for other DMA mapping
>> mechanisms? I.e.
>>>> instead of duplicating the mappings in each driver, provide some kind
>>>> of mechanism for devices to share the DMA mapping caches. Apologies
>>>> if i'm talking nonsense - i'm completely unfamiliar with how DMA
>>>> mapping works for MLX/NXP devices :)
>>>
>>> Unfortunately it cannot be done, at least w/ Mellanox.
>>> In Mellanox the kernel driver is the one which maps the memory. The
>> mapping returns a key which identifies a memory region which was just
>> registered to the device.
>>> There is a complete separation between the ports, meaning one port's
>> mapping cannot be used by the other port, even if the key is known.
>>>
>>> The separation is not only in ports, but also in processes (two primary ones;
>> for secondary we have a way to share). If two processes work on the same
>> device, they must register the memory independently.
>>
>> Ah, OK.
>>
>> So, we're right back to where we started. Right now, external memory
>> expects to behave the same way as all other memory - you don't need to
>> perform DMA mapping for it.
>>
>> That said, part of the reason *why* it was done that way was because there
>> is no way to trigger VFIO DMA mapping for NXP (or was it MLX?) devices. If
>> you look at initial versions of the patchset, the DMA mapping was actually
>> done manually. Then, i became convinced that doing this automatically is the
>> way to go, both because it erases the usability differences as far as memory
>> types are concerned, and because it enables whatever services that are
>> subscribing to memory events to receive notifications about external
>> memory as well (i.e. consistency).
>>
>> Given that it's still an experimental API, we can tinker with it all we like, so it's
>> not set in stone. However, i would really like to keep the current automagic
>> thing, because DMA mapping may not be the only user of memory callbacks -
>> they can be used for debug purposes, or for any other things.
> 
> Memory callbacks are good to have regardless.
> The question we need to answer is whether or not we are going to provide the DMA map abstraction for external memory. See more below.
> 
> 
>>
>>>>>>> For example, consider a host with Mellanox and Intel devices.
>>>>>>> Mapping a memory without specifying to which port will end up with
>>>>>>> IOMMU registration and Verbs (Mellanox DMA map) registration.
>>>>>>> Another example can be two Mellanox devices on the same host. The
>>>>>> memory
>>>>>>> will be mapped for both, even though application will use mempool
>>>>>>> per device.
>>>>>>>
>>>>>>> To use the suggested APIs the application will allocate a memory
>>>>>>> block and will call rte_eth_dma_map. It will map it to every port
>>>>>>> that needs DMA access to this memory.
>>>>>>
>>>>>> This bit is unclear to me. What do you mean "map it to every port
>>>>>> that needs DMA access to this memory"? I don't see how this API
>>>>>> solves the above problem of mapping the same memory to all devices.
>>>>>> How does a device know which memory it will need? Does the user
>>>>>> specifically have to call this API for each and every NIC they're using?
>>>>>
>>>>> Yes, the user will call this API for every port which needs to have
>>>>> DMA
>>>> access to this memory.
>>>>> Remember we are speaking here on external memory the application
>>>> allocated and wants to use for send/receive.  The device doesn't
>>>> guess which memory he will need, the user is telling it to him explicitly.
>>>>>
>>>>>>
>>>>>> For DPDK-managed memory, everything will still get mapped to every
>>>>>> device automatically, correct?
>>>>>
>>>>> Yes, even though it is not the case today.
>>>>
>>>> What do you mean it is not the case? It is the case today. When
>>>> external memory chunk is registered at the heap, a mem event callback
>>>> is triggered just like for regular memory, and this chunk does get
>>>> mapped to VFIO as well as any other subscribed entity. As i recall,
>>>> NXP NICs currently are set up to ignore externally allocated memory,
>>>> but for the general VFIO case, everything is mapped automatically.
>>>>
>>>>>
>>>>> If so, then such a manual approach for
>>>>>> external memory will be bad for both usability and drop-in
>>>>>> replacement of internal-to-external memory, because it introduces
>>>>>> inconsistency between using internal and external memory. From my
>>>>>> point of view, either we do *everything* manually (i.e. register
>>>>>> all memory for DMA
>>>>>> explicitly) and thereby avoid this problem but keep the
>>>>>> consistency, or we do *everything* automatically and deal with
>>>>>> duplication of mappings somehow (say, by MLX/NXP drivers sharing
>>>>>> their mappings through bus interface).
>>>>>
>>>>> I understand your point, however I am not sure external and internal
>>>> memory *must* be consist.
>>>>> The DPDK-managed memory is part of the DPDK subsystems and the
>> DPDK
>>>> libs are preparing it for the optimal use of the underlying devices.
>>>> The external memory is different, it is a proprietary memory the
>>>> application allocated and the DPDK cannot do anything in advance on it.
>>>>
>>>> My view for designing external memory support was that it should
>>>> behave like regular DPDK memory for all intents and purposes, and be
>>>> a drop-in replacement, should you choose to use it. I.e. the
>>>> application should not care whether it uses internal or external
>>>> memory - it all sits in the same malloc heaps, it all uses the same
>>>> socket ID mechanisms, etc. - for all intents and purposes, they're one and
>> the same.
>>>>
>>>>> Even today there is inconsistency, because if user wants to use
>>>>> external
>>>> memory it must map it (rte_vfio_dma_map) while he doesn't need to do
>>>> that for the DPDK-managed memory.
>>>>
>>>> Well, now i see where your confusion stems from :) You didn't know
>>>> about
>>>> this:
>>>>
>>>>
>>>> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169
>>>>
>>>> This will trigger all mem event callbacks, including VFIO DMA mapping
>>>> callback.
>>>
>>> I see, so I am indeed confused 😊.
>>> In which cases should the application call the existing rte_vfio_dma_map,
>> if the memory is already mapped and the only way to work with it is through
>> the rte_malloc mechanism?
>>
>> I have to be honest, I didn't consider this question before :D I guess there
>> could be cases where using rte_malloc might not be suitable because it
>> wastes some memory on malloc elements, i.e. if you want to use N pages as
>> memory, you'd have to allocate N+1 pages. If memory is at a premium,
>> maybe manual management of it would be better in some cases.
> 
> I had similar thoughts, more related to usability from the user side.
> When the application allocates external memory it just wants to use it for DMA, i.e. set it as the mbuf buf_addr or populate a mempool with it.
> It is an "overhead" to create a socket for this external memory, to populate it w/ the memory, and later on to malloc from this socket (or use the socket id for the mempool creation).
> Not to mention the fact that maybe the application wants to manage this memory differently than how rte_malloc does.
> 
> On the other hand, mapping memory to a device before using it for DMA is far more intuitive.
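
To make the two workflows being compared concrete, here is a minimal C sketch of both. The heap-based path uses the external-memory APIs added in DPDK 18.11; `rte_eth_dma_map()` is the API *proposed* by this RFC, so its name and signature are assumptions, not an existing call.

```c
#include <stddef.h>
#include <rte_malloc.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Path 1 (18.11): register app memory as a DPDK heap, then allocate
 * from its socket id. Mem event callbacks (incl. VFIO DMA mapping)
 * fire inside rte_malloc_heap_memory_add(). */
static struct rte_mempool *
pool_from_external_heap(void *ext_mem, size_t len, size_t pgsz)
{
	int socket_id;

	if (rte_malloc_heap_create("app_heap") != 0)
		return NULL;
	/* NULL iova array asks EAL to detect the IOVA addresses. */
	if (rte_malloc_heap_memory_add("app_heap", ext_mem, len,
			NULL, len / pgsz, pgsz) != 0)
		return NULL;
	socket_id = rte_malloc_heap_get_socket("app_heap");
	return rte_pktmbuf_pool_create("ext_pool", 8192, 256, 0,
			RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
}

/* Path 2 (this RFC, hypothetical): the app keeps managing its memory
 * and only tells each port that needs DMA access about it. */
static int
map_for_port(uint16_t port_id, void *ext_mem, uint64_t iova, size_t len)
{
	return rte_eth_dma_map(port_id, ext_mem, iova, len);
}
```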

It is far more intuitive *if* you're doing all of the memory management 
yourself or "just" using this memory for a mempool. This was already 
working before, and if you had that as your use case, there is no need 
for the external memory feature.

On the other hand, if you were to use it in a different way - for 
example, allocating hash tables or other DPDK data structures - then 
such a feature is essential. The entire point was to allow using 
external memory with semantics identical to how you use the rest of DPDK.

Also, whether it's "intuitive" depends on perspective - you say "i 
expect to allocate memory and map it for DMA myself", i say "why do i 
have to care about DMA mapping, DPDK should do this for me" :) If you 
are using your own memory management and doing everything manually - you 
get to map everything for DMA manually. If you are using DPDK facilities 
- it is intuitive that for any DPDK-managed memory, internal or 
external, same rules apply across the board - DMA mapping included.

> 
>>
>>>
>>>>
>>>>>
>>>>> I guess we can add a flag on the device mapping which will say
>>>> MAP_TO_ALL_DEVICES, to ease the application's life in the presence of
>>>> multiple devices in the host.
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly
Shahaf Shuler Nov. 20, 2018, 8:08 a.m. UTC | #9
Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> memory
> 
> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> > Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
> >> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> >> memory
> >>
> >> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> >>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> >>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>> external memory
> >>>>
> >>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> >>>>> Hi Anatoly,
> >>>>>
> >>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>> external memory
> >>>>>>

[...]

> >>
> >> I have to be honest, I didn't consider this question before :D I
> >> guess there could be cases where using rte_malloc might not be
> >> suitable because it wastes some memory on malloc elements, i.e. if
> >> you want to use N pages as memory, you'd have to allocate N+1 pages.
> >> If memory is at a premium, maybe manual management of it would be
> better in some cases.
> >
> > I had similar thoughts, more related to the usability from the user side.
> > When the application allocates external memory it just wants to use it
> for DMA, i.e. put it as the mbuf buf_addr or to populate it w/ a mempool.
> > It is an "overhead" to create a socket for this external memory, to populate
> it w/ the memory, and later on to malloc from this socket (or use the socket
> id for the mempool creation).
> > Not to mention the fact that maybe the application wants to manage this
> memory differently than how rte_malloc does.
> >
> > On the other hand, mapping memory to device before using it for dma is
> far more intuitive.
> 
> It is far more intuitive *if* you're doing all of the memory management
> yourself or "just" using this memory for a mempool. This was already working
> before, and if you had that as your use case, there is no need for the
> external memory feature.

I think it was broken in some way when you did the first work on the new memory management.
Because after this work, PMDs were adjusted to look for the memory only in the fbarrays, and such external memory doesn't exist there.
Otherwise, if it was working, why did you add the support for external memory? Which other use case were you trying to cover?

It is fixed in some way with the new work to support external memory, but it requires changes in the application (the "overhead" I referred to before).

> 
> On the other hand, if you were to use it in a different way - for example,
> allocating hash tables or other DPDK data structures - then such a feature is
> essential. 

Hash tables that the NIC needs to read/write to/from?
I didn't get the point here. 

> The entire point was to allow using external memory with
> semantics identical to how you use the rest of DPDK.

I understand this design choice, and it has the benefit of consistency vs. more API calls on the application side.

> 
> Also, whether it's "intuitive" depends on perspective - you say "i expect to
> allocate memory and map it for DMA myself", i say "why do i have to care
> about DMA mapping, DPDK should do this for me" :) 

Right 😊. We should be driven by customers and not vendors; too bad there is not much feedback from them in the community.
I know from the VPP side that they prefer not to use the rte_socket mechanism but rather to manage the memory by themselves, and they are perfectly fine with mapping it explicitly. The same is true for another company that has a vswitch solution (which I cannot mention by name).


> If you are using your
> own memory management and doing everything manually - you get to map
> everything for DMA manually. If you are using DPDK facilities
> - it is intuitive that for any DPDK-managed memory, internal or external,
> same rules apply across the board - DMA mapping included.

So if I understand your point here, you claim "we need both": one for users who prefer DPDK to do that for them, and the other for users who want more control.
However, if we keep the API for explicit mapping, we can no longer say that external memory behaves exactly the same as internal.

So going back to my RFC 😊, if we decide to keep both approaches, what I suggest is to change the VFIO mapping one to be more generic and per device (can be rte_device).
I think the biggest questions we need to answer are:
1. Are we OK with a generic, vendor-agnostic API?
2. Are we OK that the explicit mapping is done per device?
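
A rough sketch of what answering "yes" to both questions could look like: a DMA-map operation hanging off `rte_device`, with a shared VFIO implementation as the default for drivers that don't override it. None of these structures or ops exist in DPDK 18.11; all names here are illustrative only.

```c
#include <stdint.h>
#include <rte_dev.h>
#include <rte_vfio.h>

/* Hypothetical per-device DMA ops, resolved by the driver at probe
 * time. Vendor-agnostic callers only ever see rte_device. */
typedef int (*rte_dev_dma_map_t)(struct rte_device *dev,
		void *addr, uint64_t iova, size_t len);

struct rte_dev_dma_ops {
	rte_dev_dma_map_t map;
	rte_dev_dma_map_t unmap;
};

/* Shared default for VFIO-backed devices, so each PCI driver does
 * not have to reimplement the common case (MLX/NXP drivers would
 * install their own op instead). */
static int
dev_dma_map_vfio(struct rte_device *dev __rte_unused,
		void *addr, uint64_t iova, size_t len)
{
	return rte_vfio_dma_map((uint64_t)(uintptr_t)addr, iova, len);
}
```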


> 
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> I guess we can add a flag on the device mapping which will say
> >>>> MAP_TO_ALL_DEVICES, to ease the application's life in the presence of
> >>>> multiple devices in the host.
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
Burakov, Anatoly Nov. 20, 2018, 10:55 a.m. UTC | #10
On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>> memory
>>
>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>>
>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>>
>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>> Hi Anatoly,
>>>>>>>
>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
>>>>>>>>
> 
> [...]
> 
>>>>
>>>> I have to be honest, I didn't consider this question before :D I
>>>> guess there could be cases where using rte_malloc might not be
>>>> suitable because it wastes some memory on malloc elements, i.e. if
>>>> you want to use N pages as memory, you'd have to allocate N+1 pages.
>>>> If memory is at a premium, maybe manual management of it would be
>> better in some cases.
>>>
>>> I had similar thoughts, more related to the usability from the user side.
>>> When the application allocates external memory it just wants to use it
>> for DMA, i.e. put it as the mbuf buf_addr or to populate it w/ a mempool.
>>> It is an "overhead" to create a socket for this external memory, to populate
>> it w/ the memory, and later on to malloc from this socket (or use the socket
>> id for the mempool creation).
>>> Not to mention the fact that maybe the application wants to manage this
>> memory differently than how rte_malloc does.
>>>
>>> On the other hand, mapping memory to device before using it for dma is
>> far more intuitive.
>>
>> It is far more intuitive *if* you're doing all of the memory management
>> yourself or "just" using this memory for a mempool. This was already working
>> before, and if you had that as your use case, there is no need for the
>> external memory feature.
> 
> I think it was broken in some way when you did the first work on the new memory management.
> Because after this work, PMDs were adjusted to look for the memory only in the fbarrays, and such external memory doesn't exist there.
> Otherwise, if it was working, why did you add the support for external memory? Which other use case were you trying to cover?
> 
> It is fixed in some way with the new work to support external memory, but it requires changes in the application (the "overhead" I referred to before).

I feel we have a difference of terminology here.

What i refer to as "external memory support" is the feature that was 
added in 18.11, that has to do with creating heaps etc. - this memory is 
now (hopefully) properly supported across entire DPDK.

What you're probably referring to as "external memory support" is not 
that - it's the feature of mempools, rings etc. that allow you to use 
memory that's not registered with and not managed by DPDK. This is the 
way "external memory" was done prior to 18.11.

The above scenario (working with anonymous memory in mempools) is still 
"broken" in this sense - and i see no general way to fix it. It could 
work with mempools in that we could automatically create new heaps 
behind the scenes and populate them with new memory, but for other data 
structures (e.g. rte_ring_init() over anonymous memory), it's just not 
possible to do that.

I personally would prefer these API's to just go away and people started 
using the "official" way to manage external memory. Removing the API's 
may even be feasible for everything bar mempool, and mempool support we 
can manage more-or-less automatically in the above way (by automatically 
creating an external heap behind the scenes). But then by the time we're 
creating/adding external memory heaps inside mempool API's, you don't 
really need the external memory DMA mapping API calls because the 
mempool would do that automatically by way of using the external heaps API.

In other words, the problem here isn't so much usability as it is the 
fact that, for some stuff, a lot of internal code now relies on DPDK's 
internal page tables, and we have to either make allowances for that on 
a case-by-case basis (by the way, no one came to me with these issues!), 
or admit that using unregistered memory is a bad idea and let's not do 
that in the future.

> 
>>
>> On the other hand, if you were to use it in a different way - for example,
>> allocating hash tables or other DPDK data structures - then such a feature is
>> essential.
> 
> Hash tables that the NIC needs to read/write to/from?
> I didn't get the point here.

The point wasn't hash tables, but support for use cases we might not 
have thought of. That includes doing strange things like allocating DPDK 
data structures in non-standard memory locations.

> 
> The entire point was to allow using external memory with
>> semantics identical to how you use the rest of DPDK.
> 
> I understand this design choice, and it has the benefit of consistency v.s. more API call on the application side.
> 
>>
>> Also, whether it's "intuitive" depends on perspective - you say "i expect to
>> allocate memory and map it for DMA myself", i say "why do i have to care
>> about DMA mapping, DPDK should do this for me" :)
> 
> Right 😊. We should be driven by customers and not vendors; too bad there is not much feedback from them in the community.

That implies current design *wasn't* driven by customers, which is not 
true. They may not be the same as ones you are aware of - but then this 
isn't really my fault :)

> I know from the VPP side that they prefer not to use the rte_socket mechanism but rather to manage the memory by themselves, and they are perfectly fine with mapping it explicitly. The same is true for another company that has a vswitch solution (which I cannot mention by name).

Well, it's not like i can do anything if no one from either communities 
comes here and engages in the discussion. I wish i could read people's 
minds, but alas :)

> 
> 
> If you are using your
>> own memory management and doing everything manually - you get to map
>> everything for DMA manually. If you are using DPDK facilities
>> - it is intuitive that for any DPDK-managed memory, internal or external,
>> same rules apply across the board - DMA mapping included.
> 
> So if I understand your point here, you claim "we need both": one for users who prefer DPDK to do that for them, and the other for users who want more control.
> However, if we keep the API for explicit mapping, we can no longer say that external memory behaves exactly the same as internal.

Well, more or less that, yes. It would be good to have full support for 
"external" memory (in the "unregistered with DPDK" sense, as outlined 
above).

> 
> So going back to my RFC 😊, if we decide to keep both approaches, what I suggest is to change the VFIO mapping one to be more generic and per device (can be rte_device).
> I think the biggest questions we need to answer are:
> 1. Are we OK with a generic, vendor-agnostic API?

Yes, generic and vendor-agnostic FTW!

> 2. Are we OK that the explicit mapping is done per device?

This question is slightly misleading, because that's not *all* there is 
to it. In general, yes, i'm OK with the idea. In practice, VFIO may get 
in the way.

VFIO works by mapping memory for a VFIO container. The container can 
have one or more devices in it, and how many devices it can have depends 
on the IOMMU setup (i.e. it may not be possible to have one device per 
container on some setups, which is why currently everything is done with 
one container). So, "per-device" means we either keep each device in 
separate container (which will not work in some cases), or we somehow 
work around the fact that mappings can be performed only once (i'm not 
sure if VFIO will return appropriate error codes that would allow us to 
distinguish failed mappings due to duplicate, or failed mappings due to 
other reasons - i'll have to research this), and keep all devices in the 
same container.

In other words, it may be possible to solve this issue for VFIO, but i 
would not be comfortable to proceed without knowing that this can be 
done in a sane way.
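
If all devices stay in one shared container, one conceivable workaround is to treat "already mapped" as success. Whether vfio_iommu_type1 reliably reports an overlapping mapping as EEXIST, and whether that errno value propagates up, is exactly the research item mentioned above, so this is only a sketch of the idea, not a verified solution:

```c
#include <errno.h>
#include <stdint.h>
#include <rte_vfio.h>

/* Map a region into the (single, shared) VFIO container, tolerating
 * the case where another device's setup already mapped it. */
static int
container_dma_map_once(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	if (rte_vfio_dma_map(vaddr, iova, len) == 0)
		return 0;
	/* Assumption under test: the VFIO_IOMMU_MAP_DMA ioctl fails
	 * with EEXIST for overlapping ranges, and errno reaches us
	 * here; in that case another device already mapped the region
	 * and we can safely proceed. */
	if (errno == EEXIST)
		return 0;
	return -1;
}
```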

There's also a question of automatic vs. manual mapping. The way i see 
it, the choices are:

1) drop automatic registration for DMA altogether, for both internal and 
external memory
2) drop automatic registration for DMA for external memory only, but 
keep automatic DMA registration for internal memory
3) do automatic DMA registration for both

I don't like the second choice for API consistency reasons, so i would 
rather go either for 1) (and cause massive breakage of all existing 
applications along the way...), or 3) (the way it's working right now). 
The pragmatic choice *may* be 2), which is what i feel you are 
advocating for, but it has its own problems, mainly the mechanics of it. 
If we can get them worked out, i may just bite the bullet and ack the 
patches :D

Specifically, let's suppose we want automatic memory registration for 
DMA on internal but not external memory. How does it happen? Currently, 
it happens via mem event callbacks.
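
For reference, the "automatic" path being defended here is the existing mem event callback mechanism (these APIs do exist in DPDK today): any interested party, DMA mapping included, registers once and is then notified about every allocation and free, whether the memory is internal or added via an external heap.

```c
#include <stddef.h>
#include <rte_memory.h>

/* Called for every alloc/free of DPDK-managed memory, including
 * external heaps populated via rte_malloc_heap_memory_add(). */
static void
dma_mem_event_cb(enum rte_mem_event event_type, const void *addr,
		size_t len, void *arg __rte_unused)
{
	if (event_type == RTE_MEM_EVENT_ALLOC) {
		/* map [addr, addr + len) for DMA here */
	} else {
		/* RTE_MEM_EVENT_FREE: unmap the region */
	}
}

static int
subscribe_to_mem_events(void)
{
	return rte_mem_event_callback_register("dma-map",
			dma_mem_event_cb, NULL);
}
```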

If we were to drop the callback approach for external memory, we would 
lose out on additional functionality provided by said callbacks, and i 
would really like to avoid that.

If we were to switch to manual mapping, that means EAL will depend on 
ethdev (you have suggested rte_device as an alternative - would that fix 
the issue?), and we will have to rewrite current DMA mapping code to do 
mappings explicitly, rather than relying on callbacks mechanism. So, 
instead of e.g. VFIO mapping being performed via callback, we would 
instead iterate over each rte_device and call its respective DMA mapping 
function explicitly - for internal memory, but not for external. This 
*may* work, but i need to think it through a bit more :)

This also still leaves a question of external unregistered memory - this 
is a wholly separate issue.

TL;DR i'm tentatively OK with the proposal, but we need to work certain 
things out before we go too far along the path and find out that we've 
reached a dead end.
Shahaf Shuler Nov. 22, 2018, 10:06 a.m. UTC | #11
Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
> external memory
> 
> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
> > Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
> >> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
> >> memory
> >>
> >> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> >>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
> >>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>> external memory
> >>>>
> >>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> >>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> >>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>> external memory
> >>>>>>
> >>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> >>>>>>> Hi Anatoly,
> >>>>>>>
> >>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>>>> external memory
> >>>>>>>>
> >
> > [...]

[...]

> > I think it was broken in some way when you did the first work on the new
> memory management.
> > Because after this work, PMDs were adjusted to look for the memory only
> in the fbarrays, and such external memory doesn't exists there.
> > otherwise, if working, why you added the support for the external
> memory? Which other use case you tried to cover?
> >
> > It is fixed in some way with the new work to support external memory, but
> requires changes in the application (the "overhead" I referred before).
> 
> I feel we have a difference of terminology here.
> 
> What i refer to as "external memory support" is the feature that was added
> in 18.11, that has to do with creating heaps etc. - this memory is now
> (hopefully) properly supported across entire DPDK.
> 
> What you're probably referring to as "external memory support" is not that -
> it's the feature of mempools, rings etc. that allow you to use memory that's
> not registered with and not managed by DPDK. This is the way "external
> memory" was done prior to 18.11.

I thought the 18.11 feature should serve both cases. 
To use memory not managed by DPDK, the user should create a heap and populate it with the memory, so that it can be used later for mempools/rings. 

> 
> The above scenario (working with anonymous memory in mempools) is still
> "broken" in this sense - and i see no general way to fix it. It could work with
> mempools in that we could automatically create new heaps behind the
> scenes and populate them with new memory, but for other data structures
> (e.g. rte_ring_init() over anonymous memory), it's just not possible to do
> that.
> 
> I personally would prefer these API's to just go away and people started
> using the "official" way to manage external memory. Removing the API's may
> even be feasible for everything bar mempool, and mempool support we can
> manage more-or-less automatically in the above way (by automatically
> creating an external heap behind the scenes). But then by the time we're
> creating/adding external memory heaps inside mempool API's, you don't
> really need the external memory DMA mapping API calls because the
> mempool would do that automatically by way of using the external heaps
> API.

Yes, one way to go is to map the memory internally during mempool creation, but it will not work for all cases.
For example, if the application wants to work with mbufs carrying an external buffer (EXT_ATTACHED_MBUF), the buffer is attached by a separate call (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
Maybe we will find more cases in the future; I would prefer to solve it for all of them. See more below. 

> 
> In other words, the problem here isn't so much usability as it is the fact that,
> for some stuff, a lot of internal code now relies on DPDK's internal page
> tables, and we have to either make allowances for that on a case-by-case
> basis (by the way, no one came to me with these issues!), 

One option is case by case; another is to allow the application to explicitly map this memory to the different devices. 

> or admit that using
> unregistered memory is a bad idea and let's not do that in the future.

I think using unregistered memory is unavoidable. The use case exists: there are applications which prefer to have their own memory management. 
For those apps, we need to have a solution. 

> 
> >
> >>
> >> On the other hand, if you were to use it in a different way - for
> >> example, allocating hash tables or other DPDK data structures - then
> >> such a feature is essential.
> >
> > Hash tables that the NIC needs to read/write to/from?
> > I didn't get the point here.
> 
> The point wasn't hash tables, but support for use cases we might not have
> thought of. That includes doing strange things like allocating DPDK data
> structures in non-standard memory locations.
> 
> >
> > The entire point was to allow using external memory with
> >> semantics identical to how you use the rest of DPDK.
> >
> > I understand this design choice, and it has the benefit of consistency v.s.
> more API call on the application side.
> >
> >>
> >> Also, whether it's "intuitive" depends on perspective - you say "i
> >> expect to allocate memory and map it for DMA myself", i say "why do i
> >> have to care about DMA mapping, DPDK should do this for me" :)
> >
> > Right 😊. We should be driven by customers and not vendors, too bad there
> is no much feedback from them on the community.
> 
> That implies current design *wasn't* driven by customers, which is not true.
> They may not be the same as ones you are aware of - but then this isn't really
> my fault :)
> 
> > I know from VPP side they prefer not to use the rte_socket rather to
> manage the memory by them self, and they are perfectly fine with mapping
> it explicitly. This is true for other company that has vswitch solution (which I
> cannot mention by name).
> 
> Well, it's not like i can do anything if no one from either communities comes
> here and engages in the discussion. I wish i could read people's minds, but
> alas :)
> 
> >
> >
> > If you are using your
> >> own memory management and doing everything manually - you get to
> map
> >> everything for DMA manually. If you are using DPDK facilities
> >> - it is intuitive that for any DPDK-managed memory, internal or
> >> external, same rules apply across the board - DMA mapping included.
> >
> > So if understand you point here, your claim "we need both". One for users
> which prefer the DPDK to do that for them and the other for the users who
> wants more control.
> > However, when keeping the API of explicit mapping, we can no longer say
> that external memory is behaves exactly the same as internal.
> 
> Well, more or less that, yes. It would be good to have full support for
> "external" memory (in the "unregistered with DPDK" sense, as outlined
> above).
> 
> >
> > So going back to my RFC 😊, if we decide to keep both approach, what I
> suggest is to change the VFIO mapping one to be more generic and per
> device (can be rte_device).
> > I think the biggest questions we need to answer are 1. are we OK with
> > generic, vendor agnostic API?
> 
> Yes, generic and vendor-agnostic FTW!
> 
> > 2.  are we OK that the explicit mapping is done per device?
> 
> This question is slightly misleading, because that's not *all* there is to it. In
> general, yes, i'm OK with the idea. In practice, VFIO may get in the way.
> 
> VFIO works by mapping memory for a VFIO container. The container can
> have one or more devices in it, and how many devices it can have depends
> on the IOMMU setup (i.e. it may not be possible to have one device per
> container on some setups, which is why currently everything is done with
> one container).


> So, "per-device" means we either keep each device in
> separate container (which will not work in some cases), 
> or we somehow
> work around the fact that mappings can be performed only once (i'm not
> sure if VFIO will return appropriate error codes that would allow us to
> distinguish failed mappings due to duplicate, or failed mappings due to other
> reasons - i'll have to research this), and keep all devices in the same
> container.

Yes, I see two options here:
1. VFIO error reporting is good enough for the user to tell that a mapping failed because it already exists. 
2. Hold a database recording which IOVAs were already mapped, so the DPDK VFIO mapping code knows to avoid remapping them. 

> 
> In other words, it may be possible to solve this issue for VFIO, but i would not
> be comfortable to proceed without knowing that this can be done in a sane
> way.

+1.

I think the simplest is #1 above. How can we get an answer on it? 

> 
> There's also a question of automatic vs. manual mapping. The way i see it,
> the choices are:
> 
> 1) drop automatic registration for DMA altogether, for both internal and
> external memory
> 2) drop automatic registration for DMA for external memory only, but keep
> automatic DMA registration for internal memory
> 3) do automatic DMA registration for both
> 
> I don't like the second choice for API consistency reasons, so i would rather
> go either for 1) (and cause massive breakage of all existing applications along
> the way...), or 3) (the way it's working right now).
> The pragmatic choice *may* be 2), which is what i feel you are advocating
> for, but it has its own problems, mainly the mechanics of it.
> If we can get them worked out, i may just bite the bullet and ack the patches
> 😃

I don't think we need to do such extensive changes. 

I think it is OK to have two ways for an application to work (the heaps and the explicit DMA call). After all, DPDK is a development kit, and we should make application life easier for different use cases. 
If the application wants DPDK's help, it can create the heap and everything will just happen. Otherwise the application will do the memory management by itself, and use the DMA call explicitly. 

> 
> Specifically, let's suppose we want automatic memory registration for DMA
> on internal but not external memory. How does it happen? Currently, it
> happens via mem event callbacks.
> 
> If we were to drop the callback approach for external memory, we would
> lose out on additional functionality provided by said callbacks, and i would
> really like to avoid that.
> 
> If we were to switch to manual mapping, that means EAL will depend on
> ethdev (you have suggested rte_device as an alternative - would that fix the
> issue?)

No, it won't.
It will help only for the cases where we have multiple devices on the same PCI address (like representors). 

> , and we will have to rewrite current DMA mapping code to do
> mappings explicitly, rather than relying on callbacks mechanism. So, instead
> of e.g. VFIO mapping being performed via callback, we would instead iterate
> over each rte_device and call its respective DMA mapping function explicitly -
> for internal memory, but not for external. This
> *may* work, but i need to think it through a bit more :)
> 
> This also still leaves a question of external unregistered memory - this is a
> wholly separate issue.
> 
> TL;DR i'm tentatively OK with the proposal, but we need to work certain
> things out before we go too far along the path and find out that we've
> reached a dead end.

Fully agree; this is what the RFC is for.
I think our biggest question is whether VFIO will behave OK with double mapping, or whether we need some tracking mechanism in DPDK.

> 
> --
> Thanks,
> Anatoly
Burakov, Anatoly Nov. 22, 2018, 10:41 a.m. UTC | #12
On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
> Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
>> external memory
>>
>> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
>>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>>
>>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>>
>>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
>>>>>>>>
>>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>>>> Hi Anatoly,
>>>>>>>>>
>>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>>>> external memory
>>>>>>>>>>
>>>
>>> [...]
> 
> [...]
> 
>>> I think it was broken in some way when you did the first work on the new
>> memory management.
>>> Because after this work, PMDs were adjusted to look for the memory only
>> in the fbarrays, and such external memory doesn't exists there.
>>> otherwise, if working, why you added the support for the external
>> memory? Which other use case you tried to cover?
>>>
>>> It is fixed in some way with the new work to support external memory, but
>> requires changes in the application (the "overhead" I referred before).
>>
>> I feel we have a difference of terminology here.
>>
>> What i refer to as "external memory support" is the feature that was added
>> in 18.11, that has to do with creating heaps etc. - this memory is now
>> (hopefully) properly supported across entire DPDK.
>>
>> What you're probably referring to as "external memory support" is not that -
>> it's the feature of mempools, rings etc. that allow you to use memory that's
>> not registered with and not managed by DPDK. This is the way "external
>> memory" was done prior to 18.11.
> 
> I thought the 18.11 feature should serve both cases.
> To use memory not managed by DPDK user should create heap and populate it w/ the memory, so that it can use it later for mempool/rings.

The feature introduced in 18.11 expects you to use DPDK allocation 
methods with the added memory. It is not a supported use case to create an 
external memory area, register it with a heap *and* use manual memory 
management on that area (e.g. populate a mempool anonymously).

Perhaps there should be a new API, something like "register_memory", 
where we only register memory, but do not expect the user to use the DPDK 
allocator with said memory. I wish i would've thought of that before, 
where have you been all this time! :D I think this feature should 
definitely go into 19.02. Are we past the v1 deadline?

> 
>>
>> The above scenario (working with anonymous memory in mempools) is still
>> "broken" in this sense - and i see no general way to fix it. It could work with
>> mempools in that we could automatically create new heaps behind the
>> scenes and populate them with new memory, but for other data structures
>> (e.g. rte_ring_init() over anonymous memory), it's just not possible to do
>> that.
>>
>> I personally would prefer these API's to just go away and people started
>> using the "official" way to manage external memory. Removing the API's may
>> even be feasible for everything bar mempool, and mempool support we can
>> manage more-or-less automatically in the above way (by automatically
>> creating an external heap behind the scenes). But then by the time we're
>> creating/adding external memory heaps inside mempool API's, you don't
>> really need the external memory DMA mapping API calls because the
>> mempool would do that automatically by way of using the external heaps
>> API.
> 
> Yes, one way to go is to map the memory internally inside the mempool creation. But it will not work for all cases.
> For example, if application wants to work w/ mbuf with external buffer (EXT_ATTACHED_MBUF), the buffer will be attached on a different call (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
> Maybe we will find more cases in the future. i would prefer to solve it for all cases.  See more below.
> 
>>
>> In other words, the problem here isn't so much usability as it is the fact that,
>> for some stuff, a lot of internal code now relies on DPDK's internal page
>> tables, and we have to either make allowances for that on a case-by-case
>> basis (by the way, no one came to me with these issues!),
> 
> One option is case by case, other can be to allow the application to explicit map this memory to the different devices.
> 
> or admit that using
>> unregistered memory is a bad idea and let's not do that in the future.
> 
> I think it is un-avoidable (to use unregistered memory). The use case exists, there are applications which prefer to have their own memory management.
> For those apps, we need to have a solution.
> 
>>
>>>
>>>>
>>>> On the other hand, if you were to use it in a different way - for
>>>> example, allocating hash tables or other DPDK data structures - then
>>>> such a feature is essential.
>>>
>>> Hash tables that the NIC needs to read/write to/from?
>>> I didn't get the point here.
>>
>> The point wasn't hash tables, but support for use cases we might not have
>> thought of. That includes doing strange things like allocating DPDK data
>> structures in non-standard memory locations.
>>
>>>
>>> The entire point was to allow using external memory with
>>>> semantics identical to how you use the rest of DPDK.
>>>
>>> I understand this design choice, and it has the benefit of consistency v.s.
>> more API call on the application side.
>>>
>>>>
>>>> Also, whether it's "intuitive" depends on perspective - you say "i
>>>> expect to allocate memory and map it for DMA myself", i say "why do i
>>>> have to care about DMA mapping, DPDK should do this for me" :)
>>>
>>> Right 😊. We should be driven by customers and not vendors, too bad there
>> is no much feedback from them on the community.
>>
>> That implies current design *wasn't* driven by customers, which is not true.
>> They may not be the same as ones you are aware of - but then this isn't really
>> my fault :)
>>
>>> I know from VPP side they prefer not to use the rte_socket rather to
>> manage the memory by them self, and they are perfectly fine with mapping
>> it explicitly. This is true for other company that has vswitch solution (which I
>> cannot mention by name).
>>
>> Well, it's not like i can do anything if no one from either communities comes
>> here and engages in the discussion. I wish i could read people's minds, but
>> alas :)
>>
>>>
>>>
>>> If you are using your
>>>> own memory management and doing everything manually - you get to
>> map
>>>> everything for DMA manually. If you are using DPDK facilities
>>>> - it is intuitive that for any DPDK-managed memory, internal or
>>>> external, same rules apply across the board - DMA mapping included.
>>>
>>> So if understand you point here, your claim "we need both". One for users
>> which prefer the DPDK to do that for them and the other for the users who
>> wants more control.
>>> However, when keeping the API of explicit mapping, we can no longer say
>> that external memory is behaves exactly the same as internal.
>>
>> Well, more or less that, yes. It would be good to have full support for
>> "external" memory (in the "unregistered with DPDK" sense, as outlined
>> above).
>>
>>>
>>> So going back to my RFC 😊, if we decide to keep both approach, what I
>> suggest is to change the VFIO mapping one to be more generic and per
>> device (can be rte_device).
>>> I think the biggest questions we need to answer are 1. are we OK with
>>> generic, vendor agnostic API?
>>
>> Yes, generic and vendor-agnostic FTW!
>>
>>> 2.  are we OK that the explicit mapping is done per device?
>>
>> This question is slightly misleading, because that's not *all* there is to it. In
>> general, yes, i'm OK with the idea. In practice, VFIO may get in the way.
>>
>> VFIO works by mapping memory for a VFIO container. The container can
>> have one or more devices in it, and how many devices it can have depends
>> on the IOMMU setup (i.e. it may not be possible to have one device per
>> container on some setups, which is why currently everything is done with
>> one container).
> 
> 
>   So, "per-device" means we either keep each device in
>> separate container (which will not work in some cases),
> or we somehow
>> work around the fact that mappings can be performed only once (i'm not
>> sure if VFIO will return appropriate error codes that would allow us to
>> distinguish failed mappings due to duplicate, or failed mappings due to other
>> reasons - i'll have to research this), and keep all devices in the same
>> container.
> 
> Yes I see here 2 options:
> 1. the VFIO has good error reports that enables the user to understand mapping failed because it is already exists.
> 2. to hold database which will say which IOVA were already mapped so the DPDK vfio mapping will know to avoid it.

Option 2 is difficult to support with secondary processes. Secondary 
processes also use the same container and thus should have the same view 
of what is mapped and what isn't. Given that VFIO mapping may happen 
during memory allocation, we won't be able to allocate more memory to 
hold extra mappings, so shared memory won't work here. We could use 
fbarray for this, but that would make the amount of mappings fixed.

> 
>>
>> In other words, it may be possible to solve this issue for VFIO, but i would not
>> be comfortable to proceed without knowing that this can be done in a sane
>> way.
> 
> +1.
> 
> I think the simplest is #1 above. How can we get an answer on it?

A few experiments should do it :) I'll get back to you on that.

> 
>>
>> There's also a question of automatic vs. manual mapping. The way i see it,
>> the choices are:
>>
>> 1) drop automatic registration for DMA altogether, for both internal and
>> external memory
>> 2) drop automatic registration for DMA for external memory only, but keep
>> automatic DMA registration for internal memory
>> 3) do automatic DMA registration for both
>>
>> I don't like the second choice for API consistency reasons, so i would rather
>> go either for 1) (and cause massive breakage of all existing applications along
>> the way...), or 3) (the way it's working right now).
>> The pragmatic choice *may* be 2), which is what i feel you are advocating
>> for, but it has its own problems, mainly the mechanics of it.
>> If we can get them worked out, i may just bite the bullet and ack the patches
>> 😃
> 
> I don't think we need to do such extensive changes.
> 
> I think it is ok to have 2 ways for application to work (the heaps and the explicit DMA call). After all, DPDK is a development kit and we should make application life easier for different use cases.
> If the application wants the DPDK to help, it can create the heap and everything will just happen. Otherwise application will do the memory management by itself, and use the DMA call explicitly.

In other words, what you are proposing is option 3), with the addition of 
support for manual memory/DMA mapping management by way of the new API 
proposed above (a register_memory() call). Am i understanding you correctly?

> 
>>
>> Specifically, let's suppose we want automatic memory registration for DMA
>> on internal but not external memory. How does it happen? Currently, it
>> happens via mem event callbacks.
>>
>> If we were to drop the callback approach for external memory, we would
>> lose out on additional functionality provided by said callbacks, and i would
>> really like to avoid that.
>>
>> If we were to switch to manual mapping, that means EAL will depend on
>> ethdev (you have suggested rte_device as an alternative - would that fix the
>> issue?)
> 
> No it won't.
> It will help only for the cases were we have multiple devices on the same PCI address (like representors).

So, should it be an rte_device call then?

> 
> , and we will have to rewrite current DMA mapping code to do
>> mappings explicitly, rather than relying on callbacks mechanism. So, instead
>> of e.g. VFIO mapping being performed via callback, we would instead iterate
>> over each rte_device and call its respective DMA mapping function explicitly -
>> for internal memory, but not for external. This
>> *may* work, but i need to think it through a bit more :)
>>
>> This also still leaves a question of external unregistered memory - this is a
>> wholly separate issue.
>>
>> TL;DR i'm tentatively OK with the proposal, but we need to work certain
>> things out before we go too far along the path and find out that we've
>> reached a dead end.
> 
> Fully agree, this is why the RFC is for.
> I think our biggest question is whether or not VFIO will behave OK with double mapping or we need some tracking mechanism in DPDK.

Yes, i will check this for VFIO and get back to you.

> 
>>
>> --
>> Thanks,
>> Anatoly
Shahaf Shuler Nov. 22, 2018, 11:31 a.m. UTC | #13
Thursday, November 22, 2018 12:42 PM, Burakov, Anatoly:
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
> external memory
> 
> On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
> > Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
> >> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping
> >> for external memory
> >>
> >> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
> >>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
> >>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>> external memory
> >>>>
> >>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> >>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
> >>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>> external memory
> >>>>>>
> >>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> >>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> >>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>>>> external memory
> >>>>>>>>
> >>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping
> for
> >>>>>>>>>> external memory
> >>>>>>>>>>
> >>>
> >>> [...]
> >
> > [...]
> >
> > I thought the 18.11 feature should serve both cases.
> > To use memory not managed by DPDK user should create heap and
> populate it w/ the memory, so that it can use it later for mempool/rings.
> 
> The feature introduced in 18.11 expects you to use DPDK allocation methods
> with added memory. It is not a supported use case to create an external
> memory area, register it with heap *and* use manual memory management
> on that area (e.g. populate mempool anonymously).
> 
> Perhaps there should be a new API, something like "register_memory",
> where we only register memory, but do not expect the user to use DPDK
> allocator with said memory. 

Isn't it the existing rte_vfio_dma_map?

> I wish i would've thought of that before, where
> have you been all this time! :D I think this should feature definitely go into
> 19.02. Are we past the v1 deadline?
> 
> >
> >>
> >> The above scenario (working with anonymous memory in mempools) is
> >> still "broken" in this sense - and i see no general way to fix it. It
> >> could work with mempools in that we could automatically create new
> >> heaps behind the scenes and populate them with new memory, but for
> >> other data structures (e.g. rte_ring_init() over anonymous memory),
> >> it's just not possible to do that.
> >>
> >> I personally would prefer these API's to just go away and people
> >> started using the "official" way to manage external memory. Removing
> >> the API's may even be feasible for everything bar mempool, and
> >> mempool support we can manage more-or-less automatically in the
> above
> >> way (by automatically creating an external heap behind the scenes).
> >> But then by the time we're creating/adding external memory heaps
> >> inside mempool API's, you don't really need the external memory DMA
> >> mapping API calls because the mempool would do that automatically by
> >> way of using the external heaps API.
> >
> > Yes, one way to go is to map the memory internally inside the mempool
> creation. But it will not work for all cases.
> > For example, if application wants to work w/ mbuf with external buffer
> (EXT_ATTACHED_MBUF), the buffer will be attached on a different call
> (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
> > Maybe we will find more cases in the future. i would prefer to solve it for all
> cases.  See more below.
> >
> >>
> >> In other words, the problem here isn't so much usability as it is the
> >> fact that, for some stuff, a lot of internal code now relies on
> >> DPDK's internal page tables, and we have to either make allowances
> >> for that on a case-by-case basis (by the way, no one came to me with
> >> these issues!),
> >
> > One option is case by case, other can be to allow the application to explicit
> map this memory to the different devices.
> >
> > or admit that using
> >> unregistered memory is a bad idea and let's not do that in the future.
> >
> > I think it is un-avoidable (to use unregistered memory). The use case exists,
> there are applications which prefer to have their own memory management.
> > For those apps, we need to have a solution.
> >
> >>
> >>>
> >>>>
> >>>> On the other hand, if you were to use it in a different way - for
> >>>> example, allocating hash tables or other DPDK data structures -
> >>>> then such a feature is essential.
> >>>
> >>> Hash tables that the NIC needs to read/write to/from?
> >>> I didn't get the point here.
> >>
> >> The point wasn't hash tables, but support for use cases we might not
> >> have thought of. That includes doing strange things like allocating
> >> DPDK data structures in non-standard memory locations.
> >>
> >>>
> >>> The entire point was to allow using external memory with
> >>>> semantics identical to how you use the rest of DPDK.
> >>>
> >>> I understand this design choice, and it has the benefit of consistency v.s.
> >> more API call on the application side.
> >>>
> >>>>
> >>>> Also, whether it's "intuitive" depends on perspective - you say "i
> >>>> expect to allocate memory and map it for DMA myself", i say "why do
> >>>> i have to care about DMA mapping, DPDK should do this for me" :)
> >>>
> >>> Right 😊. We should be driven by customers and not vendors, too bad
> >>> there
> >> is no much feedback from them on the community.
> >>
> >> That implies current design *wasn't* driven by customers, which is not
> true.
> >> They may not be the same as ones you are aware of - but then this
> >> isn't really my fault :)
> >>
> >>> I know from VPP side they prefer not to use the rte_socket rather to
> >> manage the memory by them self, and they are perfectly fine with
> >> mapping it explicitly. This is true for other company that has
> >> vswitch solution (which I cannot mention by name).
> >>
> >> Well, it's not like i can do anything if no one from either
> >> communities comes here and engages in the discussion. I wish i could
> >> read people's minds, but alas :)
> >>
> >>>
> >>>
> >>> If you are using your
> >>>> own memory management and doing everything manually - you get to
> >> map
> >>>> everything for DMA manually. If you are using DPDK facilities
> >>>> - it is intuitive that for any DPDK-managed memory, internal or
> >>>> external, same rules apply across the board - DMA mapping included.
> >>>
> >>> So if understand you point here, your claim "we need both". One for
> >>> users
> >> which prefer the DPDK to do that for them and the other for the users
> >> who wants more control.
> >>> However, when keeping the API of explicit mapping, we can no longer
> >>> say
> >> that external memory is behaves exactly the same as internal.
> >>
> >> Well, more or less that, yes. It would be good to have full support
> >> for "external" memory (in the "unregistered with DPDK" sense, as
> >> outlined above).
> >>
> >>>
> >>> So going back to my RFC 😊, if we decide to keep both approach, what
> >>> I
> >> suggest is to change the VFIO mapping one to be more generic and per
> >> device (can be rte_device).
> >>> I think the biggest questions we need to answer are 1. are we OK
> >>> with generic, vendor agnostic API?
> >>
> >> Yes, generic and vendor-agnostic FTW!
> >>
> >>> 2.  are we OK that the explicit mapping is done per device?
> >>
> >> This question is slightly misleading, because that's not *all* there
> >> is to it. In general, yes, i'm OK with the idea. In practice, VFIO may get in
> the way.
> >>
> >> VFIO works by mapping memory for a VFIO container. The container can
> >> have one or more devices in it, and how many devices it can have
> >> depends on the IOMMU setup (i.e. it may not be possible to have one
> >> device per container on some setups, which is why currently
> >> everything is done with one container).
> >
> >
> >   So, "per-device" means we either keep each device in
> >> separate container (which will not work in some cases),
> > or we somehow
> >> work around the fact that mappings can be performed only once (i'm
> >> not sure if VFIO will return appropriate error codes that would allow
> >> us to distinguish failed mappings due to duplicate, or failed
> >> mappings due to other reasons - i'll have to research this), and keep
> >> all devices in the same container.
> >
> > Yes I see here 2 options:
> > 1. the VFIO has good error reports that enables the user to understand
> mapping failed because it is already exists.
> > 2. to hold database which will say which IOVA were already mapped so the
> DPDK vfio mapping will know to avoid it.
> 
> Option 2 is difficult to support with secondary processes. Secondary
> processes also use the same container and thus should have the same view
> of what is mapped and what isn't. Given that VFIO mapping may happen
> during memory allocation, we won't be able to allocate more memory to hold
> extra mappings, so shared memory won't work here. We could use fbarray
> for this, but that would make the amount of mappings fixed.
> 
> >
> >>
> >> In other words, it may be possible to solve this issue for VFIO, but
> >> i would not be comfortable to proceed without knowing that this can
> >> be done in a sane way.
> >
> > +1.
> >
> > I think the simplest is #1 above. How can we get an answer on it?
> 
> A few experiments should do it :) I'll get back to you on that.
> 
> >
> >>
> >> There's also a question of automatic vs. manual mapping. The way i
> >> see it, the choices are:
> >>
> >> 1) drop automatic registration for DMA altogether, for both internal
> >> and external memory
> >> 2) drop automatic registration for DMA for external memory only, but
> >> keep automatic DMA registration for internal memory
> >> 3) do automatic DMA registration for both
> >>
> >> I don't like the second choice for API consistency reasons, so i
> >> would rather go either for 1) (and cause massive breakage of all
> >> existing applications along the way...), or 3) (the way it's working right
> now).
> >> The pragmatic choice *may* be 2), which is what i feel you are
> >> advocating for, but it has its own problems, mainly the mechanics of it.
> >> If we can get them worked out, i may just bite the bullet and ack the
> >> patches
> >> ๐Ÿ˜ƒ
> >
> > I don't think we need to do such extensive changes.
> >
> > I think it is OK to have 2 ways for an application to work (the heaps and the
> explicit DMA call). After all, DPDK is a development kit and we should make
> application life easier for different use cases.
> > If the application wants the DPDK to help, it can create the heap and
> everything will just happen. Otherwise the application will do the memory
> management by itself, and use the DMA call explicitly.
> 
> In other words, what you are proposing is option 3), with an addition of
> support for manual memory/DMA mapping management by way of the new
> API proposed above (register_memory() call). Am i understanding you
> correctly?

Exactly.
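The direction being agreed here (automatic registration for DPDK-managed memory via mem event callbacks, plus an explicit call for externally managed memory) can be pictured with a small standalone model. Everything below is a toy sketch with illustrative names (`register_mem_event_cb`, `eal_add_memory`); it is not the real EAL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the mem-event-callback flow: when DPDK-managed memory is
 * added, every registered callback (e.g. a driver's DMA-map hook) fires
 * automatically.  Externally managed memory never enters this path, so
 * the driver hook has to be invoked explicitly by the application. */
typedef void (*mem_event_cb_t)(uint64_t vaddr, uint64_t len);

#define MAX_CBS 4
static mem_event_cb_t callbacks[MAX_CBS];
static int n_cbs;

static int dma_mapped_regions;	/* what a driver's hook would maintain */

static void
driver_dma_map_cb(uint64_t vaddr, uint64_t len)
{
	(void)vaddr; (void)len;
	dma_mapped_regions++;	/* stands in for a real VFIO/Verbs map */
}

static int
register_mem_event_cb(mem_event_cb_t cb)
{
	if (n_cbs == MAX_CBS)
		return -1;
	callbacks[n_cbs++] = cb;
	return 0;
}

/* DPDK-managed path: adding memory fires all callbacks automatically. */
static void
eal_add_memory(uint64_t vaddr, uint64_t len)
{
	int i;

	for (i = 0; i < n_cbs; i++)
		callbacks[i](vaddr, len);
}
```

In this picture, the heap path rides the callback for free, while the "own memory management" path simply calls the driver's mapping hook directly; both end in the same place.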

> 
> >
> >>
> >> Specifically, let's suppose we want automatic memory registration for
> >> DMA on internal but not external memory. How does it happen?
> >> Currently, it happens via mem event callbacks.
> >>
> >> If we were to drop the callback approach for external memory, we
> >> would lose out on additional functionality provided by said
> >> callbacks, and i would really like to avoid that.
> >>
> >> If we were to switch to manual mapping, that means EAL will depend on
> >> ethdev (you have suggested rte_device as an alternative - would that
> >> fix the
> >> issue?)
> >
> > No, it won't.
> > It will help only for the cases where we have multiple devices on the same
> PCI address (like representors).
> 
> So, should it be an rte_device call then?

I think it should be rte_device, so that we can do the same for crypto devices (and others).
For the representor case, PMDs can have an iterator to register the memory to all of the related eth devices.
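That per-device shape can be sketched in a few lines of self-contained C. All names here are hypothetical stand-ins (`fake_rte_device`, `pmd_map_all_representors`, the ops layout), not DPDK API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device driver ops, modeled on the RFC's dma_map idea. */
struct dev_ops {
	int (*dma_map)(void *dev_priv, uint64_t vaddr, uint64_t iova,
		       uint64_t len);
};

struct fake_rte_device {
	const struct dev_ops *ops;
	void *priv;
	struct fake_rte_device *sibling[4];	/* representors on same PCI dev */
	int n_sibling;
	int mapped;		/* how many times dma_map ran on this device */
};

static int
fake_dma_map(void *dev_priv, uint64_t vaddr, uint64_t iova, uint64_t len)
{
	(void)vaddr; (void)iova;
	if (len == 0)
		return -EINVAL;
	((struct fake_rte_device *)dev_priv)->mapped++;
	return 0;
}

/* Sketch of a generic, bus/vendor-agnostic per-device mapping call. */
static int
rte_dev_dma_map(struct fake_rte_device *dev, uint64_t vaddr, uint64_t iova,
		uint64_t len)
{
	if (dev->ops == NULL || dev->ops->dma_map == NULL)
		return -ENOTSUP;
	return dev->ops->dma_map(dev->priv, vaddr, iova, len);
}

/* PMD-side iterator for the representor case: one map call fans out to
 * every related eth device sharing the underlying rte_device. */
static int
pmd_map_all_representors(struct fake_rte_device *dev, uint64_t vaddr,
			 uint64_t iova, uint64_t len)
{
	int i, ret = rte_dev_dma_map(dev, vaddr, iova, len);

	for (i = 0; ret == 0 && i < dev->n_sibling; i++)
		ret = rte_dev_dma_map(dev->sibling[i], vaddr, iova, len);
	return ret;
}
```

The design point is that the application only ever talks to the generic per-device call; fanning out to representors is the PMD's business.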

> 
> >
> > , and we will have to rewrite current DMA mapping code to do
> >> mappings explicitly, rather than relying on callbacks mechanism. So,
> >> instead of e.g. VFIO mapping being performed via callback, we would
> >> instead iterate over each rte_device and call its respective DMA
> >> mapping function explicitly - for internal memory, but not for
> >> external. This
> >> *may* work, but i need to think it through a bit more :)
> >>
> >> This also still leaves a question of external unregistered memory -
> >> this is a wholly separate issue.
> >>
> >> TL;DR i'm tentatively OK with the proposal, but we need to work
> >> certain things out before we go too far along the path and find out
> >> that we've reached a dead end.
> >
> > Fully agree, this is what the RFC is for.
> > I think our biggest question is whether or not VFIO will behave OK with
> double mapping or we need some tracking mechanism in DPDK.
> 
> Yes, i will check this for VFIO and get back to you.

Great thanks. 

> 
> >
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly
Burakov, Anatoly Nov. 22, 2018, 11:34 a.m. UTC | #14
On 22-Nov-18 11:31 AM, Shahaf Shuler wrote:
> Thursday, November 22, 2018 12:42 PM, Burakov, Anatoly:
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
>> external memory
>>
>> On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
>>> Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
>>>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping
>>>> for external memory
>>>>
>>>> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
>>>>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>>
>>>>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>>>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
>>>>>>>>
>>>>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>>>> external memory
>>>>>>>>>>
>>>>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>
>>>>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping
>> for
>>>>>>>>>>>> external memory
>>>>>>>>>>>>
>>>>>
>>>>> [...]
>>>
>>> [...]
>>>
>>> I thought the 18.11 feature should serve both cases.
>>> To use memory not managed by DPDK, the user should create a heap and
>> populate it w/ the memory, so that it can be used later for mempools/rings.
>>
>> The feature introduced in 18.11 expects you to use DPDK allocation methods
>> with added memory. It is not a supported use case to create an external
>> memory area, register it with heap *and* use manual memory management
>> on that area (e.g. populate mempool anonymously).
>>
>> Perhaps there should be a new API, something like "register_memory",
>> where we only register memory, but do not expect the user to use DPDK
>> allocator with said memory.
> 
> Isn't it the existing rte_vfio_dma_map?

It does provide DMA mapping, but it doesn't register the memory in DPDK's 
fbarrays, so any fbarray-related calls (e.g. rte_virt2iova) will fail.
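To make the distinction concrete: DMA mapping and registration in DPDK's page tables are two separate bookkeeping steps, and a virt-to-IOVA lookup only consults the second. Below is a toy model of that gap; `virt2iova`, `dma_map_only` and `register_memory` are illustrative names, not the real rte_* functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy page-table-like registry backing rte_mem_virt2iova()-style lookups.
 * A bare DMA map does not touch it; a register_memory()-style call would. */
struct mem_region { uint64_t vaddr, iova, len; };

#define MAX_REGIONS 8
static struct mem_region registry[MAX_REGIONS];
static int n_regions;

static uint64_t
virt2iova(uint64_t vaddr)	/* models rte_mem_virt2iova() */
{
	int i;

	for (i = 0; i < n_regions; i++)
		if (vaddr >= registry[i].vaddr &&
		    vaddr < registry[i].vaddr + registry[i].len)
			return registry[i].iova + (vaddr - registry[i].vaddr);
	return UINT64_MAX;	/* models RTE_BAD_IOVA */
}

static int
dma_map_only(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	/* Would program the IOMMU/device here; leaves the registry alone. */
	(void)vaddr; (void)iova; (void)len;
	return 0;
}

static int
register_memory(uint64_t vaddr, uint64_t iova, uint64_t len)
{
	if (n_regions == MAX_REGIONS)
		return -1;
	registry[n_regions] = (struct mem_region){ vaddr, iova, len };
	n_regions++;
	return dma_map_only(vaddr, iova, len);
}
```

A region fed through `dma_map_only` is reachable by the device but invisible to the lookup, which is exactly the failure mode described above.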

> 
> I wish i would've thought of that before, where
>> have you been all this time! :D I think this should feature definitely go into
>> 19.02. Are we past the v1 deadline?
>>
>>>
>>>>
>>>> The above scenario (working with anonymous memory in mempools) is
>>>> still "broken" in this sense - and i see no general way to fix it. It
>>>> could work with mempools in that we could automatically create new
>>>> heaps behind the scenes and populate them with new memory, but for
>>>> other data structures (e.g. rte_ring_init() over anonymous memory),
>>>> it's just not possible to do that.
>>>>
>>>> I personally would prefer these API's to just go away and people
>>>> started using the "official" way to manage external memory. Removing
>>>> the API's may even be feasible for everything bar mempool, and
>>>> mempool support we can manage more-or-less automatically in the
>> above
>>>> way (by automatically creating an external heap behind the scenes).
>>>> But then by the time we're creating/adding external memory heaps
>>>> inside mempool API's, you don't really need the external memory DMA
>>>> mapping API calls because the mempool would do that automatically by
>>>> way of using the external heaps API.
>>>
>>> Yes, one way to go is to map the memory internally inside the mempool
>> creation. But it will not work for all cases.
>>> For example, if the application wants to work w/ mbufs with external buffers
>> (EXT_ATTACHED_MBUF), the buffer will be attached in a different call
>> (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
>>> Maybe we will find more cases in the future. I would prefer to solve it for all
>> cases.  See more below.
>>>
>>>>
>>>> In other words, the problem here isn't so much usability as it is the
>>>> fact that, for some stuff, a lot of internal code now relies on
>>>> DPDK's internal page tables, and we have to either make allowances
>>>> for that on a case-by-case basis (by the way, no one came to me with
>>>> these issues!),
>>>
>>> One option is case by case; the other can be to allow the application to explicitly
>> map this memory to the different devices.
>>>
>>> or admit that using
>>>> unregistered memory is a bad idea and let's not do that in the future.
>>>
>>> I think it is unavoidable (to use unregistered memory). The use case exists:
>> there are applications which prefer to have their own memory management.
>>> For those apps, we need to have a solution.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> On the other hand, if you were to use it in a different way - for
>>>>>> example, allocating hash tables or other DPDK data structures -
>>>>>> then such a feature is essential.
>>>>>
>>>>> Hash tables that the NIC needs to read/write to/from?
>>>>> I didn't get the point here.
>>>>
>>>> The point wasn't hash tables, but support for use cases we might not
>>>> have thought of. That includes doing strange things like allocating
>>>> DPDK data structures in non-standard memory locations.
>>>>
>>>>>
>>>>> The entire point was to allow using external memory with
>>>>>> semantics identical to how you use the rest of DPDK.
>>>>>
>>>>> I understand this design choice, and it has the benefit of consistency v.s.
>>>> more API call on the application side.
>>>>>
>>>>>>
>>>>>> Also, whether it's "intuitive" depends on perspective - you say "i
>>>>>> expect to allocate memory and map it for DMA myself", i say "why do
>>>>>> i have to care about DMA mapping, DPDK should do this for me" :)
>>>>>
>>>>> Right ๐Ÿ˜Š. We should be driven by customers and not vendors; too bad
>>>>> there
>>>> is not much feedback from them in the community.
>>>>
>>>> That implies current design *wasn't* driven by customers, which is not
>> true.
>>>> They may not be the same as ones you are aware of - but then this
>>>> isn't really my fault :)
>>>>
>>>>> I know from the VPP side they prefer not to use the rte_socket APIs but rather to
>>>> manage the memory by themselves, and they are perfectly fine with
>>>> mapping it explicitly. The same is true for another company that has a
>>>> vswitch solution (which I cannot mention by name).
>>>>
>>>> Well, it's not like i can do anything if no one from either
>>>> communities comes here and engages in the discussion. I wish i could
>>>> read people's minds, but alas :)
>>>>
>>>>>
>>>>>
>>>>> If you are using your
>>>>>> own memory management and doing everything manually - you get to
>>>> map
>>>>>> everything for DMA manually. If you are using DPDK facilities
>>>>>> - it is intuitive that for any DPDK-managed memory, internal or
>>>>>> external, same rules apply across the board - DMA mapping included.
>>>>>
>>>>> So if I understand your point here, you claim "we need both": one for
>>>>> users
>>>> who prefer the DPDK to do that for them, and the other for the users
>>>> who want more control.
>>>>> However, when keeping the API of explicit mapping, we can no longer
>>>>> say
>>>> that external memory behaves exactly the same as internal.
>>>>
>>>> [...]
Shahaf Shuler Jan. 14, 2019, 6:12 a.m. UTC | #15
Hi Anatoly,

Any last inputs on this one? I would like to prepare a patch for 19.05, and it would much affect the design.

Thursday, November 22, 2018 1:32 PM, Shahaf Shuler:
> Subject: RE: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
> external memory
> 


[...]


> > > , and we will have to rewrite current DMA mapping code to do
> > >> mappings explicitly, rather than relying on callbacks mechanism.
> > >> So, instead of e.g. VFIO mapping being performed via callback, we
> > >> would instead iterate over each rte_device and call its respective
> > >> DMA mapping function explicitly - for internal memory, but not for
> > >> external. This
> > >> *may* work, but i need to think it through a bit more :)
> > >>
> > >> This also still leaves a question of external unregistered memory -
> > >> this is a wholly separate issue.
> > >>
> > >> TL;DR i'm tentatively OK with the proposal, but we need to work
> > >> certain things out before we go too far along the path and find out
> > >> that we've reached a dead end.
> > >
> > > Fully agree, this is why the RFC is for.
> > > I think our biggest question is whether or not VFIO will behave OK
> > > with
> > double mapping or we need some tracking mechanism in DPDK.
> >
> > Yes, i will check this for VFIO and get back to you.
> 
> Great thanks.

?

> 
> >
> > >
> > >>
> > >> --
> > >> Thanks,
> > >> Anatoly
> >
> >
> > --
> > Thanks,
> > Anatoly
Burakov, Anatoly Jan. 15, 2019, 12:07 p.m. UTC | #16
On 14-Jan-19 6:12 AM, Shahaf Shuler wrote:
> Hi Anatoly,
> 
> Any last inputs on this one? Would like to prepare a patch for 19.05 and it would much effect the design.
> 

Hi Shahaf,

Unfortunately, i've been preoccupied with other stuff lately. I will 
have another look at this issue over the week and come back to you. 
Please feel free to ping me personally if i don't :)
Shahaf Shuler Jan. 16, 2019, 11:04 a.m. UTC | #17
Tuesday, January 15, 2019 2:07 PM, Burakov, Anatoly:
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
> external memory
> 
> On 14-Jan-19 6:12 AM, Shahaf Shuler wrote:
> > Hi Anatoly,
> >
> > Any last inputs on this one? Would like to prepare a patch for 19.05 and it
> would much effect the design.
> >
> 
> Hi Shahaf,
> 
> Unfortunately, i've been preoccupied with other stuff lately. I will have
> another look at this issue over the week and come back to you.
> Please feel free to ping me personally if i don't :)

Sure, will ๐Ÿ˜Š.

Just to mention, I traversed the VFIO kernel code and it looks like it has what it takes:
for type1 mapping[1] I would expect to get EEXIST;
for spapr mapping[2] I would expect to get EBUSY.

Will wait for your findings.

[1]
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1108

[2]
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_spapr_tce.c#L176
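If the kernel indeed reports EEXIST (type1) or EBUSY (spapr) as the links above suggest, option 1 from the earlier discussion amounts to tolerating those errno values in DPDK's mapping wrapper, with no database on the DPDK side. A self-contained toy sketch (the ioctl is faked here; none of this is real VFIO code):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for the VFIO_IOMMU_MAP_DMA ioctl: the type1 backend
 * rejects an overlapping mapping with EEXIST (spapr would use EBUSY). */
#define MAX_MAPS 8
static struct { uint64_t iova, len; } maps[MAX_MAPS];
static int n_maps;

static int
fake_vfio_map_ioctl(uint64_t iova, uint64_t len)
{
	int i;

	for (i = 0; i < n_maps; i++)
		if (iova < maps[i].iova + maps[i].len &&
		    maps[i].iova < iova + len) {
			errno = EEXIST;
			return -1;
		}
	if (n_maps == MAX_MAPS) {
		errno = ENOSPC;
		return -1;
	}
	maps[n_maps].iova = iova;
	maps[n_maps].len = len;
	n_maps++;
	return 0;
}

/* Option 1 from the discussion: rely on the kernel's error report and
 * treat a duplicate mapping as success, instead of tracking IOVAs in a
 * DPDK-side database (which is hard with secondary processes). */
static int
container_dma_map_tolerant(uint64_t iova, uint64_t len)
{
	if (fake_vfio_map_ioctl(iova, len) == 0)
		return 0;
	if (errno == EEXIST || errno == EBUSY)
		return 0;	/* already mapped by another device's setup */
	return -1;
}
```

Since the kernel is the single source of truth for what is mapped, this also stays consistent across primary and secondary processes sharing one container.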

> 
> --
> Thanks,
> Anatoly
diff mbox series

Patch

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 508cbc46fd..44667e8446 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -885,20 +885,6 @@  int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 }
 
 int
-rte_vfio_dma_map(uint64_t __rte_unused vaddr, __rte_unused uint64_t iova,
-		  __rte_unused uint64_t len)
-{
-	return -1;
-}
-
-int
-rte_vfio_dma_unmap(uint64_t __rte_unused vaddr, uint64_t __rte_unused iova,
-		    __rte_unused uint64_t len)
-{
-	return -1;
-}
-
-int
 rte_vfio_get_group_num(__rte_unused const char *sysfs_base,
 		       __rte_unused const char *dev_addr,
 		       __rte_unused int *iommu_group_num)
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index cae96fab90..d8d966dd33 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -188,50 +188,6 @@  int
 rte_vfio_clear_group(int vfio_group_fd);
 
 /**
- * Map memory region for use with VFIO.
- *
- * @note Require at least one device to be attached at the time of
- *       mapping. DMA maps done via this API will only apply to default
- *       container and will not apply to any of the containers created
- *       via rte_vfio_container_create().
- *
- * @param vaddr
- *   Starting virtual address of memory to be mapped.
- *
- * @param iova
- *   Starting IOVA address of memory to be mapped.
- *
- * @param len
- *   Length of memory segment being mapped.
- *
- * @return
- *   0 if success.
- *   -1 on error.
- */
-int
-rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len);
-
-
-/**
- * Unmap memory region from VFIO.
- *
- * @param vaddr
- *   Starting virtual address of memory to be unmapped.
- *
- * @param iova
- *   Starting IOVA address of memory to be unmapped.
- *
- * @param len
- *   Length of memory segment being unmapped.
- *
- * @return
- *   0 if success.
- *   -1 on error.
- */
-
-int
-rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len);
-/**
  * Parse IOMMU group number for a device
  *
  * This function is only relevant to linux and will return
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 0516b1597b..839dce243f 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1641,28 +1641,6 @@  container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
 }
 
 int
-rte_vfio_dma_map(uint64_t vaddr, uint64_t iova, uint64_t len)
-{
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	return container_dma_map(default_vfio_cfg, vaddr, iova, len);
-}
-
-int
-rte_vfio_dma_unmap(uint64_t vaddr, uint64_t iova, uint64_t len)
-{
-	if (len == 0) {
-		rte_errno = EINVAL;
-		return -1;
-	}
-
-	return container_dma_unmap(default_vfio_cfg, vaddr, iova, len);
-}
-
-int
 rte_vfio_noiommu_is_enabled(void)
 {
 	int fd;
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 36e5389b3a..acc9d819a1 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -4447,6 +4447,44 @@  rte_eth_devargs_parse(const char *dargs, struct rte_eth_devargs *eth_da)
 	return result;
 }
 
+int __rte_experimental
+rte_eth_dma_map(uint16_t port_id, uint64_t vaddr, uint64_t iova,
+		uint64_t len)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dma_map, -ENOTSUP);
+
+	return (*dev->dev_ops->dma_map)(dev, vaddr, iova, len);
+}
+
+int __rte_experimental
+rte_eth_dma_unmap(uint16_t port_id, uint64_t vaddr, uint64_t iova,
+		  uint64_t len)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+
+	dev = &rte_eth_devices[port_id];
+	if (len == 0) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dma_unmap, -ENOTSUP);
+
+	return (*dev->dev_ops->dma_unmap)(dev, vaddr, iova, len);
+}
+
 RTE_INIT(ethdev_init_log)
 {
 	rte_eth_dev_logtype = rte_log_register("lib.ethdev");
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 769a694309..ad695ab70b 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -4368,6 +4368,46 @@  rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+/**
+ * Map memory region for use of an ethdev.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param vaddr
+ *   Starting virtual address of memory to be mapped.
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ * @param len
+ *   Length of memory segment being mapped.
+ *
+ * @return
+ *   0 if success.
+ *   -1 on error.
+ */
+int
+rte_eth_dma_map(uint16_t port_id, uint64_t vaddr, uint64_t iova, uint64_t len);
+
+
+/**
+ * Unmap memory region from an ethdev.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param vaddr
+ *   Starting virtual address of memory to be unmapped.
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ * @param len
+ *   Length of memory segment being unmapped.
+ *
+ * @return
+ *   0 if success.
+ *   -1 on error.
+ */
+int
+rte_eth_dma_unmap(uint16_t port_id, uint64_t vaddr, uint64_t iova,
+		  uint64_t len);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_ethdev/rte_ethdev_core.h b/lib/librte_ethdev/rte_ethdev_core.h
index 8f03f83f62..41568ca57f 100644
--- a/lib/librte_ethdev/rte_ethdev_core.h
+++ b/lib/librte_ethdev/rte_ethdev_core.h
@@ -377,6 +377,14 @@  typedef int (*eth_pool_ops_supported_t)(struct rte_eth_dev *dev,
 						const char *pool);
 /**< @internal Test if a port supports specific mempool ops */
 
+typedef int (*eth_dma_map_t)(struct rte_eth_dev *dev, uint64_t vaddr,
+			     uint64_t iova, uint64_t len);
+/**< @internal Register memory for DMA usage by the device */
+
+typedef int (*eth_dma_unmap_t)(struct rte_eth_dev *dev, uint64_t vaddr,
+			       uint64_t iova, uint64_t len);
+/**< @internal Un-register memory from DMA usage by the device */
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -509,6 +517,12 @@  struct eth_dev_ops {
 
 	eth_pool_ops_supported_t pool_ops_supported;
 	/**< Test if a port supports specific mempool ops */
+
+	eth_dma_map_t dma_map;
+	/**< Register memory for DMA usage by the device */
+
+	eth_dma_unmap_t dma_unmap;
+	/**< Un-register memory from DMA usage by the device */
 };
 
 /**