[3/5] common/dpaax: add library for PA VA translation table

Message ID 20180925125423.7505-4-shreyansh.jain@nxp.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series Add a PA-VA Translation table for DPAAx |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Shreyansh Jain Sept. 25, 2018, 12:54 p.m. UTC
  A common library, valid for dpaaX drivers, which is used to maintain
a local copy of PA->VA translations.

In case of physical addressing mode (one of the option for FSLMC, and
only option for DPAA bus), the addresses of descriptors Rx'd are
physical. These need to be converted into equivalent VA for rte_mbuf
and other similar calls.

Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
library is an attempt to reduce the overall cost associated with
this translation.

A small table is maintained, containing continuous entries
representing a continguous physical range. Each of these entries
stores the equivalent VA, which is fed during mempool creation, or
memory allocation/deallocation callbacks.

Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
---
 config/common_base                            |   5 +
 config/common_linuxapp                        |   5 +
 drivers/common/Makefile                       |   4 +
 drivers/common/dpaax/Makefile                 |  31 ++
 drivers/common/dpaax/dpaax_iova_table.c       | 509 ++++++++++++++++++
 drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
 drivers/common/dpaax/dpaax_logs.h             |  39 ++
 drivers/common/dpaax/meson.build              |  12 +
 .../common/dpaax/rte_common_dpaax_version.map |  12 +
 drivers/common/meson.build                    |   2 +-
 10 files changed, 722 insertions(+), 1 deletion(-)
 create mode 100644 drivers/common/dpaax/Makefile
 create mode 100644 drivers/common/dpaax/dpaax_iova_table.c
 create mode 100644 drivers/common/dpaax/dpaax_iova_table.h
 create mode 100644 drivers/common/dpaax/dpaax_logs.h
 create mode 100644 drivers/common/dpaax/meson.build
 create mode 100644 drivers/common/dpaax/rte_common_dpaax_version.map
  

Comments

Burakov, Anatoly Sept. 25, 2018, 1:28 p.m. UTC | #1
On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
> A common library, valid for dpaaX drivers, which is used to maintain
> a local copy of PA->VA translations.
> 
> In case of physical addressing mode (one of the option for FSLMC, and
> only option for DPAA bus), the addresses of descriptors Rx'd are
> physical. These need to be converted into equivalent VA for rte_mbuf
> and other similar calls.
> 
> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
> library is an attempt to reduce the overall cost associated with
> this translation.
> 
> A small table is maintained, containing continuous entries
> representing a continguous physical range. Each of these entries
> stores the equivalent VA, which is fed during mempool creation, or
> memory allocation/deallocation callbacks.
> 
> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
> ---

Hi Shreyansh,

So, basically, you're reimplementing old DPDK's memory view (storing 
VA's in a PA-centric way). Makes sense :)

I should caution you that right now, external memory allocator 
implementation does *not* trigger any callbacks for newly added memory. 
So, anything coming from external memory will not be reflected in your 
table, unless it happens to be already there before 
dpaax_iova_table_populate() gets called. This patchset makes a good 
argument for why perhaps it should trigger callbacks. Thoughts?

Also, a couple of nitpicks below.

>   config/common_base                            |   5 +
>   config/common_linuxapp                        |   5 +
>   drivers/common/Makefile                       |   4 +
>   drivers/common/dpaax/Makefile                 |  31 ++
>   drivers/common/dpaax/dpaax_iova_table.c       | 509 ++++++++++++++++++
>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>   drivers/common/dpaax/meson.build              |  12 +

<snip>

> +	DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for vaddr:(%p),"
> +		    " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
> +		    vaddr, paddr, length);
> +	return 0;
> +}
> +
> +int
> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)

len is not unused.

> +{
> +	int found = 0;
> +	unsigned int i;
> +	size_t e_offset;
> +	struct dpaax_iovat_element *entry;
> +	phys_addr_t align_paddr;
> +
> +	align_paddr = paddr & DPAAX_MEM_SPLIT_MASK;
> +
> +	/* Check if paddr is available in table */

<snip>

> +static inline void *
> +dpaax_iova_table_get_va(phys_addr_t paddr) {
> +	unsigned int i = 0, index;
> +	void *vaddr = 0;
> +	phys_addr_t paddr_align = paddr & DPAAX_MEM_SPLIT_MASK;
> +	size_t offset = paddr & DPAAX_MEM_SPLIT_MASK_OFF;
> +	struct dpaax_iovat_element *entry;
> +
> +	entry = dpaax_iova_table_p->entries;
> +
> +	do {
> +		if (unlikely(i > dpaax_iova_table_p->count))
> +			break;
> +
> +		if (paddr_align < entry[i].start) {
> +			/* Incorrect paddr; Not in memory range */
> +			return 0;

NULL?
  
Shreyansh Jain Sept. 25, 2018, 1:39 p.m. UTC | #2
Hello Anatoly,

On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>> A common library, valid for dpaaX drivers, which is used to maintain
>> a local copy of PA->VA translations.
>>
>> In case of physical addressing mode (one of the option for FSLMC, and
>> only option for DPAA bus), the addresses of descriptors Rx'd are
>> physical. These need to be converted into equivalent VA for rte_mbuf
>> and other similar calls.
>>
>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>> library is an attempt to reduce the overall cost associated with
>> this translation.
>>
>> A small table is maintained, containing continuous entries
>> representing a continguous physical range. Each of these entries
>> stores the equivalent VA, which is fed during mempool creation, or
>> memory allocation/deallocation callbacks.
>>
>> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
>> ---
> 
> Hi Shreyansh,
> 
> So, basically, you're reimplementing old DPDK's memory view (storing 
> VA's in a PA-centric way). Makes sense :)

Yes, and frankly, I couldn't come up with any other way.

> 
> I should caution you that right now, external memory allocator 
> implementation does *not* trigger any callbacks for newly added memory. 
> So, anything coming from external memory will not be reflected in your 
> table, unless it happens to be already there before 
> dpaax_iova_table_populate() gets called. This patchset makes a good 
> argument for why perhaps it should trigger callbacks. Thoughts?

Oh. Then I must be finishing reading through your patches for external 
memory sooner. I didn't realize this.

> 
> Also, a couple of nitpicks below.
> 
>>   config/common_base                            |   5 +
>>   config/common_linuxapp                        |   5 +
>>   drivers/common/Makefile                       |   4 +
>>   drivers/common/dpaax/Makefile                 |  31 ++
>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 ++++++++++++++++++
>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>   drivers/common/dpaax/meson.build              |  12 +
> 
> <snip>
> 
>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for vaddr:(%p),"
>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>> +            vaddr, paddr, length);
>> +    return 0;
>> +}
>> +
>> +int
>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
> 
> len is not unused.

I will fix this.
Actually, this function itself is useless - more for symmetry reason.
Callers would be either simply updating the table, or ignoring it 
completely. But, yes, this is indeed wrong that I set that unused.

> 
>> +{
>> +    int found = 0;
>> +    unsigned int i;
>> +    size_t e_offset;
>> +    struct dpaax_iovat_element *entry;
>> +    phys_addr_t align_paddr;
>> +
>> +    align_paddr = paddr & DPAAX_MEM_SPLIT_MASK;
>> +
>> +    /* Check if paddr is available in table */
> 
> <snip>
> 
>> +static inline void *
>> +dpaax_iova_table_get_va(phys_addr_t paddr) {
>> +    unsigned int i = 0, index;
>> +    void *vaddr = 0;
>> +    phys_addr_t paddr_align = paddr & DPAAX_MEM_SPLIT_MASK;
>> +    size_t offset = paddr & DPAAX_MEM_SPLIT_MASK_OFF;
>> +    struct dpaax_iovat_element *entry;
>> +
>> +    entry = dpaax_iova_table_p->entries;
>> +
>> +    do {
>> +        if (unlikely(i > dpaax_iova_table_p->count))
>> +            break;
>> +
>> +        if (paddr_align < entry[i].start) {
>> +            /* Incorrect paddr; Not in memory range */
>> +            return 0;
> 
> NULL?

Yes, NULL. I will fix that as well.

>
  
Burakov, Anatoly Sept. 25, 2018, 1:51 p.m. UTC | #3
On 25-Sep-18 2:39 PM, Shreyansh Jain wrote:
> Hello Anatoly,
> 
> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>> A common library, valid for dpaaX drivers, which is used to maintain
>>> a local copy of PA->VA translations.
>>>
>>> In case of physical addressing mode (one of the option for FSLMC, and
>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>> and other similar calls.
>>>
>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>> library is an attempt to reduce the overall cost associated with
>>> this translation.
>>>
>>> A small table is maintained, containing continuous entries
>>> representing a continguous physical range. Each of these entries
>>> stores the equivalent VA, which is fed during mempool creation, or
>>> memory allocation/deallocation callbacks.
>>>
>>> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
>>> ---
>>
>> Hi Shreyansh,
>>
>> So, basically, you're reimplementing old DPDK's memory view (storing 
>> VA's in a PA-centric way). Makes sense :)
> 
> Yes, and frankly, I couldn't come up with any other way.
> 
>>
>> I should caution you that right now, external memory allocator 
>> implementation does *not* trigger any callbacks for newly added 
>> memory. So, anything coming from external memory will not be reflected 
>> in your table, unless it happens to be already there before 
>> dpaax_iova_table_populate() gets called. This patchset makes a good 
>> argument for why perhaps it should trigger callbacks. Thoughts?
> 
> Oh. Then I must be finishing reading through your patches for external 
> memory sooner. I didn't realize this.

To be clear, the current implementation of external memory allocators is 
not necessarily final - it's not too late to add callbacks to enable 
your use case better, if that's required (and it should be pretty easy 
to implement as well).
  
Shreyansh Jain Sept. 25, 2018, 2 p.m. UTC | #4
On Tuesday 25 September 2018 07:21 PM, Burakov, Anatoly wrote:
> On 25-Sep-18 2:39 PM, Shreyansh Jain wrote:
>> Hello Anatoly,
>>
>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>> a local copy of PA->VA translations.
>>>>
>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>> and other similar calls.
>>>>
>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>> library is an attempt to reduce the overall cost associated with
>>>> this translation.
>>>>
>>>> A small table is maintained, containing continuous entries
>>>> representing a continguous physical range. Each of these entries
>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>> memory allocation/deallocation callbacks.
>>>>
>>>> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
>>>> ---
>>>
>>> Hi Shreyansh,
>>>
>>> So, basically, you're reimplementing old DPDK's memory view (storing 
>>> VA's in a PA-centric way). Makes sense :)
>>
>> Yes, and frankly, I couldn't come up with any other way.
>>
>>>
>>> I should caution you that right now, external memory allocator 
>>> implementation does *not* trigger any callbacks for newly added 
>>> memory. So, anything coming from external memory will not be 
>>> reflected in your table, unless it happens to be already there before 
>>> dpaax_iova_table_populate() gets called. This patchset makes a good 
>>> argument for why perhaps it should trigger callbacks. Thoughts?
>>
>> Oh. Then I must be finishing reading through your patches for external 
>> memory sooner. I didn't realize this.
> 
> To be clear, the current implementation of external memory allocators is 
> not necessarily final - it's not too late to add callbacks to enable 
> your use case better, if that's required (and it should be pretty easy 
> to implement as well).
> 

Is there any reason why external may not be raising call back right now? 
I might have missed any previous conversation on this. Or may be, it is 
just lack of need.

As for whether it is required - I do see a need. It is definitely 
possible that after rte_eal_init has been completed (and underlying 
probe), applications allocate memory. In which case, even existing 
memevent callbacks (like the one in fslmc_bus, which VFIO/DMA maps the 
area) would have issues. From the external memory patchset, I do see 
that it is assumed DMA mapping is caller's responsibility.

Having such callback would help drives reduce that throwback of 
responsibility.

(Speaking of external memory patches, I also realize that my memevent 
callback in this patch series need to handle msl->external).
  
Burakov, Anatoly Sept. 25, 2018, 2:08 p.m. UTC | #5
On 25-Sep-18 3:00 PM, Shreyansh Jain wrote:
> On Tuesday 25 September 2018 07:21 PM, Burakov, Anatoly wrote:
>> On 25-Sep-18 2:39 PM, Shreyansh Jain wrote:
>>> Hello Anatoly,
>>>
>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>> a local copy of PA->VA translations.
>>>>>
>>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>> and other similar calls.
>>>>>
>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>> library is an attempt to reduce the overall cost associated with
>>>>> this translation.
>>>>>
>>>>> A small table is maintained, containing continuous entries
>>>>> representing a continguous physical range. Each of these entries
>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>> memory allocation/deallocation callbacks.
>>>>>
>>>>> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
>>>>> ---
>>>>
>>>> Hi Shreyansh,
>>>>
>>>> So, basically, you're reimplementing old DPDK's memory view (storing 
>>>> VA's in a PA-centric way). Makes sense :)
>>>
>>> Yes, and frankly, I couldn't come up with any other way.
>>>
>>>>
>>>> I should caution you that right now, external memory allocator 
>>>> implementation does *not* trigger any callbacks for newly added 
>>>> memory. So, anything coming from external memory will not be 
>>>> reflected in your table, unless it happens to be already there 
>>>> before dpaax_iova_table_populate() gets called. This patchset makes 
>>>> a good argument for why perhaps it should trigger callbacks. Thoughts?
>>>
>>> Oh. Then I must be finishing reading through your patches for 
>>> external memory sooner. I didn't realize this.
>>
>> To be clear, the current implementation of external memory allocators 
>> is not necessarily final - it's not too late to add callbacks to 
>> enable your use case better, if that's required (and it should be 
>> pretty easy to implement as well).
>>
> 
> Is there any reason why external may not be raising call back right now? 
> I might have missed any previous conversation on this. Or may be, it is 
> just lack of need.

Well, pretty much - it didn't occur to me that it may be needed. I 
specifically went out of my way to note that it is the responsibility of 
the user to perform any DMA mappings, but i missed the fact that there 
may be other users interested to know that a user has just added a new 
external memory segment.

> 
> As for whether it is required - I do see a need. It is definitely 
> possible that after rte_eal_init has been completed (and underlying 
> probe), applications allocate memory. In which case, even existing 
> memevent callbacks (like the one in fslmc_bus, which VFIO/DMA maps the 
> area) would have issues. From the external memory patchset, I do see 
> that it is assumed DMA mapping is caller's responsibility.
> 
> Having such callback would help drives reduce that throwback of 
> responsibility.

I do not want to assume that user necessarily wants to map external 
memory for DMA unless explicitly asked to do so. At the same time, i can 
see that some uses may not have anything to do with DMA mapping and may 
instead be cases like yours, where you just need the address. In our 
case, we can just ignore external memory in VFIO and virtio callbacks, 
but still allow other callbacks to handle external memory as they see fit.

> 
> (Speaking of external memory patches, I also realize that my memevent 
> callback in this patch series need to handle msl->external).

Yes, we have to be careful on merge.

>
  
Burakov, Anatoly Sept. 26, 2018, 10:16 a.m. UTC | #6
On 25-Sep-18 3:08 PM, Burakov, Anatoly wrote:
> On 25-Sep-18 3:00 PM, Shreyansh Jain wrote:
>> On Tuesday 25 September 2018 07:21 PM, Burakov, Anatoly wrote:
>>> On 25-Sep-18 2:39 PM, Shreyansh Jain wrote:
>>>> Hello Anatoly,
>>>>
>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>>> a local copy of PA->VA translations.
>>>>>>
>>>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>>> and other similar calls.
>>>>>>
>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>> this translation.
>>>>>>
>>>>>> A small table is maintained, containing continuous entries
>>>>>> representing a continguous physical range. Each of these entries
>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>> memory allocation/deallocation callbacks.
>>>>>>
>>>>>> Signed-off-by: Shreyansh Jain <shreyansh.jain@nxp.com>
>>>>>> ---
>>>>>
>>>>> Hi Shreyansh,
>>>>>
>>>>> So, basically, you're reimplementing old DPDK's memory view 
>>>>> (storing VA's in a PA-centric way). Makes sense :)
>>>>
>>>> Yes, and frankly, I couldn't come up with any other way.
>>>>
>>>>>
>>>>> I should caution you that right now, external memory allocator 
>>>>> implementation does *not* trigger any callbacks for newly added 
>>>>> memory. So, anything coming from external memory will not be 
>>>>> reflected in your table, unless it happens to be already there 
>>>>> before dpaax_iova_table_populate() gets called. This patchset makes 
>>>>> a good argument for why perhaps it should trigger callbacks. Thoughts?
>>>>
>>>> Oh. Then I must be finishing reading through your patches for 
>>>> external memory sooner. I didn't realize this.
>>>
>>> To be clear, the current implementation of external memory allocators 
>>> is not necessarily final - it's not too late to add callbacks to 
>>> enable your use case better, if that's required (and it should be 
>>> pretty easy to implement as well).
>>>
>>
>> Is there any reason why external may not be raising call back right 
>> now? I might have missed any previous conversation on this. Or may be, 
>> it is just lack of need.
> 
> Well, pretty much - it didn't occur to me that it may be needed. I 
> specifically went out of my way to note that it is the responsibility of 
> the user to perform any DMA mappings, but i missed the fact that there 
> may be other users interested to know that a user has just added a new 
> external memory segment.
> 
>>
>> As for whether it is required - I do see a need. It is definitely 
>> possible that after rte_eal_init has been completed (and underlying 
>> probe), applications allocate memory. In which case, even existing 
>> memevent callbacks (like the one in fslmc_bus, which VFIO/DMA maps the 
>> area) would have issues. From the external memory patchset, I do see 
>> that it is assumed DMA mapping is caller's responsibility.
>>
>> Having such callback would help drives reduce that throwback of 
>> responsibility.
> 
> I do not want to assume that user necessarily wants to map external 
> memory for DMA unless explicitly asked to do so. At the same time, i can 
> see that some uses may not have anything to do with DMA mapping and may 
> instead be cases like yours, where you just need the address. In our 
> case, we can just ignore external memory in VFIO and virtio callbacks, 
> but still allow other callbacks to handle external memory as they see fit.
> 
>>
>> (Speaking of external memory patches, I also realize that my memevent 
>> callback in this patch series need to handle msl->external).
> 
> Yes, we have to be careful on merge.
> 

Hi Shreyansh,

I'm currently implementing callback support for external memory. I think 
the decision to leave DMA mapping to the user may have been a bad one.

For starters, one of the buses (bus/fslmc) has its own VFIO 
infrastructure independent of EAL's VFIO, so if we don't map it there - 
we can't map it at all because there's no generic manual way to do a DMA 
map with bus/fslmc driver after the fact.

Also, if we leave VFIO mapping to the user, we end up with an 
inconsistency where we provide memory callbacks which can potentially 
create DMA mappings (such as what some MLX drivers do), but VFIO DMA 
mappings will be ignored because "reasons".

The only place where we really *don't* want to see external memory is 
virtio, because there is no segment fd support for external memory yet. 
Every other place, i think it's a good idea to not skip external memory.

So, starting from v5, DMA mapping will be performed automatically for 
external memory when IOVA addresses are available.
  
Shreyansh Jain Oct. 9, 2018, 10:45 a.m. UTC | #7
On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
> Hello Anatoly,
> 
> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>> A common library, valid for dpaaX drivers, which is used to maintain
>>> a local copy of PA->VA translations.
>>>
>>> In case of physical addressing mode (one of the option for FSLMC, and
>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>> and other similar calls.
>>>
>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>> library is an attempt to reduce the overall cost associated with
>>> this translation.
>>>
>>> A small table is maintained, containing continuous entries
>>> representing a continguous physical range. Each of these entries
>>> stores the equivalent VA, which is fed during mempool creation, or
>>> memory allocation/deallocation callbacks.
>>>

[...]

> 
>>
>> Also, a couple of nitpicks below.
>>
>>>   cosnfig/common_base                            |   5 +
>>>   config/common_linuxapp                        |   5 +
>>>   drivers/common/Makefile                       |   4 +
>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 ++++++++++++++++++
>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>   drivers/common/dpaax/meson.build              |  12 +
>>
>> <snip>
>>
>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for vaddr:(%p),"
>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>> +            vaddr, paddr, length);
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>
>> len is not unused.
> 
> I will fix this.
> Actually, this function itself is useless - more for symmetry reason.
> Callers would be either simply updating the table, or ignoring it 
> completely. But, yes, this is indeed wrong that I set that unused.
> 

Actually, I was wrong in my first reply. In case of 
dpaax_iova_table_del(), len is indeed redundant. This is because the 
mapping is for a complete page (min of 2MB size), even if the request is 
for lesser length. So, removal of a single entry (of fixed size) would 
be done.

In fact, while on this, I think deleting a PA->VA entry itself is 
incorrect (not just useless). A single entry (~2MB equivalent) can 
represent multiple users (working on a rte_malloc'd area, for example). 
So, effectively, its always an update - not an add or del.

I will send updated series with this change.

[...]
  
Burakov, Anatoly Oct. 11, 2018, 9:03 a.m. UTC | #8
On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>> Hello Anatoly,
>>
>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>> a local copy of PA->VA translations.
>>>>
>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>> and other similar calls.
>>>>
>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>> library is an attempt to reduce the overall cost associated with
>>>> this translation.
>>>>
>>>> A small table is maintained, containing continuous entries
>>>> representing a continguous physical range. Each of these entries
>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>> memory allocation/deallocation callbacks.
>>>>
> 
> [...]
> 
>>
>>>
>>> Also, a couple of nitpicks below.
>>>
>>>>   cosnfig/common_base                            |   5 +
>>>>   config/common_linuxapp                        |   5 +
>>>>   drivers/common/Makefile                       |   4 +
>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>> ++++++++++++++++++
>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>
>>> <snip>
>>>
>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>> vaddr:(%p),"
>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>> +            vaddr, paddr, length);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int
>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>
>>> len is not unused.
>>
>> I will fix this.
>> Actually, this function itself is useless - more for symmetry reason.
>> Callers would be either simply updating the table, or ignoring it 
>> completely. But, yes, this is indeed wrong that I set that unused.
>>
> 
> Actually, I was wrong in my first reply. In case of 
> dpaax_iova_table_del(), len is indeed redundant. This is because the 
> mapping is for a complete page (min of 2MB size), even if the request is 
> for lesser length. So, removal of a single entry (of fixed size) would 
> be done.
> 
> In fact, while on this, I think deleting a PA->VA entry itself is 
> incorrect (not just useless). A single entry (~2MB equivalent) can 
> represent multiple users (working on a rte_malloc'd area, for example). 
> So, effectively, its always an update - not an add or del.

I'm not sure what you mean here. If you got a mem event about memory 
area being freed, it's guaranteed to *not* have any users - neither 
malloc, nor any other memory. And len is always page-aligned.

> 
> I will send updated series with this change.
> 
> [...]
> 
>
  
Shreyansh Jain Oct. 11, 2018, 10:02 a.m. UTC | #9
On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>> Hello Anatoly,
>>>
>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>> a local copy of PA->VA translations.
>>>>>
>>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>> and other similar calls.
>>>>>
>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>> library is an attempt to reduce the overall cost associated with
>>>>> this translation.
>>>>>
>>>>> A small table is maintained, containing continuous entries
>>>>> representing a continguous physical range. Each of these entries
>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>> memory allocation/deallocation callbacks.
>>>>>
>>
>> [...]
>>
>>>
>>>>
>>>> Also, a couple of nitpicks below.
>>>>
>>>>>   cosnfig/common_base                            |   5 +
>>>>>   config/common_linuxapp                        |   5 +
>>>>>   drivers/common/Makefile                       |   4 +
>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>> ++++++++++++++++++
>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>
>>>> <snip>
>>>>
>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>> vaddr:(%p),"
>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>>> +            vaddr, paddr, length);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +int
>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>
>>>> len is not unused.
>>>
>>> I will fix this.
>>> Actually, this function itself is useless - more for symmetry reason.
>>> Callers would be either simply updating the table, or ignoring it 
>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>
>>
>> Actually, I was wrong in my first reply. In case of 
>> dpaax_iova_table_del(), len is indeed redundant. This is because the 
>> mapping is for a complete page (min of 2MB size), even if the request 
>> is for lesser length. So, removal of a single entry (of fixed size) 
>> would be done.
>>
>> In fact, while on this, I think deleting a PA->VA entry itself is 
>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>> represent multiple users (working on a rte_malloc'd area, for 
>> example). So, effectively, its always an update - not an add or del.
> 
> I'm not sure what you mean here. If you got a mem event about memory 
> area being freed, it's guaranteed to *not* have any users - neither 
> malloc, nor any other memory. And len is always page-aligned.

ok. Maybe I am getting this wrong, but consider this:

1) hugepage size=2MB
2) a = malloc(1M)
   this will pin an entry in table for a block starting at VA=(a) and 
PA=(a'). Each entry is of 2MB length - that means, even if someone were 
to access a+1048577 for an equivalent PA, they would get it (though, 
that is a incorrect access).
3) b = malloc(1M)
   this *might* lead to a case where same 2MB page is used and 
VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
= After b, the PA-VA table has a single entry of 2MB, representing two 
mallocs. It can be used for translation for any thread requesting PAs of 
a or b.
4) Free(a)
  - this would attempt to remove one 2MB entry from PA-VA table. But, 
'b' is already valid. Access to get_pa(VA(b)) should return me the PA(b).
  - 'len' is not even used as the entry in PA-VA table is of a fixed size.

In the above, (3) is an assumption I am making based on my understanding 
how mem allocator is working. Is that wrong?

Basically, this is a restriction of this table - it has a min chunk of 
2MB - even for 1G hugepages - and hence, it is not possible to honor 
deletes. I know this is convoluted logic - but, this keeps it simple and 
use-able without much performance impact.

[...]
  
Shreyansh Jain Oct. 11, 2018, 10:07 a.m. UTC | #10
On Thursday 11 October 2018 03:32 PM, Shreyansh Jain wrote:
> On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
>> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>>> Hello Anatoly,
>>>>
>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>>> a local copy of PA->VA translations.
>>>>>>
>>>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>>> and other similar calls.
>>>>>>
>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>> this translation.
>>>>>>
>>>>>> A small table is maintained, containing continuous entries
>>>>>> representing a continguous physical range. Each of these entries
>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>> memory allocation/deallocation callbacks.
>>>>>>
>>>
>>> [...]
>>>
>>>>
>>>>>
>>>>> Also, a couple of nitpicks below.
>>>>>
>>>>>>   cosnfig/common_base                            |   5 +
>>>>>>   config/common_linuxapp                        |   5 +
>>>>>>   drivers/common/Makefile                       |   4 +
>>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>>> ++++++++++++++++++
>>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>>
>>>>> <snip>
>>>>>
>>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>>> vaddr:(%p),"
>>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>>>> +            vaddr, paddr, length);
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +int
>>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>>
>>>>> len is not unused.
>>>>
>>>> I will fix this.
>>>> Actually, this function itself is useless - more for symmetry reason.
>>>> Callers would be either simply updating the table, or ignoring it 
>>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>>
>>>
>>> Actually, I was wrong in my first reply. In case of 
>>> dpaax_iova_table_del(), len is indeed redundant. This is because the 
>>> mapping is for a complete page (min of 2MB size), even if the request 
>>> is for lesser length. So, removal of a single entry (of fixed size) 
>>> would be done.
>>>
>>> In fact, while on this, I think deleting a PA->VA entry itself is 
>>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>>> represent multiple users (working on a rte_malloc'd area, for 
>>> example). So, effectively, its always an update - not an add or del.
>>
>> I'm not sure what you mean here. If you got a mem event about memory 
>> area being freed, it's guaranteed to *not* have any users - neither 
>> malloc, nor any other memory. And len is always page-aligned.
> 
> ok. Maybe I am getting this wrong, but consider this:
> 
> 1) hugepage size=2MB
> 2) a = malloc(1M)
>    this will pin an entry in table for a block starting at VA=(a) and 
> PA=(a'). Each entry is of 2MB length - that means, even if someone were 
> to access a+1048577 for an equivalent PA, they would get it (though, 
> that is a incorrect access).
> 3) b = malloc(1M)
>    this *might* lead to a case where same 2MB page is used and 
> VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
> = After b, the PA-VA table has a single entry of 2MB, representing two 
> mallocs. It can be used for translation for any thread requesting PAs of 
> a or b.
> 4) Free(a)
>   - this would attempt to remove one 2MB entry from PA-VA table. But, 
> 'b' is already valid. Access to get_pa(VA(b)) should return me the PA(b).
>   - 'len' is not even used as the entry in PA-VA table is of a fixed size.

Just to add to this:
- if talking about the mem_event callback, it definitely won't be a case 
where same page is still being served under another rte_malloc
- But, calls can come to delete from users of PA-VA table based on their 
own rte_free().

And, your comment makes me think - I should probably del entry from the 
table only when mem_event callback is received.

> 
> In the above, (3) is an assumption I am making based on my understanding 
> how mem allocator is working. Is that wrong?
> 
> Basically, this is a restriction of this table - it has a min chunk of 
> 2MB - even for 1G hugepages - and hence, it is not possible to honor 
> deletes. I know this is convoluted logic - but, this keeps it simple and 
> use-able without much performance impact.
> 
> [...]
>
  
Burakov, Anatoly Oct. 11, 2018, 10:09 a.m. UTC | #11
On 11-Oct-18 11:02 AM, Shreyansh Jain wrote:
> On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
>> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>>> Hello Anatoly,
>>>>
>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>>> a local copy of PA->VA translations.
>>>>>>
>>>>>> In case of physical addressing mode (one of the option for FSLMC, and
>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>>> and other similar calls.
>>>>>>
>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>> this translation.
>>>>>>
>>>>>> A small table is maintained, containing continuous entries
>>>>>> representing a continguous physical range. Each of these entries
>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>> memory allocation/deallocation callbacks.
>>>>>>
>>>
>>> [...]
>>>
>>>>
>>>>>
>>>>> Also, a couple of nitpicks below.
>>>>>
>>>>>>   cosnfig/common_base                            |   5 +
>>>>>>   config/common_linuxapp                        |   5 +
>>>>>>   drivers/common/Makefile                       |   4 +
>>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>>> ++++++++++++++++++
>>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>>
>>>>> <snip>
>>>>>
>>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>>> vaddr:(%p),"
>>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>>>> +            vaddr, paddr, length);
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +int
>>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>>
>>>>> len is not unused.
>>>>
>>>> I will fix this.
>>>> Actually, this function itself is useless - more for symmetry reason.
>>>> Callers would be either simply updating the table, or ignoring it 
>>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>>
>>>
>>> Actually, I was wrong in my first reply. In case of 
>>> dpaax_iova_table_del(), len is indeed redundant. This is because the 
>>> mapping is for a complete page (min of 2MB size), even if the request 
>>> is for lesser length. So, removal of a single entry (of fixed size) 
>>> would be done.
>>>
>>> In fact, while on this, I think deleting a PA->VA entry itself is 
>>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>>> represent multiple users (working on a rte_malloc'd area, for 
>>> example). So, effectively, its always an update - not an add or del.
>>
>> I'm not sure what you mean here. If you got a mem event about memory 
>> area being freed, it's guaranteed to *not* have any users - neither 
>> malloc, nor any other memory. And len is always page-aligned.
> 
> ok. Maybe I am getting this wrong, but consider this:
> 
> 1) hugepage size=2MB
> 2) a = malloc(1M)
>    this will pin an entry in table for a block starting at VA=(a) and 
> PA=(a'). Each entry is of 2MB length - that means, even if someone were 
> to access a+1048577 for an equivalent PA, they would get it (though, 
> that is a incorrect access).
> 3) b = malloc(1M)
>    this *might* lead to a case where same 2MB page is used and 
> VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
> = After b, the PA-VA table has a single entry of 2MB, representing two 
> mallocs. It can be used for translation for any thread requesting PAs of 
> a or b.
> 4) Free(a)
>   - this would attempt to remove one 2MB entry from PA-VA table. But, 

No, it wouldn't. Not unless 'b' was also freed. Malloc will not free the 
area unless there are no users of this area (as in, it checks if a free 
malloc element encompasses an entire page). If you do two allocations, 
1MB in size each, the hugepage will contain two buys malloc elements, 
one meg each. You free one, but the other one isn't free, so no action 
is taken.

> 'b' is already valid. Access to get_pa(VA(b)) should return me the PA(b).
>   - 'len' is not even used as the entry in PA-VA table is of a fixed size.
> 
> In the above, (3) is an assumption I am making based on my understanding 
> how mem allocator is working. Is that wrong?
> 
> Basically, this is a restriction of this table - it has a min chunk of 
> 2MB - even for 1G hugepages - and hence, it is not possible to honor 
> deletes. I know this is convoluted logic - but, this keeps it simple and 
> use-able without much performance impact.

The granularity of mem events are page size. You will not ever get a mem 
event on anything other than a full page, and you are guaranteed that it 
is free and not used by anything within DPDK.

> 
> [...]
> 
>
  
Burakov, Anatoly Oct. 11, 2018, 10:13 a.m. UTC | #12
On 11-Oct-18 11:07 AM, Shreyansh Jain wrote:
> On Thursday 11 October 2018 03:32 PM, Shreyansh Jain wrote:
>> On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
>>> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>>>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>>>> Hello Anatoly,
>>>>>
>>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>>> A common library, valid for dpaaX drivers, which is used to maintain
>>>>>>> a local copy of PA->VA translations.
>>>>>>>
>>>>>>> In case of physical addressing mode (one of the option for FSLMC, 
>>>>>>> and
>>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>>> physical. These need to be converted into equivalent VA for rte_mbuf
>>>>>>> and other similar calls.
>>>>>>>
>>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>>> this translation.
>>>>>>>
>>>>>>> A small table is maintained, containing continuous entries
>>>>>>> representing a continguous physical range. Each of these entries
>>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>>> memory allocation/deallocation callbacks.
>>>>>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>>>
>>>>>> Also, a couple of nitpicks below.
>>>>>>
>>>>>>>   cosnfig/common_base                            |   5 +
>>>>>>>   config/common_linuxapp                        |   5 +
>>>>>>>   drivers/common/Makefile                       |   4 +
>>>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>>>> ++++++++++++++++++
>>>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>>>> vaddr:(%p),"
>>>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>>>>> +            vaddr, paddr, length);
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +int
>>>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>>>
>>>>>> len is not unused.
>>>>>
>>>>> I will fix this.
>>>>> Actually, this function itself is useless - more for symmetry reason.
>>>>> Callers would be either simply updating the table, or ignoring it 
>>>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>>>
>>>>
>>>> Actually, I was wrong in my first reply. In case of 
>>>> dpaax_iova_table_del(), len is indeed redundant. This is because the 
>>>> mapping is for a complete page (min of 2MB size), even if the 
>>>> request is for lesser length. So, removal of a single entry (of 
>>>> fixed size) would be done.
>>>>
>>>> In fact, while on this, I think deleting a PA->VA entry itself is 
>>>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>>>> represent multiple users (working on a rte_malloc'd area, for 
>>>> example). So, effectively, its always an update - not an add or del.
>>>
>>> I'm not sure what you mean here. If you got a mem event about memory 
>>> area being freed, it's guaranteed to *not* have any users - neither 
>>> malloc, nor any other memory. And len is always page-aligned.
>>
>> ok. Maybe I am getting this wrong, but consider this:
>>
>> 1) hugepage size=2MB
>> 2) a = malloc(1M)
>>    this will pin an entry in table for a block starting at VA=(a) and 
>> PA=(a'). Each entry is of 2MB length - that means, even if someone 
>> were to access a+1048577 for an equivalent PA, they would get it 
>> (though, that is a incorrect access).
>> 3) b = malloc(1M)
>>    this *might* lead to a case where same 2MB page is used and 
>> VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
>> = After b, the PA-VA table has a single entry of 2MB, representing two 
>> mallocs. It can be used for translation for any thread requesting PAs 
>> of a or b.
>> 4) Free(a)
>>   - this would attempt to remove one 2MB entry from PA-VA table. But, 
>> 'b' is already valid. Access to get_pa(VA(b)) should return me the PA(b).
>>   - 'len' is not even used as the entry in PA-VA table is of a fixed 
>> size.
> 
> Just to add to this:
> - if talking about the mem_event callback, it definitely won't be a case 
> where same page is still being served under another rte_malloc
> - But, calls can come to delete from users of PA-VA table based on their 
> own rte_free().
> 
> And, your comment makes me think - I should probably del entry from the 
> table only when mem_event callback is received.

Mem events are not triggered on rte_free(), they're triggered on page 
deallocation. A call to rte_free/rte_memzone_free/rte_mempool_free etc. 
*might* trigger a page deallocation, but *only* if the memory area being 
freed encompasses an entire page. If you rte_malloc() 64 bytes and then 
rte_free() those 64 bytes, you won't get a mem event *unless* these were 
the only 64 bytes allocated on a particular page, and the entire page is 
no longer used by anything else.

> 
>>
>> In the above, (3) is an assumption I am making based on my 
>> understanding how mem allocator is working. Is that wrong?
>>
>> Basically, this is a restriction of this table - it has a min chunk of 
>> 2MB - even for 1G hugepages - and hence, it is not possible to honor 
>> deletes. I know this is convoluted logic - but, this keeps it simple 
>> and use-able without much performance impact.
>>
>> [...]
>>
> 
>
  
Shreyansh Jain Oct. 11, 2018, 10:39 a.m. UTC | #13
On Thursday 11 October 2018 03:43 PM, Burakov, Anatoly wrote:
> On 11-Oct-18 11:07 AM, Shreyansh Jain wrote:
>> On Thursday 11 October 2018 03:32 PM, Shreyansh Jain wrote:
>>> On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
>>>> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>>>>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>>>>> Hello Anatoly,
>>>>>>
>>>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>>>> A common library, valid for dpaaX drivers, which is used to 
>>>>>>>> maintain
>>>>>>>> a local copy of PA->VA translations.
>>>>>>>>
>>>>>>>> In case of physical addressing mode (one of the option for 
>>>>>>>> FSLMC, and
>>>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>>>> physical. These need to be converted into equivalent VA for 
>>>>>>>> rte_mbuf
>>>>>>>> and other similar calls.
>>>>>>>>
>>>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>>>> this translation.
>>>>>>>>
>>>>>>>> A small table is maintained, containing continuous entries
>>>>>>>> representing a continguous physical range. Each of these entries
>>>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>>>> memory allocation/deallocation callbacks.
>>>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Also, a couple of nitpicks below.
>>>>>>>
>>>>>>>>   cosnfig/common_base                            |   5 +
>>>>>>>>   config/common_linuxapp                        |   5 +
>>>>>>>>   drivers/common/Makefile                       |   4 +
>>>>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>>>>> ++++++++++++++++++
>>>>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>>>>> vaddr:(%p),"
>>>>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
>>>>>>>> +            vaddr, paddr, length);
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +int
>>>>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>>>>
>>>>>>> len is not unused.
>>>>>>
>>>>>> I will fix this.
>>>>>> Actually, this function itself is useless - more for symmetry reason.
>>>>>> Callers would be either simply updating the table, or ignoring it 
>>>>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>>>>
>>>>>
>>>>> Actually, I was wrong in my first reply. In case of 
>>>>> dpaax_iova_table_del(), len is indeed redundant. This is because 
>>>>> the mapping is for a complete page (min of 2MB size), even if the 
>>>>> request is for lesser length. So, removal of a single entry (of 
>>>>> fixed size) would be done.
>>>>>
>>>>> In fact, while on this, I think deleting a PA->VA entry itself is 
>>>>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>>>>> represent multiple users (working on a rte_malloc'd area, for 
>>>>> example). So, effectively, its always an update - not an add or del.
>>>>
>>>> I'm not sure what you mean here. If you got a mem event about memory 
>>>> area being freed, it's guaranteed to *not* have any users - neither 
>>>> malloc, nor any other memory. And len is always page-aligned.
>>>
>>> ok. Maybe I am getting this wrong, but consider this:
>>>
>>> 1) hugepage size=2MB
>>> 2) a = malloc(1M)
>>>    this will pin an entry in table for a block starting at VA=(a) and 
>>> PA=(a'). Each entry is of 2MB length - that means, even if someone 
>>> were to access a+1048577 for an equivalent PA, they would get it 
>>> (though, that is a incorrect access).
>>> 3) b = malloc(1M)
>>>    this *might* lead to a case where same 2MB page is used and 
>>> VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
>>> = After b, the PA-VA table has a single entry of 2MB, representing 
>>> two mallocs. It can be used for translation for any thread requesting 
>>> PAs of a or b.
>>> 4) Free(a)
>>>   - this would attempt to remove one 2MB entry from PA-VA table. But, 
>>> 'b' is already valid. Access to get_pa(VA(b)) should return me the 
>>> PA(b).
>>>   - 'len' is not even used as the entry in PA-VA table is of a fixed 
>>> size.
>>
>> Just to add to this:
>> - if talking about the mem_event callback, it definitely won't be a 
>> case where same page is still being served under another rte_malloc
>> - But, calls can come to delete from users of PA-VA table based on 
>> their own rte_free().
>>
>> And, your comment makes me think - I should probably del entry from 
>> the table only when mem_event callback is received.
> 
> Mem events are not triggered on rte_free(), they're triggered on page 
> deallocation. A call to rte_free/rte_memzone_free/rte_mempool_free etc. 
> *might* trigger a page deallocation, but *only* if the memory area being 
> freed encompasses an entire page. If you rte_malloc() 64 bytes and then 
> rte_free() those 64 bytes, you won't get a mem event *unless* these were 
> the only 64 bytes allocated on a particular page, and the entire page is 
> no longer used by anything else.

My understanding is same.
But, it seems my explanation wasn't well written:

For a rte_free(), I am not expecting that mem_event is raised - but, the 
caller of rte_free() (the eth or crypto drivers, or applications) may 
call the PA-VA table del function to remove the entries.

This voluntary delete of table entry from the drivers or applications 
using PA-VA calling del of PA-VA table - is not correct.

The path from mem_event callback clearing the PA-VA table entry is 
correct (which I removed in v2) - that time the page (len) would 
definitely not be used by anyone and can be removed from PA-VA table. 
And, yes, I agree that mem-event may not be on an rte_free().

> 
>>
>>>
>>> In the above, (3) is an assumption I am making based on my 
>>> understanding how mem allocator is working. Is that wrong?
>>>
>>> Basically, this is a restriction of this table - it has a min chunk 
>>> of 2MB - even for 1G hugepages - and hence, it is not possible to 
>>> honor deletes. I know this is convoluted logic - but, this keeps it 
>>> simple and use-able without much performance impact.
>>>
>>> [...]
>>>
>>
>>
> 
>
  
Burakov, Anatoly Oct. 11, 2018, 10:46 a.m. UTC | #14
On 11-Oct-18 11:39 AM, Shreyansh Jain wrote:
> On Thursday 11 October 2018 03:43 PM, Burakov, Anatoly wrote:
>> On 11-Oct-18 11:07 AM, Shreyansh Jain wrote:
>>> On Thursday 11 October 2018 03:32 PM, Shreyansh Jain wrote:
>>>> On Thursday 11 October 2018 02:33 PM, Burakov, Anatoly wrote:
>>>>> On 09-Oct-18 11:45 AM, Shreyansh Jain wrote:
>>>>>> On Tuesday 25 September 2018 07:09 PM, Shreyansh Jain wrote:
>>>>>>> Hello Anatoly,
>>>>>>>
>>>>>>> On Tuesday 25 September 2018 06:58 PM, Burakov, Anatoly wrote:
>>>>>>>> On 25-Sep-18 1:54 PM, Shreyansh Jain wrote:
>>>>>>>>> A common library, valid for dpaaX drivers, which is used to 
>>>>>>>>> maintain
>>>>>>>>> a local copy of PA->VA translations.
>>>>>>>>>
>>>>>>>>> In case of physical addressing mode (one of the option for 
>>>>>>>>> FSLMC, and
>>>>>>>>> only option for DPAA bus), the addresses of descriptors Rx'd are
>>>>>>>>> physical. These need to be converted into equivalent VA for 
>>>>>>>>> rte_mbuf
>>>>>>>>> and other similar calls.
>>>>>>>>>
>>>>>>>>> Using the rte_mem_virt2iova or rte_mem_virt2phy is expensive. This
>>>>>>>>> library is an attempt to reduce the overall cost associated with
>>>>>>>>> this translation.
>>>>>>>>>
>>>>>>>>> A small table is maintained, containing continuous entries
>>>>>>>>> representing a continguous physical range. Each of these entries
>>>>>>>>> stores the equivalent VA, which is fed during mempool creation, or
>>>>>>>>> memory allocation/deallocation callbacks.
>>>>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Also, a couple of nitpicks below.
>>>>>>>>
>>>>>>>>>   cosnfig/common_base                            |   5 +
>>>>>>>>>   config/common_linuxapp                        |   5 +
>>>>>>>>>   drivers/common/Makefile                       |   4 +
>>>>>>>>>   drivers/common/dpaax/Makefile                 |  31 ++
>>>>>>>>>   drivers/common/dpaax/dpaax_iova_table.c       | 509 
>>>>>>>>> ++++++++++++++++++
>>>>>>>>>   drivers/common/dpaax/dpaax_iova_table.h       | 104 ++++
>>>>>>>>>   drivers/common/dpaax/dpaax_logs.h             |  39 ++
>>>>>>>>>   drivers/common/dpaax/meson.build              |  12 +
>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>>> +    DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for 
>>>>>>>>> vaddr:(%p),"
>>>>>>>>> +            " phy(%"PRIu64"), len(%zu)", entry[i].start, 
>>>>>>>>> e_offset,
>>>>>>>>> +            vaddr, paddr, length);
>>>>>>>>> +    return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +int
>>>>>>>>> +dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
>>>>>>>>
>>>>>>>> len is not unused.
>>>>>>>
>>>>>>> I will fix this.
>>>>>>> Actually, this function itself is useless - more for symmetry 
>>>>>>> reason.
>>>>>>> Callers would be either simply updating the table, or ignoring it 
>>>>>>> completely. But, yes, this is indeed wrong that I set that unused.
>>>>>>>
>>>>>>
>>>>>> Actually, I was wrong in my first reply. In case of 
>>>>>> dpaax_iova_table_del(), len is indeed redundant. This is because 
>>>>>> the mapping is for a complete page (min of 2MB size), even if the 
>>>>>> request is for lesser length. So, removal of a single entry (of 
>>>>>> fixed size) would be done.
>>>>>>
>>>>>> In fact, while on this, I think deleting a PA->VA entry itself is 
>>>>>> incorrect (not just useless). A single entry (~2MB equivalent) can 
>>>>>> represent multiple users (working on a rte_malloc'd area, for 
>>>>>> example). So, effectively, its always an update - not an add or del.
>>>>>
>>>>> I'm not sure what you mean here. If you got a mem event about 
>>>>> memory area being freed, it's guaranteed to *not* have any users - 
>>>>> neither malloc, nor any other memory. And len is always page-aligned.
>>>>
>>>> ok. Maybe I am getting this wrong, but consider this:
>>>>
>>>> 1) hugepage size=2MB
>>>> 2) a = malloc(1M)
>>>>    this will pin an entry in table for a block starting at VA=(a) 
>>>> and PA=(a'). Each entry is of 2MB length - that means, even if 
>>>> someone were to access a+1048577 for an equivalent PA, they would 
>>>> get it (though, that is a incorrect access).
>>>> 3) b = malloc(1M)
>>>>    this *might* lead to a case where same 2MB page is used and 
>>>> VA=(b==(a+1MB)). Being hugepage backed, PA=(b=PA(a)+1M).
>>>> = After b, the PA-VA table has a single entry of 2MB, representing 
>>>> two mallocs. It can be used for translation for any thread 
>>>> requesting PAs of a or b.
>>>> 4) Free(a)
>>>>   - this would attempt to remove one 2MB entry from PA-VA table. 
>>>> But, 'b' is already valid. Access to get_pa(VA(b)) should return me 
>>>> the PA(b).
>>>>   - 'len' is not even used as the entry in PA-VA table is of a fixed 
>>>> size.
>>>
>>> Just to add to this:
>>> - if talking about the mem_event callback, it definitely won't be a 
>>> case where same page is still being served under another rte_malloc
>>> - But, calls can come to delete from users of PA-VA table based on 
>>> their own rte_free().
>>>
>>> And, your comment makes me think - I should probably del entry from 
>>> the table only when mem_event callback is received.
>>
>> Mem events are not triggered on rte_free(), they're triggered on page 
>> deallocation. A call to rte_free/rte_memzone_free/rte_mempool_free 
>> etc. *might* trigger a page deallocation, but *only* if the memory 
>> area being freed encompasses an entire page. If you rte_malloc() 64 
>> bytes and then rte_free() those 64 bytes, you won't get a mem event 
>> *unless* these were the only 64 bytes allocated on a particular page, 
>> and the entire page is no longer used by anything else.
> 
> My understanding is same.
> But, it seems my explanation wasn't well written:
> 
> For a rte_free(), I am not expecting that mem_event is raised - but, the 
> caller of rte_free() (the eth or crypto drivers, or applications) may 
> call the PA-VA table del function to remove the entries.
> 
> This voluntary delete of table entry from the drivers or applications 
> using PA-VA calling del of PA-VA table - is not correct.

Yes, so it appears :)
  

Patch

diff --git a/config/common_base b/config/common_base
index 6eb65ba4e..ba3dd05bc 100644
--- a/config/common_base
+++ b/config/common_base
@@ -138,6 +138,11 @@  CONFIG_RTE_ETHDEV_PROFILE_WITH_VTUNE=n
 #
 CONFIG_RTE_ETHDEV_TX_PREPARE_NOOP=n
 
+#
+# Common libraries, before Bus/PMDs
+#
+CONFIG_RTE_LIBRTE_COMMON_DPAAX=n
+
 #
 # Compile the Intel FPGA bus
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 9c5ea9d89..57a847f3e 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -29,6 +29,11 @@  CONFIG_RTE_PROC_INFO=y
 CONFIG_RTE_LIBRTE_VMBUS=y
 CONFIG_RTE_LIBRTE_NETVSC_PMD=y
 
+#
+# Common libraries, before Bus/PMDs
+#
+CONFIG_RTE_LIBRTE_COMMON_DPAAX=y
+
 # NXP DPAA BUS and drivers
 CONFIG_RTE_LIBRTE_DPAA_BUS=y
 CONFIG_RTE_LIBRTE_DPAA_MEMPOOL=y
diff --git a/drivers/common/Makefile b/drivers/common/Makefile
index 5f72da0ed..1a5a6706c 100644
--- a/drivers/common/Makefile
+++ b/drivers/common/Makefile
@@ -12,4 +12,8 @@  ifeq ($(CONFIG_RTE_LIBRTE_MVPP2_PMD),y)
 DIRS-y += mvep
 endif
 
+ifeq ($(CONFIG_RTE_LIBRTE_COMMON_DPAAX),y)
+DIRS-y += dpaax
+endif
+
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/drivers/common/dpaax/Makefile b/drivers/common/dpaax/Makefile
new file mode 100644
index 000000000..94d2cf0ce
--- /dev/null
+++ b/drivers/common/dpaax/Makefile
@@ -0,0 +1,31 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright 2018 NXP
+#
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_common_dpaax.a
+
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# versioning export map
+EXPORT_MAP := rte_common_dpaax_version.map
+
+# library version
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-y += dpaax_iova_table.c
+
+LDLIBS += -lrte_eal
+
+SYMLINK-y-include += dpaax_iova_table.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
\ No newline at end of file
diff --git a/drivers/common/dpaax/dpaax_iova_table.c b/drivers/common/dpaax/dpaax_iova_table.c
new file mode 100644
index 000000000..73acd1646
--- /dev/null
+++ b/drivers/common/dpaax/dpaax_iova_table.c
@@ -0,0 +1,509 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 NXP
+ */
+
+#include <rte_memory.h>
+
+#include "dpaax_iova_table.h"
+#include "dpaax_logs.h"
+
+/* Global dpaax logger identifier */
+int dpaax_logger;
+
+/* Global table reference */
+struct dpaax_iova_table *dpaax_iova_table_p;
+
+static int dpaax_handle_memevents(void);
+
+/* A structure representing the device-tree node available in /proc/device-tree.
+ */
+struct reg_node {
+	phys_addr_t addr;
+	size_t len;
+};
+
+/* A ntohll equivalent routine
+ * XXX: This is only applicable for 64 bit environment.
+ */
+static void
+rotate_8(unsigned char *arr)
+{
+	uint32_t temp;
+	uint32_t *first_half;
+	uint32_t *second_half;
+
+	first_half = (uint32_t *)(arr);
+	second_half = (uint32_t *)(arr + 4);
+
+	temp = *first_half;
+	*first_half = *second_half;
+	*second_half = temp;
+
+	*first_half = ntohl(*first_half);
+	*second_half = ntohl(*second_half);
+}
+
+/* read_memory_nodes
+ * Memory layout for DPAAx platforms (LS1043, LS1046, LS1088, LS2088, LX2160)
+ * are populated by Uboot and available in device tree:
+ * /proc/device-tree/memory@<address>/reg <= register.
+ * Entries are of the form:
+ *  (<8 byte start addr><8 byte length>)(..more similar blocks of start,len>)..
+ *
+ * @param count
+ *    OUT populate number of entries found in memory node
+ * @return
+ *    Pointer to array of reg_node elements, count size
+ */
+static struct reg_node *
+read_memory_node(unsigned int *count)
+{
+	int fd, ret, i;
+	unsigned int j;
+	glob_t result = {0};
+	struct stat statbuf = {0};
+	char file_data[MEM_NODE_FILE_LEN];
+	struct reg_node *nodes = NULL;
+
+	*count = 0;
+
+	ret = glob(MEM_NODE_PATH_GLOB, 0, NULL, &result);
+	if (ret != 0) {
+		DPAAX_ERR("Unable to glob device-tree memory node: (%s)(%d)",
+			   MEM_NODE_PATH_GLOB, ret);
+		goto out;
+	}
+
+	if (result.gl_pathc != 1) {
+		/* Either more than one memory@<addr> node found, or none.
+		 * In either case, cannot work ahead.
+		 */
+		DPAAX_ERR("Found (%zu) entries in device-tree. Not supported!",
+			  result.gl_pathc);
+		goto out;
+	}
+
+	DPAAX_DEBUG("Opening and parsing device-tree node: (%s)",
+		    result.gl_pathv[0]);
+	fd = open(result.gl_pathv[0], O_RDONLY);
+	if (fd < 0) {
+		DPAAX_ERR("Unable to open the device-tree node: (%s)(fd=%d)",
+			  MEM_NODE_PATH_GLOB, fd);
+		goto cleanup;
+	}
+
+	/* Stat to get the file size */
+	ret = fstat(fd, &statbuf);
+	if (ret != 0) {
+		DPAAX_ERR("Unable to get device-tree memory node size.");
+		goto cleanup;
+	}
+
+	DPAAX_DEBUG("Size of device-tree mem node: %lu", statbuf.st_size);
+	if (statbuf.st_size > MEM_NODE_FILE_LEN) {
+		DPAAX_WARN("More memory nodes available than assumed.");
+		DPAAX_WARN("System may not work properly!");
+	}
+
+	ret = read(fd, file_data, statbuf.st_size > MEM_NODE_FILE_LEN ?
+				  MEM_NODE_FILE_LEN : statbuf.st_size);
+	if (ret <= 0) {
+		DPAAX_ERR("Unable to read device-tree memory node: (%d)", ret);
+		goto cleanup;
+	}
+
+	/* The reg node should be multiple of 16 bytes, 8 bytes each for addr
+	 * and len.
+	 */
+	*count = (statbuf.st_size / 16);
+	if ((*count) <= 0 || (statbuf.st_size % 16 != 0)) {
+		DPAAX_ERR("Invalid memory node values or count. (size=%lu)",
+			  statbuf.st_size);
+		goto cleanup;
+	}
+
+	/* each entry is of 16 bytes, and size/16 is total count of entries */
+	nodes = malloc(sizeof(struct reg_node) * (*count));
+	if (!nodes) {
+		DPAAX_ERR("Failure in allocating working memory.");
+		goto cleanup;
+	}
+	memset(nodes, 0, sizeof(struct reg_node) * (*count));
+
+	for (i = 0, j = 0; i < (statbuf.st_size) && j < (*count); i += 16, j++) {
+		memcpy(&nodes[j], file_data + i, 16);
+		/* Rotate (ntohl) each 8 byte entry */
+		rotate_8((unsigned char *)(&(nodes[j].addr)));
+		rotate_8((unsigned char *)(&(nodes[j].len)));
+	}
+
+	DPAAX_DEBUG("Device-tree memory node data:");
+	do {
+		DPAAX_DEBUG("\n    %08" PRIx64 " %08zu", nodes[j].addr, nodes[j].len);
+	} while (--j);
+
+cleanup:
+	close(fd);
+	globfree(&result);
+out:
+	return nodes;
+}
+
+int
+dpaax_iova_table_populate(void)
+{
+	int ret;
+	unsigned int i, node_count;
+	size_t tot_memory_size, total_table_size;
+	struct reg_node *nodes;
+	struct dpaax_iovat_element *entry;
+
+	/* dpaax_iova_table_p is a singleton - only one instance should be
+	 * created.
+	 */
+	if (dpaax_iova_table_p) {
+		DPAAX_DEBUG("Multiple allocation attempt for IOVA Table (%p)",
+			    dpaax_iova_table_p);
+		/* This can be an error case as well - some path not cleaning
+		 * up table - but, for now, it is assumed that if IOVA Table
+		 * pointer is valid, table is allocated.
+		 */
+		return 0;
+	}
+
+	nodes = read_memory_node(&node_count);
+	if (nodes == NULL || node_count <= 0) {
+		DPAAX_WARN("PA->VA translation not available;");
+		DPAAX_WARN("Expect performance impact.");
+		return -1;
+	}
+
+	tot_memory_size = 0;
+	for (i = 0; i < node_count; i++)
+		tot_memory_size += nodes[i].len;
+
+	DPAAX_DEBUG("Total available PA memory size: %zu", tot_memory_size);
+
+	/* Total table size = meta data + tot_memory_size/8 */
+	total_table_size = sizeof(struct dpaax_iova_table) +
+			   (sizeof(struct dpaax_iovat_element) * node_count) +
+			   ((tot_memory_size / DPAAX_MEM_SPLIT) * sizeof(uint64_t));
+
+	/* TODO: This memory doesn't need to shared but needs to be always
+	 * pinned to RAM (no swap out) - using hugepage rather than malloc
+	 */
+	dpaax_iova_table_p = rte_zmalloc(NULL, total_table_size, 0);
+	if (dpaax_iova_table_p == NULL) {
+		DPAAX_WARN("Unable to allocate memory for PA->VA Table;");
+		DPAAX_WARN("PA->VA translation not available;");
+		DPAAX_WARN("Expect performance impact.");
+		free(nodes);
+		return -1;
+	}
+
+	/* Initialize table */
+	dpaax_iova_table_p->count = node_count;
+	entry = dpaax_iova_table_p->entries;
+
+	DPAAX_DEBUG("IOVA Table entries: (entry start = %p)", (void *)entry);
+	DPAAX_DEBUG("\t(entry),(start),(len),(next)");
+
+	for (i = 0; i < node_count; i++) {
+		/* dpaax_iova_table_p
+		 * |   dpaax_iova_table_p->entries
+		 * |      |
+		 * |      |
+		 * V      V
+		 * +------+------+-------+---+----------+---------+---
+		 * |iova_ |entry | entry |   | pages    | pages   |
+		 * |table | 1    |  2    |...| entry 1  | entry2  |
+		 * +-----'+.-----+-------+---+;---------+;--------+---
+		 *         \      \          /          /
+		 *          `~~~~~~|~~~~~>pages        /
+		 *                  \                 /
+		 *                   `~~~~~~~~~~~>pages
+		 */
+		entry[i].start = nodes[i].addr;
+		entry[i].len = nodes[i].len;
+		if (i > 0)
+			entry[i].pages = entry[i-1].pages +
+				((entry[i-1].len/DPAAX_MEM_SPLIT));
+		else
+			entry[i].pages = (uint64_t *)((unsigned char *)entry +
+					 (sizeof(struct dpaax_iovat_element) *
+					 node_count));
+
+		DPAAX_DEBUG("\t(%u),(%8"PRIx64"),(%8zu),(%8p)",
+			    i, entry[i].start, entry[i].len, entry[i].pages);
+	}
+
+	/* Release memory associated with nodes array - not required now */
+	free(nodes);
+
+	DPAAX_DEBUG("Adding mem-event handler\n");
+	ret = dpaax_handle_memevents();
+	if (ret) {
+		DPAAX_ERR("Unable to add mem-event handler");
+		DPAAX_WARN("Cases with non-buffer pool mem won't work!");
+	}
+
+	return 0;
+}
+
+void
+dpaax_iova_table_depopulate(void)
+{
+	if (dpaax_iova_table_p == NULL)
+		return;
+
+	rte_free(dpaax_iova_table_p->entries);
+	dpaax_iova_table_p = NULL;
+
+	DPAAX_DEBUG("IOVA Table cleanedup");
+}
+
+int
+dpaax_iova_table_add(phys_addr_t paddr, void *vaddr, size_t length)
+{
+	int found = 0;
+	unsigned int i;
+	size_t req_length = length, e_offset;
+	struct dpaax_iovat_element *entry;
+	uintptr_t align_vaddr;
+	phys_addr_t align_paddr;
+
+	align_paddr = paddr & DPAAX_MEM_SPLIT_MASK;
+	align_vaddr = ((uintptr_t)vaddr & DPAAX_MEM_SPLIT_MASK);
+
+	/* Check if paddr is available in table */
+	entry = dpaax_iova_table_p->entries;
+	for (i = 0; i < dpaax_iova_table_p->count; i++) {
+		if (align_paddr < entry[i].start) {
+			/* Address lower than start, but not found in previous
+			 * iteration shouldn't exist.
+			 */
+			DPAAX_ERR("Add: Incorrect entry for PA->VA Table"
+				  "(%"PRIu64")", paddr);
+			DPAAX_ERR("Add: Lowest address: %"PRIu64"",
+				  entry[i].start);
+			return -1;
+		}
+
+		if (align_paddr > (entry[i].start + entry[i].len))
+			continue;
+
+		/* align_paddr >= start && align_paddr < (start + len) */
+		found = 1;
+
+		do {
+			e_offset = ((align_paddr - entry[i].start) / DPAAX_MEM_SPLIT);
+			/* TODO: Whatif something already exists at this
+			 * location - is that an error? For now, ignoring the
+			 * case.
+			 */
+			entry[i].pages[e_offset] = align_vaddr;
+			DPAAX_DEBUG("Added: vaddr=%zu for Phy:%"PRIu64" at %zu"
+				    " remaining len %zu", align_vaddr,
+				    align_paddr, e_offset, req_length);
+
+			/* Incoming request can be larger than the
+			 * DPAAX_MEM_SPLIT size - in which case, multiple
+			 * entries in entry->pages[] are filled up.
+			 */
+			if (req_length <= DPAAX_MEM_SPLIT)
+				break;
+			align_paddr += DPAAX_MEM_SPLIT;
+			align_vaddr += DPAAX_MEM_SPLIT;
+			req_length -= DPAAX_MEM_SPLIT;
+		} while (1);
+
+		break;
+	}
+
+	if (!found) {
+		/* There might be case where the incoming physical address is
+		 * beyond the address discovered in the memory node of
+		 * device-tree. Specially if some malloc'd area is used by EAL
+		 * and the memevent handlers passes that across. But, this is
+		 * not necessarily an error.
+		 */
+		DPAAX_DEBUG("Add: Unable to find slot for vaddr:(%p),"
+			    " phy(%"PRIu64")",
+			    vaddr, paddr);
+		return -1;
+	}
+
+	DPAAX_DEBUG("Add: Found slot at (%"PRIu64")[(%zu)] for vaddr:(%p),"
+		    " phy(%"PRIu64"), len(%zu)", entry[i].start, e_offset,
+		    vaddr, paddr, length);
+	return 0;
+}
+
+int
+dpaax_iova_table_del(phys_addr_t paddr, size_t len __rte_unused)
+{
+	int found = 0;
+	unsigned int i;
+	size_t e_offset;
+	struct dpaax_iovat_element *entry;
+	phys_addr_t align_paddr;
+
+	align_paddr = paddr & DPAAX_MEM_SPLIT_MASK;
+
+	/* Check if paddr is available in table */
+	entry = dpaax_iova_table_p->entries;
+	for (i = 0; i < dpaax_iova_table_p->count; i++) {
+		if (align_paddr < entry[i].start) {
+			/* Address lower than start, but not found in previous
+			 * iteration shouldn't exist.
+			 */
+			DPAAX_ERR("Del: Incorrect entry for PA->VA Table "
+				  "(%"PRIu64")", paddr);
+			DPAAX_ERR("Del: Lowest address: %"PRIu64,
+				  entry[i].start);
+			return -1;
+		}
+
+		if (align_paddr > (entry[i].start + entry[i].len))
+			continue;
+
+		/* align_paddr >= start && align_paddr < (start + len) */
+		found = 1;
+		e_offset = (align_paddr / DPAAX_MEM_SPLIT);
+		entry->pages[e_offset] = 0;
+
+		/* Addition might have populated multiple entries, but removal
+		 * won't do that. Someone might be using internal entries.
+		 * Removal is essentially a dummy - maynot be ever required
+		 */
+		break;
+	}
+
+	if (!found) {
+		DPAAX_WARN("Del: Unable to find slot for phy(%"PRIu64")",
+			   paddr);
+		return -1;
+	}
+
+	DPAAX_DEBUG("Del: Found slot at (%"PRIu64")[(%zu)] for phy(%"PRIu64")",
+		    entry[i].start, e_offset, paddr);
+	return 0;
+}
+
+/* dpaax_iova_table_dump
+ * Dump the table, with its entries, on screen. Only works in Debug Mode
+ * Not for weak hearted - the tables can get quite large
+ */
+void
+dpaax_iova_table_dump(void)
+{
+	unsigned int i, j;
+	struct dpaax_iovat_element *entry;
+
+	/* In case DEBUG is not enabled, some 'if' conditions might misbehave
+	 * as they have nothing else in them  except a DPAAX_DEBUG() which if
+	 * tuned out would leave 'if' naked.
+	 */
+	if (rte_log_get_global_level() < RTE_LOG_DEBUG) {
+		DPAAX_ERR("Set log level to Debug for PA->Table dump!");
+		return;
+	}
+
+	DPAAX_DEBUG(" === Start of PA->VA Translation Table ===");
+	if (dpaax_iova_table_p == NULL)
+		DPAAX_DEBUG("\tNULL");
+
+	entry = dpaax_iova_table_p->entries;
+	for (i = 0; i < dpaax_iova_table_p->count; i++) {
+		DPAAX_DEBUG("\t(%16i),(%16"PRIu64"),(%16zu),(%16p)",
+			    i, entry[i].start, entry[i].len, entry[i].pages);
+		DPAAX_DEBUG("\t\t          (PA),          (VA)");
+		for (j = 0; j < (entry->len/DPAAX_MEM_SPLIT); j++) {
+			if (entry[i].pages[j] == 0)
+				continue;
+			DPAAX_DEBUG("\t\t(%16"PRIx64"),(%16"PRIx64")",
+				    (entry[i].start + (j * sizeof(uint64_t))),
+				    entry[i].pages[j]);
+		}
+	}
+	DPAAX_DEBUG(" === End of PA->VA Translation Table ===");
+}
+
+static void
+dpaax_memevent_cb(enum rte_mem_event type, const void *addr, size_t len,
+		  void *arg __rte_unused)
+{
+	struct rte_memseg_list *msl;
+	struct rte_memseg *ms;
+	size_t cur_len = 0, map_len = 0;
+	phys_addr_t phys_addr;
+	void *virt_addr;
+	int ret;
+
+	DPAAX_DEBUG("Called with addr=%p, len=%zu", addr, len);
+
+	msl = rte_mem_virt2memseg_list(addr);
+
+	while (cur_len < len) {
+		const void *va = RTE_PTR_ADD(addr, cur_len);
+
+		ms = rte_mem_virt2memseg(va, msl);
+		phys_addr = rte_mem_virt2phy(ms->addr);
+		virt_addr = ms->addr;
+		map_len = ms->len;
+
+		DPAAX_DEBUG("Request for %s, va=%p, virt_addr=%p,"
+			    "iova=%"PRIu64", map_len=%zu",
+			    type == RTE_MEM_EVENT_ALLOC ?
+			    "alloc" : "dealloc",
+			    va, virt_addr, phys_addr, map_len);
+
+		if (type == RTE_MEM_EVENT_ALLOC)
+			ret = dpaax_iova_table_add(phys_addr, virt_addr,
+						   map_len);
+		else
+			ret = dpaax_iova_table_del(phys_addr, map_len);
+
+		if (ret != 0) {
+			DPAAX_ERR("PA-Table entry update failed. "
+				  "Map=%d, addr=%p, len=%zu, err:(%d)",
+				  type, va, map_len, ret);
+			return;
+		}
+
+		cur_len += map_len;
+	}
+}
+
+static int
+dpaax_memevent_walk_memsegs(const struct rte_memseg_list *msl __rte_unused,
+			    const struct rte_memseg *ms, size_t len,
+			    void *arg __rte_unused)
+{
+	DPAAX_DEBUG("Walking for %p (pa=%"PRIu64") and len %zu",
+		    ms->addr, ms->phys_addr, len);
+	dpaax_iova_table_add(rte_mem_virt2phy(ms->addr), ms->addr, len);
+	return 0;
+}
+
+static int
+dpaax_handle_memevents(void)
+{
+	/* First, walk through all memsegs and pin them, before installing
+	 * handler. This assures that all memseg which have already been
+	 * identified/allocated by EAL, are already part of PA->VA Table. This
+	 * is especially for cases where application allocates memory before
+	 * the EAL or this is an externally allocated memory passed to EAL.
+	 */
+	rte_memseg_contig_walk_thread_unsafe(dpaax_memevent_walk_memsegs, NULL);
+
+	return rte_mem_event_callback_register("dpaax_memevents_cb",
+					       dpaax_memevent_cb, NULL);
+}
+
+RTE_INIT(dpaax_log)
+{
+	dpaax_logger = rte_log_register("pmd.common.dpaax");
+	if (dpaax_logger >= 0)
+		rte_log_set_level(dpaax_logger, RTE_LOG_NOTICE);
+}
diff --git a/drivers/common/dpaax/dpaax_iova_table.h b/drivers/common/dpaax/dpaax_iova_table.h
new file mode 100644
index 000000000..056e4e0a1
--- /dev/null
+++ b/drivers/common/dpaax/dpaax_iova_table.h
@@ -0,0 +1,104 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 NXP
+ */
+
+#ifndef _DPAAX_IOVA_TABLE_H_
+#define _DPAAX_IOVA_TABLE_H_
+
+#include <unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <string.h>
+#include <stdlib.h>
+#include <inttypes.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <glob.h>
+#include <errno.h>
+#include <arpa/inet.h>
+
+#include <rte_eal.h>
+#include <rte_branch_prediction.h>
+#include <rte_memory.h>
+#include <rte_malloc.h>
+
+struct dpaax_iovat_element {
+	phys_addr_t start; /**< Start address of block of physical pages */
+	size_t len; /**< Difference of end-start for quick access */
+	uint64_t *pages; /**< VA for each physical page in this block */
+};
+
+struct dpaax_iova_table {
+	unsigned int count; /**< No. of blocks of contiguous physical pages */
+	struct dpaax_iovat_element entries[0];
+};
+
+/* Pointer to the table, which is common for DPAA/DPAA2 and only a single
+ * instance is required across net/crypto/event drivers. This table is
+ * populated iff devices are found on the bus.
+ */
+extern struct dpaax_iova_table *dpaax_iova_table_p;
+
+/* Device tree file for memory layout is named 'memory@<addr>' where the 'addr'
+ * is SoC dependent, or even Uboot fixup dependent.
+ */
+#define MEM_NODE_PATH_GLOB "/proc/device-tree/memory[@0-9]*/reg"
+/* Device file should be multiple of 16 bytes, each containing 8 byte of addr
+ * and its length. Assuming max of 5 entries.
+ */
+#define MEM_NODE_FILE_LEN ((16 * 5) + 1)
+
+/* Table is made up of DPAAX_MEM_SPLIT elements for each contiguous zone. This
+ * helps avoid separate handling for cases where more than one size of hugepage
+ * is supported.
+ */
+#define DPAAX_MEM_SPLIT (1<<21)
+#define DPAAX_MEM_SPLIT_MASK ~(DPAAX_MEM_SPLIT - 1) /**< Floor aligned */
+#define DPAAX_MEM_SPLIT_MASK_OFF (DPAAX_MEM_SPLIT - 1) /**< Offset */
+
+/* APIs exposed */
+int dpaax_iova_table_populate(void);
+void dpaax_iova_table_depopulate(void);
+int dpaax_iova_table_add(phys_addr_t paddr, void *vaddr, size_t length);
+int dpaax_iova_table_del(phys_addr_t paddr, size_t len);
+void dpaax_iova_table_dump(void);
+
+static inline void *dpaax_iova_table_get_va(phys_addr_t paddr) __attribute__((hot));
+
+static inline void *
+dpaax_iova_table_get_va(phys_addr_t paddr) {
+	unsigned int i = 0, index;
+	void *vaddr = 0;
+	phys_addr_t paddr_align = paddr & DPAAX_MEM_SPLIT_MASK;
+	size_t offset = paddr & DPAAX_MEM_SPLIT_MASK_OFF;
+	struct dpaax_iovat_element *entry;
+
+	entry = dpaax_iova_table_p->entries;
+
+	do {
+		if (unlikely(i > dpaax_iova_table_p->count))
+			break;
+
+		if (paddr_align < entry[i].start) {
+			/* Incorrect paddr; Not in memory range */
+			return 0;
+		}
+
+		if (paddr_align > (entry[i].start + entry[i].len)) {
+			i++;
+			continue;
+		}
+
+		/* paddr > entry->start && paddr <= entry->(start+len) */
+		index = (paddr_align - entry[i].start)/DPAAX_MEM_SPLIT;
+		vaddr = (void *)((uintptr_t)entry[i].pages[index] + offset);
+		break;
+	} while (1);
+
+	return vaddr;
+}
+
+#endif /* _DPAAX_IOVA_TABLE_H_ */
diff --git a/drivers/common/dpaax/dpaax_logs.h b/drivers/common/dpaax/dpaax_logs.h
new file mode 100644
index 000000000..bf1b27cc1
--- /dev/null
+++ b/drivers/common/dpaax/dpaax_logs.h
@@ -0,0 +1,39 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2018 NXP
+ */
+
+#ifndef _DPAAX_LOGS_H_
+#define _DPAAX_LOGS_H_
+
+#include <rte_log.h>
+
+extern int dpaax_logger;
+
+#define DPAAX_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, dpaax_logger, "dpaax: " fmt "\n", \
+		##args)
+
+/* Debug logs are with Function names */
+#define DPAAX_DEBUG(fmt, args...) \
+	rte_log(RTE_LOG_DEBUG, dpaax_logger, "dpaax: %s():	 " fmt "\n", \
+		__func__, ##args)
+
+#define DPAAX_INFO(fmt, args...) \
+	DPAAX_LOG(INFO, fmt, ## args)
+#define DPAAX_ERR(fmt, args...) \
+	DPAAX_LOG(ERR, fmt, ## args)
+#define DPAAX_WARN(fmt, args...) \
+	DPAAX_LOG(WARNING, fmt, ## args)
+
+/* DP Logs, toggled out at compile time if level lower than current level */
+#define DPAAX_DP_LOG(level, fmt, args...) \
+	RTE_LOG_DP(level, PMD, fmt, ## args)
+
+#define DPAAX_DP_DEBUG(fmt, args...) \
+	DPAAX_DP_LOG(DEBUG, fmt, ## args)
+#define DPAAX_DP_INFO(fmt, args...) \
+	DPAAX_DP_LOG(INFO, fmt, ## args)
+#define DPAAX_DP_WARN(fmt, args...) \
+	DPAAX_DP_LOG(WARNING, fmt, ## args)
+
+#endif /* _DPAAX_LOGS_H_ */
diff --git a/drivers/common/dpaax/meson.build b/drivers/common/dpaax/meson.build
new file mode 100644
index 000000000..98a1bdd48
--- /dev/null
+++ b/drivers/common/dpaax/meson.build
@@ -0,0 +1,12 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 NXP
+
+allow_experimental_apis = true
+
+if host_machine.system() != 'linux'
+        build = false
+endif
+
+sources = files('dpaax_iova_table.c')
+
+cflags += ['-D_GNU_SOURCE']
diff --git a/drivers/common/dpaax/rte_common_dpaax_version.map b/drivers/common/dpaax/rte_common_dpaax_version.map
new file mode 100644
index 000000000..6c0efde20
--- /dev/null
+++ b/drivers/common/dpaax/rte_common_dpaax_version.map
@@ -0,0 +1,12 @@ 
+DPDK_18.11 {
+	global:
+
+	dpaax_iova_table_add;
+	dpaax_iova_table_del;
+	dpaax_iova_table_depopulate;
+	dpaax_iova_table_dump;
+	dpaax_iova_table_p;
+	dpaax_iova_table_populate;
+
+	local: *;
+};
\ No newline at end of file
diff --git a/drivers/common/meson.build b/drivers/common/meson.build
index f828ce7f7..0257d4d2b 100644
--- a/drivers/common/meson.build
+++ b/drivers/common/meson.build
@@ -2,6 +2,6 @@ 
 # Copyright(c) 2018 Cavium, Inc
 
 std_deps = ['eal']
-drivers = ['mvep', 'octeontx', 'qat']
+drivers = ['dpaax', 'mvep', 'octeontx', 'qat']
 config_flag_fmt = 'RTE_LIBRTE_@0@_COMMON'
 driver_name_fmt = 'rte_common_@0@'