[v7,1/4] mempool: modify mempool populate() to skip objects from page boundaries

Message ID 20190717090408.13717-2-vattunuru@marvell.com (mailing list archive)
State Superseded, archived
Series kni: add IOVA=VA support

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Performance-Testing fail build patch failure
ci/Intel-compilation fail Compilation issues

Commit Message

Vamsi Krishna Attunuru July 17, 2019, 9:04 a.m. UTC
  From: Vamsi Attunuru <vattunuru@marvell.com>

Currently the physical address range of a mempool object populated by the
default mempool populate() routine may not be contiguous within that mbuf's
range, because an object can be laid out across a page boundary.

The patch ensures that each object's physical address range is contiguous by
modifying the default behaviour of mempool populate() to prevent objects from
spanning two pages, except if the object size is bigger than the page size.

Since the overhead of this modification is very small considering hugepage
sizes of 512MB and 1GB, the default behaviour is changed, except for object
sizes bigger than the page size.
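For example, assuming a typical ~2.5KB mbuf element, at most one straddling
object is skipped per page, i.e. about 2.5KB wasted per 2MB page (~0.12%),
and proportionally far less with 512MB or 1GB pages.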

Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
---
 lib/librte_mempool/rte_mempool.c             |  2 +-
 lib/librte_mempool/rte_mempool_ops_default.c | 33 ++++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 3 deletions(-)
  

Comments

Andrew Rybchenko July 17, 2019, 1:36 p.m. UTC | #1
On 7/17/19 12:04 PM, vattunuru@marvell.com wrote:
> From: Vamsi Attunuru <vattunuru@marvell.com>
>
> Currently the phys address of a mempool object populated by the mempool
> populate default() routine may not be contiguous with in that mbuf range.
>
> Patch ensures that each object's phys address is contiguous by modifying
> default behaviour of mempool populate() to prevent objects from being
> across 2 pages, expect if the size of object is bigger than size of page.
>
> Since the overhead after this modification will be very minimal considering
> the hugepage sizes of 512M & 1G, default behaviour is modified except for
> the object sizes bigger than the page size.
>
> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>

NACK

Looking at MEMPOOL_F_NO_IOVA_CONTIG description I don't
understand why the patch is necessary at all. So, I'd like to
know more. Exact conditions, IOVA mode, hugepage sizes,
mempool flags and how it is populated.

> ---
>   lib/librte_mempool/rte_mempool.c             |  2 +-
>   lib/librte_mempool/rte_mempool_ops_default.c | 33 ++++++++++++++++++++++++++--
>   2 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
> index 7260ce0..1c48325 100644
> --- a/lib/librte_mempool/rte_mempool.c
> +++ b/lib/librte_mempool/rte_mempool.c
> @@ -339,7 +339,7 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
>   	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
>   		(char *)vaddr + off,
>   		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
> -		len - off, mempool_add_elem, NULL);
> +		len - off, mempool_add_elem, opaque);

The last argument is the callback's opaque value. mempool_add_elem()
does not use the opaque, but it is still incorrect to use it for other
purposes and to require it to be the memzone.
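
To spell out why, a minimal hedged sketch of the contract follows (the
my_ctx/my_chunk_free names are hypothetical, not from DPDK or this patch):
the opaque belongs to whatever free callback the caller registers, so the
populate path cannot assume it is a memzone.

#include <stdlib.h>

#include <rte_common.h>
#include <rte_mempool.h>

/* hypothetical caller-owned context attached to a populated chunk */
struct my_ctx {
	void *base;	/* e.g. an externally allocated buffer */
};

/* free callback matching rte_mempool_memchunk_free_cb_t; its opaque is a
 * struct my_ctx, not a struct rte_memzone */
static void
my_chunk_free(struct rte_mempool_memhdr *memhdr __rte_unused, void *opaque)
{
	struct my_ctx *ctx = opaque;

	free(ctx->base);
	free(ctx);
}

A caller passing my_chunk_free and its ctx to rte_mempool_populate_iova()
would have that context silently reinterpreted as a memzone by the patched
rte_mempool_op_populate_default().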

>   	/* not enough room to store one object */
>   	if (i == 0) {
> diff --git a/lib/librte_mempool/rte_mempool_ops_default.c b/lib/librte_mempool/rte_mempool_ops_default.c
> index 4e2bfc8..85da264 100644
> --- a/lib/librte_mempool/rte_mempool_ops_default.c
> +++ b/lib/librte_mempool/rte_mempool_ops_default.c
> @@ -45,19 +45,48 @@ rte_mempool_op_calc_mem_size_default(const struct rte_mempool *mp,
>   	return mem_size;
>   }
>   
> +/* Returns -1 if object falls on a page boundary, else returns 0 */
> +static inline int
> +mempool_check_obj_bounds(void *obj, uint64_t hugepage_sz, size_t elt_sz)
> +{
> +	uintptr_t page_end, elt_addr = (uintptr_t)obj;
> +	uint32_t pg_shift = rte_bsf32(hugepage_sz);
> +	uint64_t page_mask;
> +
> +	page_mask =  ~((1ull << pg_shift) - 1);
> +	page_end = (elt_addr & page_mask) + hugepage_sz;
> +
> +	if (elt_addr + elt_sz > page_end)
> +		return -1;
> +
> +	return 0;
> +}
> +
>   int
>   rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned int max_objs,
>   		void *vaddr, rte_iova_t iova, size_t len,
>   		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
>   {
> -	size_t total_elt_sz;
> -	size_t off;
> +	struct rte_memzone *mz = obj_cb_arg;
> +	size_t total_elt_sz, off;

Why are the two variable declarations combined into one line here? It is an
unrelated change.

>   	unsigned int i;
>   	void *obj;
>   
>   	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;
>   
>   	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs; i++) {
> +
> +		/* Skip page boundary check if element is bigger than page */
> +		if (mz->hugepage_sz >= total_elt_sz) {
> +			if (mempool_check_obj_bounds((char *)vaddr + off,
> +						    mz->hugepage_sz,
> +						    total_elt_sz) < 0) {
> +				i--; /* Decrement count & skip this obj */
> +				off += total_elt_sz;
> +				continue;
> +			}
> +		}
> +

What I don't like here is that it makes one memory chunk insufficient
to populate all objects. I.e. we calculated the required memory chunk size,
but then skip some objects. Maybe it is not a problem, but it breaks the
logic existing in the code.

>   		off += mp->header_size;
>   		obj = (char *)vaddr + off;
>   		obj_cb(mp, obj_cb_arg, obj,
  
Olivier Matz July 17, 2019, 1:47 p.m. UTC | #2
On Wed, Jul 17, 2019 at 04:36:37PM +0300, Andrew Rybchenko wrote:
> On 7/17/19 12:04 PM, vattunuru@marvell.com wrote:
> > From: Vamsi Attunuru <vattunuru@marvell.com>
> > 
> > Currently the phys address of a mempool object populated by the mempool
> > populate default() routine may not be contiguous with in that mbuf range.
> > 
> > Patch ensures that each object's phys address is contiguous by modifying
> > default behaviour of mempool populate() to prevent objects from being
> > across 2 pages, expect if the size of object is bigger than size of page.
> > 
> > Since the overhead after this modification will be very minimal considering
> > the hugepage sizes of 512M & 1G, default behaviour is modified except for
> > the object sizes bigger than the page size.
> > 
> > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
> 
> NACK
> 
> Looking at MEMPOOL_F_NO_IOVA_CONTIG description I don't
> understand why the patch is necessary at all. So, I'd like to
> know more. Exact conditions, IOVA mode, hugepage sizes,
> mempool flags and how it is populated.
> 
> > ---
> >   lib/librte_mempool/rte_mempool.c             |  2 +-
> >   lib/librte_mempool/rte_mempool_ops_default.c | 33 ++++++++++++++++++++++++++--
> >   2 files changed, 32 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
> > index 7260ce0..1c48325 100644
> > --- a/lib/librte_mempool/rte_mempool.c
> > +++ b/lib/librte_mempool/rte_mempool.c
> > @@ -339,7 +339,7 @@ rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
> >   	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
> >   		(char *)vaddr + off,
> >   		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
> > -		len - off, mempool_add_elem, NULL);
> > +		len - off, mempool_add_elem, opaque);
> 
> The last argument is the callback opaque value. mempool_add_elem()
> does not use the opaque. But it is incorrect to use it for other purposes
> and require it to be memzone.
> 
> >   	/* not enough room to store one object */
> >   	if (i == 0) {
> > diff --git a/lib/librte_mempool/rte_mempool_ops_default.c b/lib/librte_mempool/rte_mempool_ops_default.c
> > index 4e2bfc8..85da264 100644
> > --- a/lib/librte_mempool/rte_mempool_ops_default.c
> > +++ b/lib/librte_mempool/rte_mempool_ops_default.c
> > @@ -45,19 +45,48 @@ rte_mempool_op_calc_mem_size_default(const struct rte_mempool *mp,
> >   	return mem_size;
> >   }
> > +/* Returns -1 if object falls on a page boundary, else returns 0 */
> > +static inline int
> > +mempool_check_obj_bounds(void *obj, uint64_t hugepage_sz, size_t elt_sz)
> > +{
> > +	uintptr_t page_end, elt_addr = (uintptr_t)obj;
> > +	uint32_t pg_shift = rte_bsf32(hugepage_sz);
> > +	uint64_t page_mask;
> > +
> > +	page_mask =  ~((1ull << pg_shift) - 1);
> > +	page_end = (elt_addr & page_mask) + hugepage_sz;
> > +
> > +	if (elt_addr + elt_sz > page_end)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> >   int
> >   rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned int max_objs,
> >   		void *vaddr, rte_iova_t iova, size_t len,
> >   		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
> >   {
> > -	size_t total_elt_sz;
> > -	size_t off;
> > +	struct rte_memzone *mz = obj_cb_arg;
> > +	size_t total_elt_sz, off;
> 
> Why two variables are combined into one here? It is unrelated change.
> 
> >   	unsigned int i;
> >   	void *obj;
> >   	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;
> >   	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs; i++) {
> > +
> > +		/* Skip page boundary check if element is bigger than page */
> > +		if (mz->hugepage_sz >= total_elt_sz) {
> > +			if (mempool_check_obj_bounds((char *)vaddr + off,
> > +						    mz->hugepage_sz,
> > +						    total_elt_sz) < 0) {
> > +				i--; /* Decrement count & skip this obj */
> > +				off += total_elt_sz;
> > +				continue;
> > +			}
> > +		}
> > +
> 
> What I don't like here is that it makes one memory chunk insufficient
> to populate all objects. I.e. we calculated memory chunk size required,
> but skipped some object. May be it is not a problem, but breaks the
> logic existing in the code.

+1

We should not skip the object, but realign it to the next page. This
way it won't break the logic, because this is what is expected when
required space is calculated in rte_mempool_op_calc_mem_size_default().
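
A rough sketch of that idea follows (illustrative only, not a tested patch;
it assumes a page size is somehow made available to the populate op --
get_obj_page_size() below is a hypothetical placeholder for that open
question -- and it reuses mempool_check_obj_bounds() from the patch):

int
rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned int max_objs,
		void *vaddr, rte_iova_t iova, size_t len,
		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
{
	size_t total_elt_sz;
	size_t pg_sz = get_obj_page_size();	/* hypothetical */
	size_t off;
	unsigned int i;
	void *obj;

	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;

	for (off = 0, i = 0; i < max_objs; i++) {
		/* realign to the start of the next page instead of skipping
		 * the object, so the chunk sized by
		 * rte_mempool_op_calc_mem_size_default() stays sufficient */
		if (pg_sz != 0 && total_elt_sz <= pg_sz &&
		    mempool_check_obj_bounds((char *)vaddr + off, pg_sz,
					     total_elt_sz) < 0)
			off += RTE_PTR_ALIGN_CEIL((char *)vaddr + off, pg_sz) -
					((char *)vaddr + off);

		if (off + total_elt_sz > len)
			break;

		off += mp->header_size;
		obj = (char *)vaddr + off;
		obj_cb(mp, obj_cb_arg, obj,
		       (iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off));
		rte_mempool_ops_enqueue_bulk(mp, &obj, 1);
		off += mp->elt_size + mp->trailer_size;
	}

	return i;
}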
  
Vamsi Krishna Attunuru July 17, 2019, 5:31 p.m. UTC | #3
> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Wednesday, July 17, 2019 7:07 PM
> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> olivier.matz@6wind.com; ferruh.yigit@intel.com;
> anatoly.burakov@intel.com; Kiran Kumar Kokkilagadda
> <kirankumark@marvell.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/4] mempool: modify mempool
> populate() to skip objects from page boundaries
> 
> On 7/17/19 12:04 PM, vattunuru@marvell.com wrote:
> > From: Vamsi Attunuru <vattunuru@marvell.com>
> >
> > Currently the phys address of a mempool object populated by the
> > mempool populate default() routine may not be contiguous with in that
> mbuf range.
> >
> > Patch ensures that each object's phys address is contiguous by
> > modifying default behaviour of mempool populate() to prevent objects
> > from being across 2 pages, expect if the size of object is bigger than size of
> page.
> >
> > Since the overhead after this modification will be very minimal
> > considering the hugepage sizes of 512M & 1G, default behaviour is
> > modified except for the object sizes bigger than the page size.
> >
> > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
> 
> NACK
> 
> Looking at MEMPOOL_F_NO_IOVA_CONTIG description I don't understand
> why the patch is necessary at all. So, I'd like to know more. Exact conditions,
> IOVA mode, hugepage sizes, mempool flags and how it is populated.
> 

I presume the commit log clarifies the changes in the patch; please correct me if it's not clear. The requirement is to create a mempool in which each object's physical address range is contiguous. Having the flexibility to create such mempools helps applications like KNI operate in IOVA=VA mode, where the KNI kernel module can safely translate IOVA addresses to PA.

Regarding the exact conditions that drove this approach: when the IOVA mode is set to VA and 2MB/512MB hugepages are used, the KNI application creates its mempool without any flags set, so the mempool populate routine reserves IOVA-contiguous memzones and ends up populating the mempool with some objects that span two pages (i.e. cross a page boundary). Those mbufs' physical addresses may not be contiguous, and such mempools are not suitable for operating KNI in IOVA=VA mode.
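
As an illustration of the failure mode (hypothetical helper, not part of the
patch): when an mbuf's data buffer straddles two hugepages, the two pages
backing it may not be physically adjacent, so a kernel-side translation of
the IOVA (== VA) based on the first page's mapping is wrong for the bytes
beyond the boundary. A sketch of the condition being avoided:

#include <stdint.h>

#include <rte_mbuf.h>

/* hypothetical check: does this mbuf's buffer cross a page of size pg_sz? */
static int
mbuf_buf_straddles_page(const struct rte_mbuf *m, uint64_t pg_sz)
{
	uintptr_t start = (uintptr_t)m->buf_addr;
	uintptr_t end = start + m->buf_len - 1;

	/* same page <=> same page-aligned base address (pg_sz power of two) */
	return (start & ~(pg_sz - 1)) != (end & ~(pg_sz - 1));
}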

> > ---
> >   lib/librte_mempool/rte_mempool.c             |  2 +-
> >   lib/librte_mempool/rte_mempool_ops_default.c | 33
> ++++++++++++++++++++++++++--
> >   2 files changed, 32 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_mempool/rte_mempool.c
> > b/lib/librte_mempool/rte_mempool.c
> > index 7260ce0..1c48325 100644
> > --- a/lib/librte_mempool/rte_mempool.c
> > +++ b/lib/librte_mempool/rte_mempool.c
> > @@ -339,7 +339,7 @@ rte_mempool_populate_iova(struct rte_mempool
> *mp, char *vaddr,
> >   	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
> >   		(char *)vaddr + off,
> >   		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
> > -		len - off, mempool_add_elem, NULL);
> > +		len - off, mempool_add_elem, opaque);
> 
> The last argument is the callback opaque value. mempool_add_elem() does
> not use the opaque. But it is incorrect to use it for other purposes and
> require it to be memzone.

To avoid making multiple changes across the mempool APIs just to add a new variable, the opaque has been leveraged here. I think there is no harm in this approach since it carries the relevant info and is quite suitable for passing between the (*mempool_populate_t) calls.

> 
> >   	/* not enough room to store one object */
> >   	if (i == 0) {
> > diff --git a/lib/librte_mempool/rte_mempool_ops_default.c
> > b/lib/librte_mempool/rte_mempool_ops_default.c
> > index 4e2bfc8..85da264 100644
> > --- a/lib/librte_mempool/rte_mempool_ops_default.c
> > +++ b/lib/librte_mempool/rte_mempool_ops_default.c
> > @@ -45,19 +45,48 @@ rte_mempool_op_calc_mem_size_default(const
> struct rte_mempool *mp,
> >   	return mem_size;
> >   }
> >
> > +/* Returns -1 if object falls on a page boundary, else returns 0 */
> > +static inline int mempool_check_obj_bounds(void *obj, uint64_t
> > +hugepage_sz, size_t elt_sz) {
> > +	uintptr_t page_end, elt_addr = (uintptr_t)obj;
> > +	uint32_t pg_shift = rte_bsf32(hugepage_sz);
> > +	uint64_t page_mask;
> > +
> > +	page_mask =  ~((1ull << pg_shift) - 1);
> > +	page_end = (elt_addr & page_mask) + hugepage_sz;
> > +
> > +	if (elt_addr + elt_sz > page_end)
> > +		return -1;
> > +
> > +	return 0;
> > +}
> > +
> >   int
> >   rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned
> int max_objs,
> >   		void *vaddr, rte_iova_t iova, size_t len,
> >   		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
> >   {
> > -	size_t total_elt_sz;
> > -	size_t off;
> > +	struct rte_memzone *mz = obj_cb_arg;
> > +	size_t total_elt_sz, off;
> 
> Why two variables are combined into one here? It is unrelated change.
> 
> >   	unsigned int i;
> >   	void *obj;
> >
> >   	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;
> >
> >   	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs;
> > i++) {
> > +
> > +		/* Skip page boundary check if element is bigger than page */
> > +		if (mz->hugepage_sz >= total_elt_sz) {
> > +			if (mempool_check_obj_bounds((char *)vaddr + off,
> > +						    mz->hugepage_sz,
> > +						    total_elt_sz) < 0) {
> > +				i--; /* Decrement count & skip this obj */
> > +				off += total_elt_sz;
> > +				continue;
> > +			}
> > +		}
> > +
> 
> What I don't like here is that it makes one memory chunk insufficient to
> populate all objects. I.e. we calculated memory chunk size required, but
> skipped some object. May be it is not a problem, but breaks the logic existing
> in the code.
> 
> >   		off += mp->header_size;
> >   		obj = (char *)vaddr + off;
> >   		obj_cb(mp, obj_cb_arg, obj,
>
  
Andrew Rybchenko July 18, 2019, 9:28 a.m. UTC | #4
On 7/17/19 8:31 PM, Vamsi Krishna Attunuru wrote:
>> -----Original Message-----
>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>> Sent: Wednesday, July 17, 2019 7:07 PM
>> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>; dev@dpdk.org
>> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>> olivier.matz@6wind.com; ferruh.yigit@intel.com;
>> anatoly.burakov@intel.com; Kiran Kumar Kokkilagadda
>> <kirankumark@marvell.com>
>> Subject: Re: [dpdk-dev] [PATCH v7 1/4] mempool: modify mempool
>> populate() to skip objects from page boundaries
>>
>> On 7/17/19 12:04 PM, vattunuru@marvell.com wrote:
>>> From: Vamsi Attunuru <vattunuru@marvell.com>
>>>
>>> Currently the phys address of a mempool object populated by the
>>> mempool populate default() routine may not be contiguous with in that
>> mbuf range.
>>> Patch ensures that each object's phys address is contiguous by
>>> modifying default behaviour of mempool populate() to prevent objects
>>> from being across 2 pages, expect if the size of object is bigger than size of
>> page.
>>> Since the overhead after this modification will be very minimal
>>> considering the hugepage sizes of 512M & 1G, default behaviour is
>>> modified except for the object sizes bigger than the page size.
>>>
>>> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
>>> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
>> NACK
>>
>> Looking at MEMPOOL_F_NO_IOVA_CONTIG description I don't understand
>> why the patch is necessary at all. So, I'd like to know more. Exact conditions,
>> IOVA mode, hugepage sizes, mempool flags and how it is populated.
>>
> I presume the commit log clarifies the  changes in the patch, pls correct me if it's not clear. The requirement is to create mempool of objects that each object's phys address in contiguous with in it's range,  having flexibility to create such mempools helpful to the applications like KNI to operate in IOVA=VA mode where KNI kernel mode can safely translate IOVA addresses to PA.

As I understand it, this breaks the rte_mempool_populate_default() logic,
which assumes that page boundaries may be ignored in IOVA=VA
mode (see no_pageshift = ... and the description above it in the function).

Sorry, right now I can't come up with the right fix, but as
explained below the suggested approach is wrong.
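
For context, the page-shift logic being referred to is roughly the following
(paraphrased from memory of the code at the time of this series, so treat it
as a sketch rather than an exact quote):

	/* page boundaries are deliberately ignored in IOVA=VA mode (and with
	 * MEMPOOL_F_NO_IOVA_CONTIG): pg_sz/pg_shift are forced to zero, so
	 * neither calc_mem_size nor populate accounts for page crossings */
	no_pageshift = no_contig || rte_eal_iova_mode() == RTE_IOVA_VA;
	if (no_pageshift) {
		pg_sz = 0;
		pg_shift = 0;
	}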

> Regarding the exact conditions driven this approach are, when IOVA mode is set to VA and huge page size of 2MB/512MB used, since KNI application creates mempool without any flags set, mempool populate routine tends to reserve iova contiguous memzones and finally it ends up populating mempool with some objects that might being across two pages(over the page boundary), those mbuf's phys address might not be contiguous and these mempool are not suitable for operating KNI in IOVA=VA mode.
>
>>> ---
>>>    lib/librte_mempool/rte_mempool.c             |  2 +-
>>>    lib/librte_mempool/rte_mempool_ops_default.c | 33
>> ++++++++++++++++++++++++++--
>>>    2 files changed, 32 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.c
>>> b/lib/librte_mempool/rte_mempool.c
>>> index 7260ce0..1c48325 100644
>>> --- a/lib/librte_mempool/rte_mempool.c
>>> +++ b/lib/librte_mempool/rte_mempool.c
>>> @@ -339,7 +339,7 @@ rte_mempool_populate_iova(struct rte_mempool
>> *mp, char *vaddr,
>>>    	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
>>>    		(char *)vaddr + off,
>>>    		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
>>> -		len - off, mempool_add_elem, NULL);
>>> +		len - off, mempool_add_elem, opaque);
>> The last argument is the callback opaque value. mempool_add_elem() does
>> not use the opaque. But it is incorrect to use it for other purposes and
>> require it to be memzone.
> To avoid multiple changes in the mempool APIs for adding new variable, opaque has been leveraged here. I think there is no harm in having this approach since it carries the relevant info and quite suitable between the (*mempool_populate_t) calls.

Sorry, but it forces any caller of rte_mempool_populate_iova()
(and rte_mempool_populate_virt()) to pass a memzone as opaque, whereas
by API definition it is the free_cb argument. So, if a different free_cb is
used, a different opaque may be required.

>>>    	/* not enough room to store one object */
>>>    	if (i == 0) {
>>> diff --git a/lib/librte_mempool/rte_mempool_ops_default.c
>>> b/lib/librte_mempool/rte_mempool_ops_default.c
>>> index 4e2bfc8..85da264 100644
>>> --- a/lib/librte_mempool/rte_mempool_ops_default.c
>>> +++ b/lib/librte_mempool/rte_mempool_ops_default.c
>>> @@ -45,19 +45,48 @@ rte_mempool_op_calc_mem_size_default(const
>> struct rte_mempool *mp,
>>>    	return mem_size;
>>>    }
>>>
>>> +/* Returns -1 if object falls on a page boundary, else returns 0 */
>>> +static inline int mempool_check_obj_bounds(void *obj, uint64_t
>>> +hugepage_sz, size_t elt_sz) {
>>> +	uintptr_t page_end, elt_addr = (uintptr_t)obj;
>>> +	uint32_t pg_shift = rte_bsf32(hugepage_sz);
>>> +	uint64_t page_mask;
>>> +
>>> +	page_mask =  ~((1ull << pg_shift) - 1);
>>> +	page_end = (elt_addr & page_mask) + hugepage_sz;
>>> +
>>> +	if (elt_addr + elt_sz > page_end)
>>> +		return -1;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>    int
>>>    rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned
>> int max_objs,
>>>    		void *vaddr, rte_iova_t iova, size_t len,
>>>    		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
>>>    {
>>> -	size_t total_elt_sz;
>>> -	size_t off;
>>> +	struct rte_memzone *mz = obj_cb_arg;

Moreover, the change forces obj_cb_arg to be a memzone, but it is the
obj_cb argument, and obj_cb defines what should be in obj_cb_arg.

>>> +	size_t total_elt_sz, off;
>> Why two variables are combined into one here? It is unrelated change.
>>
>>>    	unsigned int i;
>>>    	void *obj;
>>>
>>>    	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;
>>>
>>>    	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs;
>>> i++) {
>>> +
>>> +		/* Skip page boundary check if element is bigger than page */
>>> +		if (mz->hugepage_sz >= total_elt_sz) {
>>> +			if (mempool_check_obj_bounds((char *)vaddr + off,
>>> +						    mz->hugepage_sz,
>>> +						    total_elt_sz) < 0) {
>>> +				i--; /* Decrement count & skip this obj */
>>> +				off += total_elt_sz;
>>> +				continue;
>>> +			}
>>> +		}
>>> +
>> What I don't like here is that it makes one memory chunk insufficient to
>> populate all objects. I.e. we calculated memory chunk size required, but
>> skipped some object. May be it is not a problem, but breaks the logic existing
>> in the code.
>>
>>>    		off += mp->header_size;
>>>    		obj = (char *)vaddr + off;
>>>    		obj_cb(mp, obj_cb_arg, obj,
  
Vamsi Krishna Attunuru July 18, 2019, 2:16 p.m. UTC | #5
> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Thursday, July 18, 2019 2:58 PM
> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> olivier.matz@6wind.com; ferruh.yigit@intel.com;
> anatoly.burakov@intel.com; Kiran Kumar Kokkilagadda
> <kirankumark@marvell.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/4] mempool: modify mempool
> populate() to skip objects from page boundaries
> 
> On 7/17/19 8:31 PM, Vamsi Krishna Attunuru wrote:
> >> -----Original Message-----
> >> From: Andrew Rybchenko <arybchenko@solarflare.com>
> >> Sent: Wednesday, July 17, 2019 7:07 PM
> >> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>; dev@dpdk.org
> >> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran
> >> <jerinj@marvell.com>; olivier.matz@6wind.com; ferruh.yigit@intel.com;
> >> anatoly.burakov@intel.com; Kiran Kumar Kokkilagadda
> >> <kirankumark@marvell.com>
> >> Subject: Re: [dpdk-dev] [PATCH v7 1/4] mempool: modify mempool
> >> populate() to skip objects from page boundaries
> >>
> >> On 7/17/19 12:04 PM, vattunuru@marvell.com wrote:
> >>> From: Vamsi Attunuru <vattunuru@marvell.com>
> >>>
> >>> Currently the phys address of a mempool object populated by the
> >>> mempool populate default() routine may not be contiguous with in
> >>> that
> >> mbuf range.
> >>> Patch ensures that each object's phys address is contiguous by
> >>> modifying default behaviour of mempool populate() to prevent objects
> >>> from being across 2 pages, expect if the size of object is bigger
> >>> than size of
> >> page.
> >>> Since the overhead after this modification will be very minimal
> >>> considering the hugepage sizes of 512M & 1G, default behaviour is
> >>> modified except for the object sizes bigger than the page size.
> >>>
> >>> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> >>> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
> >> NACK
> >>
> >> Looking at MEMPOOL_F_NO_IOVA_CONTIG description I don't
> understand
> >> why the patch is necessary at all. So, I'd like to know more. Exact
> >> conditions, IOVA mode, hugepage sizes, mempool flags and how it is
> populated.
> >>
> > I presume the commit log clarifies the  changes in the patch, pls correct me
> if it's not clear. The requirement is to create mempool of objects that each
> object's phys address in contiguous with in it's range,  having flexibility to
> create such mempools helpful to the applications like KNI to operate in
> IOVA=VA mode where KNI kernel mode can safely translate IOVA addresses
> to PA.
> 
> As I understand it breaks rte_mempool_populate_default() logic which
> assumes that page boundaries may be ignored in IOVA=VA mode (see
> no_pageshift =  ... and above description in the function).
> 
> Sorry, right now I can't come up with the right fix, but as explained below
> suggested is wrong.

It would be nice if how this approach goes wrong were also explained. Earlier we had a flag-based approach and later made it the default, as Olivier suggested. If there is no other way to fix it, can we stick with this approach or roll back to the earlier flag-based one?

@Olivier,
Any suggestions..?
 
> 
> > Regarding the exact conditions driven this approach are, when IOVA mode
> is set to VA and huge page size of 2MB/512MB used, since KNI application
> creates mempool without any flags set, mempool populate routine tends to
> reserve iova contiguous memzones and finally it ends up populating
> mempool with some objects that might being across two pages(over the page
> boundary), those mbuf's phys address might not be contiguous and these
> mempool are not suitable for operating KNI in IOVA=VA mode.
> >
> >>> ---
> >>>    lib/librte_mempool/rte_mempool.c             |  2 +-
> >>>    lib/librte_mempool/rte_mempool_ops_default.c | 33
> >> ++++++++++++++++++++++++++--
> >>>    2 files changed, 32 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/lib/librte_mempool/rte_mempool.c
> >>> b/lib/librte_mempool/rte_mempool.c
> >>> index 7260ce0..1c48325 100644
> >>> --- a/lib/librte_mempool/rte_mempool.c
> >>> +++ b/lib/librte_mempool/rte_mempool.c
> >>> @@ -339,7 +339,7 @@ rte_mempool_populate_iova(struct
> rte_mempool
> >> *mp, char *vaddr,
> >>>    	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
> >>>    		(char *)vaddr + off,
> >>>    		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
> >>> -		len - off, mempool_add_elem, NULL);
> >>> +		len - off, mempool_add_elem, opaque);
> >> The last argument is the callback opaque value. mempool_add_elem()
> >> does not use the opaque. But it is incorrect to use it for other
> >> purposes and require it to be memzone.
> > To avoid multiple changes in the mempool APIs for adding new variable,
> opaque has been leveraged here. I think there is no harm in having this
> approach since it carries the relevant info and quite suitable between the
> (*mempool_populate_t) calls.
> 
> Sorry, but it enforces any caller of rte_mempool_populate_iova() (and
> rte_mempool_populate_virt()) pass memzone as opaque, but by API
> definition it is free_cb argument. So, if different free_cb is used, different
> opaque may be required.

This approach simplifies things and avoids multiple changes; could this parameter usage be documented to avoid confusion?
Otherwise there will be lots of changes across multiple APIs to pass the mz info. If everyone is fine with those changes, I will make the required changes and send the next version.

> 
> >>>    	/* not enough room to store one object */
> >>>    	if (i == 0) {
> >>> diff --git a/lib/librte_mempool/rte_mempool_ops_default.c
> >>> b/lib/librte_mempool/rte_mempool_ops_default.c
> >>> index 4e2bfc8..85da264 100644
> >>> --- a/lib/librte_mempool/rte_mempool_ops_default.c
> >>> +++ b/lib/librte_mempool/rte_mempool_ops_default.c
> >>> @@ -45,19 +45,48 @@ rte_mempool_op_calc_mem_size_default(const
> >> struct rte_mempool *mp,
> >>>    	return mem_size;
> >>>    }
> >>>
> >>> +/* Returns -1 if object falls on a page boundary, else returns 0 */
> >>> +static inline int mempool_check_obj_bounds(void *obj, uint64_t
> >>> +hugepage_sz, size_t elt_sz) {
> >>> +	uintptr_t page_end, elt_addr = (uintptr_t)obj;
> >>> +	uint32_t pg_shift = rte_bsf32(hugepage_sz);
> >>> +	uint64_t page_mask;
> >>> +
> >>> +	page_mask =  ~((1ull << pg_shift) - 1);
> >>> +	page_end = (elt_addr & page_mask) + hugepage_sz;
> >>> +
> >>> +	if (elt_addr + elt_sz > page_end)
> >>> +		return -1;
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>>    int
> >>>    rte_mempool_op_populate_default(struct rte_mempool *mp,
> unsigned
> >> int max_objs,
> >>>    		void *vaddr, rte_iova_t iova, size_t len,
> >>>    		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
> >>>    {
> >>> -	size_t total_elt_sz;
> >>> -	size_t off;
> >>> +	struct rte_memzone *mz = obj_cb_arg;
> 
> Moreover, the change enforces obj_cb_arg to be memzone, but it is obj_cb
> argument and obj_cb defines what should be in the obj_cb_arg.

Will revert this while sending next version.

> 
> >>> +	size_t total_elt_sz, off;
> >> Why two variables are combined into one here? It is unrelated change.
> >>
> >>>    	unsigned int i;
> >>>    	void *obj;
> >>>
> >>>    	total_elt_sz = mp->header_size + mp->elt_size +
> >>> mp->trailer_size;
> >>>
> >>>    	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs;
> >>> i++) {
> >>> +
> >>> +		/* Skip page boundary check if element is bigger than page */
> >>> +		if (mz->hugepage_sz >= total_elt_sz) {
> >>> +			if (mempool_check_obj_bounds((char *)vaddr + off,
> >>> +						    mz->hugepage_sz,
> >>> +						    total_elt_sz) < 0) {
> >>> +				i--; /* Decrement count & skip this obj */
> >>> +				off += total_elt_sz;
> >>> +				continue;
> >>> +			}
> >>> +		}
> >>> +
> >> What I don't like here is that it makes one memory chunk insufficient
> >> to populate all objects. I.e. we calculated memory chunk size
> >> required, but skipped some object. May be it is not a problem, but
> >> breaks the logic existing in the code.
> >>
> >>>    		off += mp->header_size;
> >>>    		obj = (char *)vaddr + off;
> >>>    		obj_cb(mp, obj_cb_arg, obj,
  

Patch

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 7260ce0..1c48325 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -339,7 +339,7 @@  rte_mempool_populate_iova(struct rte_mempool *mp, char *vaddr,
 	i = rte_mempool_ops_populate(mp, mp->size - mp->populated_size,
 		(char *)vaddr + off,
 		(iova == RTE_BAD_IOVA) ? RTE_BAD_IOVA : (iova + off),
-		len - off, mempool_add_elem, NULL);
+		len - off, mempool_add_elem, opaque);
 
 	/* not enough room to store one object */
 	if (i == 0) {
diff --git a/lib/librte_mempool/rte_mempool_ops_default.c b/lib/librte_mempool/rte_mempool_ops_default.c
index 4e2bfc8..85da264 100644
--- a/lib/librte_mempool/rte_mempool_ops_default.c
+++ b/lib/librte_mempool/rte_mempool_ops_default.c
@@ -45,19 +45,48 @@  rte_mempool_op_calc_mem_size_default(const struct rte_mempool *mp,
 	return mem_size;
 }
 
+/* Returns -1 if object falls on a page boundary, else returns 0 */
+static inline int
+mempool_check_obj_bounds(void *obj, uint64_t hugepage_sz, size_t elt_sz)
+{
+	uintptr_t page_end, elt_addr = (uintptr_t)obj;
+	uint32_t pg_shift = rte_bsf32(hugepage_sz);
+	uint64_t page_mask;
+
+	page_mask =  ~((1ull << pg_shift) - 1);
+	page_end = (elt_addr & page_mask) + hugepage_sz;
+
+	if (elt_addr + elt_sz > page_end)
+		return -1;
+
+	return 0;
+}
+
 int
 rte_mempool_op_populate_default(struct rte_mempool *mp, unsigned int max_objs,
 		void *vaddr, rte_iova_t iova, size_t len,
 		rte_mempool_populate_obj_cb_t *obj_cb, void *obj_cb_arg)
 {
-	size_t total_elt_sz;
-	size_t off;
+	struct rte_memzone *mz = obj_cb_arg;
+	size_t total_elt_sz, off;
 	unsigned int i;
 	void *obj;
 
 	total_elt_sz = mp->header_size + mp->elt_size + mp->trailer_size;
 
 	for (off = 0, i = 0; off + total_elt_sz <= len && i < max_objs; i++) {
+
+		/* Skip page boundary check if element is bigger than page */
+		if (mz->hugepage_sz >= total_elt_sz) {
+			if (mempool_check_obj_bounds((char *)vaddr + off,
+						    mz->hugepage_sz,
+						    total_elt_sz) < 0) {
+				i--; /* Decrement count & skip this obj */
+				off += total_elt_sz;
+				continue;
+			}
+		}
+
 		off += mp->header_size;
 		obj = (char *)vaddr + off;
 		obj_cb(mp, obj_cb_arg, obj,