[3/7] net/bonding: change mbuf pool and ring allocation

Message ID 1639592401-56845-4-git-send-email-rsanford@akamai.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series net/bonding: fixes and LACP short timeout

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Robert Sanford Dec. 15, 2021, 6:19 p.m. UTC
- Turn off mbuf pool caching to avoid mbufs lingering in pool caches.
  At most, we transmit one LACPDU per second, per port.
- Fix calculation of ring sizes, taking into account that a ring of
  size N holds up to N-1 items.

Signed-off-by: Robert Sanford <rsanford@akamai.com>
---
 drivers/net/bonding/rte_eth_bond_8023ad.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
  

Comments

humin (Q) Dec. 16, 2021, 8:59 a.m. UTC | #1
Hi, Robert,

On 2021/12/16 2:19, Robert Sanford wrote:
> - Turn off mbuf pool caching to avoid mbufs lingering in pool caches.
>    At most, we transmit one LACPDU per second, per port.
Could you be more detailed: why is mbuf pool caching not needed?

> - Fix calculation of ring sizes, taking into account that a ring of
>    size N holds up to N-1 items.
Similarly, why should we reserve another item?
> 
By the way, I found that the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS
is wrong; could you fix it in this patch?
> Signed-off-by: Robert Sanford <rsanford@akamai.com>
> ---
>   drivers/net/bonding/rte_eth_bond_8023ad.c | 14 ++++++++------
>   1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
> index 43231bc..83d3938 100644
> --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
> +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
> @@ -1101,9 +1101,7 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
>   	}
>   
>   	snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
> -	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc,
> -		RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
> -			32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
> +	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc, 0,
>   		0, element_size, socket_id);
>   
>   	/* Any memory allocation failure in initialization is critical because
> @@ -1113,19 +1111,23 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
>   			slave_id, mem_name, rte_strerror(rte_errno));
>   	}
>   
> +	/* Add one extra because ring reserves one. */
>   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
>   	port->rx_ring = rte_ring_create(mem_name,
> -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS), socket_id, 0);
> +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS + 1),
> +			socket_id, 0);
>   
>   	if (port->rx_ring == NULL) {
>   		rte_panic("Slave %u: Failed to create rx ring '%s': %s\n", slave_id,
>   			mem_name, rte_strerror(rte_errno));
>   	}
>   
> -	/* TX ring is at least one pkt longer to make room for marker packet. */
> +	/* TX ring is at least one pkt longer to make room for marker packet.
> +	 * Add one extra because ring reserves one. */
>   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
>   	port->tx_ring = rte_ring_create(mem_name,
> -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 1), socket_id, 0);
> +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 2),
> +			socket_id, 0);
>   
>   	if (port->tx_ring == NULL) {
>   		rte_panic("Slave %u: Failed to create tx ring '%s': %s\n", slave_id,
>
  
Sanford, Robert Dec. 17, 2021, 7:49 p.m. UTC | #2
Hello Connor,

Thank you for the questions and comments. I will repeat the questions, followed by my answers.

Q: Could you be more detailed, why is mbuf pool caching not needed?

A: The short answer: under certain conditions, we can run out of
buffers from that small, LACPDU-mempool. We actually saw this occur
in production, on mostly-idle links.

For a long explanation, let's assume the following:
1. 1 tx-queue per bond and underlying ethdev ports.
2. 256 tx-descriptors (per ethdev port).
3. 257 mbufs in each port's LACPDU-pool, as computed by
bond_mode_8023ad_activate_slave(), and cache-size 32.
4. The "app" xmits zero packets to this bond for a long time.
5. In EAL intr thread context, LACP tx_machine() allocates 1 mbuf
(LACPDU) per second from the pool, and puts it into LACP tx-ring.
6. Every second, another thread, let's call it the tx-core, calls
tx-burst (with zero packets to xmit), finds 1 mbuf on LACP tx-ring,
and underlying ethdev PMD puts mbuf data into a tx-desc.
7. PMD tx-burst configured not to clean up used tx-descs until
there are almost none free, e.g., less than pool's cache-size *
CACHE_FLUSH_THRESH_MULTIPLIER (1.5).
8. When cleaning up tx-descs, we may leave up to 47 mbufs in the
tx-core's LACPDU-pool cache (not accessible from intr thread).

When the number of used tx-descs (0..255) + number of mbufs in the
cache (0..47) reaches 257, then allocation fails.
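
As a rough standalone illustration of that arithmetic (assumed values
from points 1-8 above; the 1.5 factor mirrors the mempool cache-flush
multiplier, this is not driver code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int tx_descs = 256;                /* tx-descs per slave port */
        unsigned int pool_size = tx_descs + 1;      /* 257 mbufs in LACPDU pool */
        unsigned int cache_size = 32;
        unsigned int flush_thresh = cache_size * 3 / 2; /* 48 */
        unsigned int max_cached = flush_thresh - 1;     /* up to 47 per tx-core */

        if (tx_descs + max_cached >= pool_size)
            printf("exhaustion possible: %u in-flight + %u cached >= %u\n",
                   tx_descs, max_cached, pool_size);
        return 0;
    }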

If I understand the LACP tx-burst code correctly, it would be
worse if nb_tx_queues > 1, because (assuming multiple tx-cores)
any queue/lcore could xmit an LACPDU. Thus, up to nb_tx_queues *
47 mbufs could be cached, and not accessible from tx_machine().

You would not see this problem if the app xmits other (non-LACP)
mbufs on a regular basis, to expedite the clean-up of tx-descs
including LACPDU mbufs (unless nb_tx_queues tx-core caches
could hold all LACPDU mbufs).

If we make mempool's cache size 0, then allocation will not fail.

A mempool cache for LACPDUs does not offer much additional speed:
during alloc, the intr thread does not have default mempool caches
(AFAIK); and the average time between frees is either 1 second (LACP
short timeouts) or 30 seconds (long timeouts), i.e., infrequent.

--------

Q: Why reserve one additional slot in the rx and tx rings?

A: rte_ring_create() requires the ring size N to be a power of 2,
but the ring can only store N-1 items. Thus, if we want to store X
items, we need to ask for (at least) X+1. The original code fails
when the desired size is already a power of 2, because in that case
rte_align32pow2 does not round up.

For example, say we want a ring to hold 4:

    rte_ring_create(... rte_align32pow2(4) ...)

rte_align32pow2(4) returns 4, and we end up with a ring that only
stores 3 items.

    rte_ring_create(... rte_align32pow2(4+1) ...)

rte_align32pow2(5) returns 8, and we end up with a ring that
stores up to 7 items, more than we need, but acceptable.
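
A minimal standalone sketch of the same pitfall (illustrative only;
the "demo" ring name and SOCKET_ID_ANY are placeholders, not driver
code):

    #include <rte_common.h>
    #include <rte_memory.h>
    #include <rte_ring.h>

    void ring_size_demo(void)
    {
        struct rte_ring *r;

        r = rte_ring_create("demo", rte_align32pow2(4), SOCKET_ID_ANY, 0);
        /* rte_ring_get_capacity(r) == 3: one slot is reserved by the ring. */
        rte_ring_free(r);

        r = rte_ring_create("demo", rte_align32pow2(4 + 1), SOCKET_ID_ANY, 0);
        /* Size rounds up to 8, capacity 7: room for the 4 entries we need. */
        rte_ring_free(r);
    }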

--------

Q: I found the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS is
wrong, could you fix it in this patch?

A: Yes, I will fix it in the next version of the patch.

--
Regards,
Robert Sanford


On 12/16/21, 4:01 AM, "Min Hu (Connor)" <humin29@huawei.com> wrote:

    Hi, Robert,

    On 2021/12/16 2:19, Robert Sanford wrote:
    > - Turn off mbuf pool caching to avoid mbufs lingering in pool caches.
    >    At most, we transmit one LACPDU per second, per port.
    Could you be more detailed: why is mbuf pool caching not needed?

    > - Fix calculation of ring sizes, taking into account that a ring of
    >    size N holds up to N-1 items.
    Similarly, why should we reserve another item?
    > 
    By the way, I found that the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS
    is wrong; could you fix it in this patch?
    > Signed-off-by: Robert Sanford <rsanford@akamai.com>
    > ---
    >   drivers/net/bonding/rte_eth_bond_8023ad.c | 14 ++++++++------
    >   1 file changed, 8 insertions(+), 6 deletions(-)
    > 
    > diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
    > index 43231bc..83d3938 100644
    > --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
    > +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
    > @@ -1101,9 +1101,7 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
    >   	}
    >   
    >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
    > -	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc,
    > -		RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
    > -			32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
    > +	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc, 0,
    >   		0, element_size, socket_id);
    >   
    >   	/* Any memory allocation failure in initialization is critical because
    > @@ -1113,19 +1111,23 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
    >   			slave_id, mem_name, rte_strerror(rte_errno));
    >   	}
    >   
    > +	/* Add one extra because ring reserves one. */
    >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
    >   	port->rx_ring = rte_ring_create(mem_name,
    > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS), socket_id, 0);
    > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS + 1),
    > +			socket_id, 0);
    >   
    >   	if (port->rx_ring == NULL) {
    >   		rte_panic("Slave %u: Failed to create rx ring '%s': %s\n", slave_id,
    >   			mem_name, rte_strerror(rte_errno));
    >   	}
    >   
    > -	/* TX ring is at least one pkt longer to make room for marker packet. */
    > +	/* TX ring is at least one pkt longer to make room for marker packet.
    > +	 * Add one extra because ring reserves one. */
    >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
    >   	port->tx_ring = rte_ring_create(mem_name,
    > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 1), socket_id, 0);
    > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 2),
    > +			socket_id, 0);
    >   
    >   	if (port->tx_ring == NULL) {
    >   		rte_panic("Slave %u: Failed to create tx ring '%s': %s\n", slave_id,
    >
  
humin (Q) Dec. 18, 2021, 3:44 a.m. UTC | #3
Hi, Sanford,
	Thanks for your detailed description; a few questions follow.

On 2021/12/18 3:49, Sanford, Robert wrote:
> Hello Connor,
> 
> Thank you for the questions and comments. I will repeat the questions, followed by my answers.
> 
> Q: Could you be more detailed, why is mbuf pool caching not needed?
> 
> A: The short answer: under certain conditions, we can run out of
> buffers from that small, LACPDU-mempool. We actually saw this occur
> in production, on mostly-idle links.
> 
> For a long explanation, let's assume the following:
> 1. 1 tx-queue per bond and underlying ethdev ports.
> 2. 256 tx-descriptors (per ethdev port).
> 3. 257 mbufs in each port's LACPDU-pool, as computed by
> bond_mode_8023ad_activate_slave(), and cache-size 32.
> 4. The "app" xmits zero packets to this bond for a long time.
> 5. In EAL intr thread context, LACP tx_machine() allocates 1 mbuf
> (LACPDU) per second from the pool, and puts it into LACP tx-ring.
> 6. Every second, another thread, let's call it the tx-core, calls
> tx-burst (with zero packets to xmit), finds 1 mbuf on LACP tx-ring,
> and underlying ethdev PMD puts mbuf data into a tx-desc.
> 7. PMD tx-burst configured not to clean up used tx-descs until
> there are almost none free, e.g., less than pool's cache-size *
> CACHE_FLUSH_THRESH_MULTIPLIER (1.5).
> 8. When cleaning up tx-descs, we may leave up to 47 mbufs in the
> tx-core's LACPDU-pool cache (not accessible from intr thread).
> 
> When the number of used tx-descs (0..255) + number of mbufs in the
> cache (0..47) reaches 257, then allocation fails.
> 
> If I understand the LACP tx-burst code correctly, it would be
> worse if nb_tx_queues > 1, because (assuming multiple tx-cores)
> any queue/lcore could xmit an LACPDU. Thus, up to nb_tx_queues *
> 47 mbufs could be cached, and not accessible from tx_machine().
> 
> You would not see this problem if the app xmits other (non-LACP)
> mbufs on a regular basis, to expedite the clean-up of tx-descs
> including LACPDU mbufs (unless nb_tx_queues tx-core caches
> could hold all LACPDU mbufs).
> 
I think we do not see this problem only because the mempool can
offer many more mbufs than the cache size in the non-LACP case.

> If we make mempool's cache size 0, then allocation will not fail.
How about enlarging the mempool size, e.g., up to 4096? I think
that could also avoid this bug.
> 
> A mempool cache for LACPDUs does not offer much additional speed:
> during alloc, the intr thread does not have default mempool caches
Why? As far as I know, every core has its own default mempool cache?
> (AFAIK); and the average time between frees is either 1 second (LACP
> short timeouts) or 30 seconds (long timeouts), i.e., infrequent.
> 
> --------
> 
> Q: Why reserve one additional slot in the rx and tx rings?
> 
> A: rte_ring_create() requires the ring size N, to be a power of 2,
> but it can only store N-1 items. Thus, if we want to store X items,
Hi, Robert, could you explain this for me?
I cannot understand why it can
"only store N-1 items". I checked the source code; it says:
"The real usable ring size is *count-1* instead of *count* to
differentiate a free ring from an empty ring."
But I still cannot understand what that means.

> we need to ask for (at least) X+1. Original code fails when the real
> desired size is a power of 2, because in such a case, align32pow2
> does not round up.
> 
> For example, say we want a ring to hold 4:
> 
>      rte_ring_create(... rte_align32pow2(4) ...)
> 
> rte_align32pow2(4) returns 4, and we end up with a ring that only
> stores 3 items.
> 
>      rte_ring_create(... rte_align32pow2(4+1) ...)
> 
> rte_align32pow2(5) returns 8, and we end up with a ring that
> stores up to 7 items, more than we need, but acceptable.
To fix the bug, how about just setting the flag RING_F_EXACT_SZ?

> 
> --------
> 
> Q: I found the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS is
> wrong, could you fix it in this patch?
> 
> A: Yes, I will fix it in the next version of the patch.
Thanks.
> 
> --
> Regards,
> Robert Sanford
> 
> 
> On 12/16/21, 4:01 AM, "Min Hu (Connor)" <humin29@huawei.com> wrote:
> 
>      Hi, Robert,
> 
>      On 2021/12/16 2:19, Robert Sanford wrote:
>      > - Turn off mbuf pool caching to avoid mbufs lingering in pool caches.
>      >    At most, we transmit one LACPDU per second, per port.
>      Could you be more detailed: why is mbuf pool caching not needed?
> 
>      > - Fix calculation of ring sizes, taking into account that a ring of
>      >    size N holds up to N-1 items.
>      Similarly, why should we reserve another item?
>      >
>      By the way, I found that the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS
>      is wrong; could you fix it in this patch?
>      > Signed-off-by: Robert Sanford <rsanford@akamai.com>
>      > ---
>      >   drivers/net/bonding/rte_eth_bond_8023ad.c | 14 ++++++++------
>      >   1 file changed, 8 insertions(+), 6 deletions(-)
>      >
>      > diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
>      > index 43231bc..83d3938 100644
>      > --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
>      > +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
>      > @@ -1101,9 +1101,7 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
>      >   	}
>      >
>      >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
>      > -	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc,
>      > -		RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
>      > -			32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
>      > +	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc, 0,
>      >   		0, element_size, socket_id);
>      >
>      >   	/* Any memory allocation failure in initialization is critical because
>      > @@ -1113,19 +1111,23 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
>      >   			slave_id, mem_name, rte_strerror(rte_errno));
>      >   	}
>      >
>      > +	/* Add one extra because ring reserves one. */
>      >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
>      >   	port->rx_ring = rte_ring_create(mem_name,
>      > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS), socket_id, 0);
>      > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS + 1),
>      > +			socket_id, 0);
>      >
>      >   	if (port->rx_ring == NULL) {
>      >   		rte_panic("Slave %u: Failed to create rx ring '%s': %s\n", slave_id,
>      >   			mem_name, rte_strerror(rte_errno));
>      >   	}
>      >
>      > -	/* TX ring is at least one pkt longer to make room for marker packet. */
>      > +	/* TX ring is at least one pkt longer to make room for marker packet.
>      > +	 * Add one extra because ring reserves one. */
>      >   	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
>      >   	port->tx_ring = rte_ring_create(mem_name,
>      > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 1), socket_id, 0);
>      > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 2),
>      > +			socket_id, 0);
>      >
>      >   	if (port->tx_ring == NULL) {
>      >   		rte_panic("Slave %u: Failed to create tx ring '%s': %s\n", slave_id,
>      >
>
  
Sanford, Robert Dec. 20, 2021, 4:47 p.m. UTC | #4
Hello Connor,

Please see responses inline.

On 12/17/21, 10:44 PM, "Min Hu (Connor)" <humin29@huawei.com> wrote:

> > When the number of used tx-descs (0..255) + number of mbufs in the
> > cache (0..47) reaches 257, then allocation fails.
> > 
> > If I understand the LACP tx-burst code correctly, it would be
> > worse if nb_tx_queues > 1, because (assuming multiple tx-cores)
> > any queue/lcore could xmit an LACPDU. Thus, up to nb_tx_queues *
> > 47 mbufs could be cached, and not accessible from tx_machine().
> > 
> > You would not see this problem if the app xmits other (non-LACP)
> > mbufs on a regular basis, to expedite the clean-up of tx-descs
> > including LACPDU mbufs (unless nb_tx_queues tx-core caches
> > could hold all LACPDU mbufs).
> > 
> I think we do not see this problem only because the mempool can
> offer many more mbufs than the cache size in the non-LACP case.
>
> > If we make mempool's cache size 0, then allocation will not fail.
> How about enlarging the mempool size, e.g., up to 4096? I think
> that could also avoid this bug.
> > 
> > A mempool cache for LACPDUs does not offer much additional speed:
> > during alloc, the intr thread does not have default mempool caches
> Why? As far as I know, every core has its own default mempool cache?

These are private mbuf pools that we use *only* for LACPDUs, *one*
mbuf per second, at most. (When LACP link peer selects long timeouts,
we get/put one mbuf every 30 seconds.)

There is *NO* benefit for the consumer thread (interrupt thread
executing tx_machine()) to have caches on per-slave LACPDU pools.
The interrupt thread is a control thread, i.e., a non-EAL thread.
Its lcore_id is LCORE_ID_ANY, so it has no "default cache" in any
mempool.

There is little or no benefit for active data-plane threads to have
caches on per-slave LACPDU pools, because on each pool, the producer
thread puts back, at most, one mbuf per second. There is not much
contention with the consumer (interrupt thread).

I contend that caches are not necessary for these private LACPDU
mbuf pools, but just waste RAM and CPU-cache. If we still insist
on creating them *with* caches, then we should add at least
(cache-size x 1.5 x nb-tx-queues) mbufs per pool.
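
For example, a rough sizing helper might look like this (hypothetical,
not driver code; the 1.5 factor is the mempool cache-flush multiplier):

    /* Worst case, each tx-queue's lcore cache can strand about
     * cache_size * 1.5 mbufs, so pad the pool by that much per queue. */
    unsigned int
    lacpdu_pool_size_with_cache(unsigned int total_tx_desc,
            unsigned int cache_size, unsigned int nb_tx_queues)
    {
        return total_tx_desc + (cache_size * 3 / 2) * nb_tx_queues;
    }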

 
> > Q: Why reserve one additional slot in the rx and tx rings?
> > 
> > A: rte_ring_create() requires the ring size N, to be a power of 2,
> > but it can only store N-1 items. Thus, if we want to store X items,
> Hi, Robert, could you explain this for me?
> I cannot understand why it can
> "only store N-1 items". I checked the source code; it says:
> "The real usable ring size is *count-1* instead of *count* to
> differentiate a free ring from an empty ring."
> But I still cannot understand what that means.

I believe there is a mistake in the ring comments (in 3 places).
It would be better if they replace "free" with "full":
"... to differentiate a *full* ring from an empty ring."


> > we need to ask for (at least) X+1. Original code fails when the real
> > desired size is a power of 2, because in such a case, align32pow2
> > does not round up.
> > 
> > For example, say we want a ring to hold 4:
> > 
> >      rte_ring_create(... rte_align32pow2(4) ...)
> > 
> > rte_align32pow2(4) returns 4, and we end up with a ring that only
> > stores 3 items.
> > 
> >      rte_ring_create(... rte_align32pow2(4+1) ...)
> > 
> > rte_align32pow2(5) returns 8, and we end up with a ring that
> > stores up to 7 items, more than we need, but acceptable.
> To fix the bug, how about just setting the flag RING_F_EXACT_SZ?

Yes, this is a good idea. I will look for examples or test code that
use this flag.

 --
Regards,
Robert Sanford
  
humin (Q) Dec. 21, 2021, 2:01 a.m. UTC | #5
Hi, Sanford,

On 2021/12/21 0:47, Sanford, Robert wrote:
> Hello Connor,
> 
> Please see responses inline.
> 
> On 12/17/21, 10:44 PM, "Min Hu (Connor)" <humin29@huawei.com> wrote:
> 
>>> When the number of used tx-descs (0..255) + number of mbufs in the
>>> cache (0..47) reaches 257, then allocation fails.
>>>
>>> If I understand the LACP tx-burst code correctly, it would be
>>> worse if nb_tx_queues > 1, because (assuming multiple tx-cores)
>>> any queue/lcore could xmit an LACPDU. Thus, up to nb_tx_queues *
>>> 47 mbufs could be cached, and not accessible from tx_machine().
>>>
>>> You would not see this problem if the app xmits other (non-LACP)
>>> mbufs on a regular basis, to expedite the clean-up of tx-descs
>>> including LACPDU mbufs (unless nb_tx_queues tx-core caches
>>> could hold all LACPDU mbufs).
>>>
>> I think we do not see this problem only because the mempool can
>> offer many more mbufs than the cache size in the non-LACP case.
>>
>>> If we make mempool's cache size 0, then allocation will not fail.
>> How about enlarging the mempool size, e.g., up to 4096? I think
>> that could also avoid this bug.
>>>
>>> A mempool cache for LACPDUs does not offer much additional speed:
>>> during alloc, the intr thread does not have default mempool caches
>> Why? As far as I know, every core has its own default mempool cache?
> 
> These are private mbuf pools that we use *only* for LACPDUs, *one*
> mbuf per second, at most. (When LACP link peer selects long timeouts,
> we get/put one mbuf every 30 seconds.)
> 
> There is *NO* benefit for the consumer thread (interrupt thread
> executing tx_machine()) to have caches on per-slave LACPDU pools.
> The interrupt thread is a control thread, i.e., a non-EAL thread.
> Its lcore_id is LCORE_ID_ANY, so it has no "default cache" in any
> mempool.
Well, sorry, I forgot that the interrupt thread is a non-EAL thread.
> 
> There is little or no benefit for active data-plane threads to have
> caches on per-slave LACPDU pools, because on each pool, the producer
> thread puts back, at most, one mbuf per second. There is not much
> contention with the consumer (interrupt thread).
> 
> I contend that caches are not necessary for these private LACPDU
I agree with you.
> mbuf pools, but just waste RAM and CPU-cache. If we still insist
> on creating them *with* caches, then we should add at least
> (cache-size x 1.5 x nb-tx-queues) mbufs per pool.
> 
>   
>>> Q: Why reserve one additional slot in the rx and tx rings?
>>>
>>> A: rte_ring_create() requires the ring size N, to be a power of 2,
>>> but it can only store N-1 items. Thus, if we want to store X items,
>> Hi, Robert, could you explain this for me?
>> I cannot understand why it can
>> "only store N-1 items". I checked the source code; it says:
>> "The real usable ring size is *count-1* instead of *count* to
>> differentiate a free ring from an empty ring."
>> But I still cannot understand what that means.
> 
> I believe there is a mistake in the ring comments (in 3 places).
> It would be better if they replace "free" with "full":
> "... to differentiate a *full* ring from an empty ring."
> 
Well, I still cannot understand it. I think if the ring size is N, it
should store N items; why only N-1 items?
I look forward to your explanation, thanks.

> 
>>> we need to ask for (at least) X+1. Original code fails when the real
>>> desired size is a power of 2, because in such a case, align32pow2
>>> does not round up.
>>>
>>> For example, say we want a ring to hold 4:
>>>
>>>       rte_ring_create(... rte_align32pow2(4) ...)
>>>
>>> rte_align32pow2(4) returns 4, and we end up with a ring that only
>>> stores 3 items.
>>>
>>>       rte_ring_create(... rte_align32pow2(4+1) ...)
>>>
>>> rte_align32pow2(5) returns 8, and we end up with a ring that
>>> stores up to 7 items, more than we need, but acceptable.
>> To fix the bug, how about just setting the flag RING_F_EXACT_SZ?
> 
> Yes, this is a good idea. I will look for examples or test code that
> use this flag.
Yes, if fixed, LGTM.
> 
>   --
> Regards,
> Robert Sanford
> 
>
  
Sanford, Robert Dec. 21, 2021, 3:31 p.m. UTC | #6
Hi Connor,

On 12/20/21, 9:03 PM, "Min Hu (Connor)" <humin29@huawei.com> wrote:

> Hi, Sanford,

> > There is *NO* benefit for the consumer thread (interrupt thread
> > executing tx_machine()) to have caches on per-slave LACPDU pools.
> > The interrupt thread is a control thread, i.e., a non-EAL thread.
> > Its lcore_id is LCORE_ID_ANY, so it has no "default cache" in any
> > mempool.
> Well, sorry, I forgot that the interrupt thread is a non-EAL thread.

No problem. (I added a temporary rte_log statement in tx_machine
to make sure lcore_id == LCORE_ID_ANY.)

> > There is little or no benefit for active data-plane threads to have
> > caches on per-slave LACPDU pools, because on each pool, the producer
> > thread puts back, at most, one mbuf per second. There is not much
> > contention with the consumer (interrupt thread).
> > 
> > I contend that caches are not necessary for these private LACPDU
> I agree with you.

Thanks.

> > I believe there is a mistake in the ring comments (in 3 places).
> > It would be better if they replace "free" with "full":
> > "... to differentiate a *full* ring from an empty ring."
> > 
> Well, I still cannot understand it. I think if the ring size is N, it
> should store N items; why only N-1 items?
> I look forward to your explanation, thanks.

Here is an excellent article that describes ring buffers, empty vs full, N-1, etc.
https://embedjournal.com/implementing-circular-buffer-embedded-c/#the-full-vs-empty-problem
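
In short, with the usual head/tail scheme it looks something like this
(a toy sketch, not DPDK's implementation):

    #define RING_SIZE 8                 /* power of 2, as rte_ring requires */

    static unsigned int slots[RING_SIZE];
    static unsigned int head;           /* next write position */
    static unsigned int tail;           /* next read position  */

    /* head == tail must mean "empty", so "full" is declared one slot
     * early; a ring of size N can therefore hold at most N-1 items. */
    static int ring_empty(void) { return head == tail; }
    static int ring_full(void)  { return (head + 1) % RING_SIZE == tail; }

    static int ring_put(unsigned int v)
    {
        if (ring_full())
            return -1;                  /* only RING_SIZE - 1 == 7 items fit */
        slots[head] = v;
        head = (head + 1) % RING_SIZE;
        return 0;
    }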


> >> To fix the bug, how about just setting the flag RING_F_EXACT_SZ?
> > 
> > Yes, this is a good idea. I will look for examples or test code that
> > use this flag.
> Yes, if fixed, LGTM.

I will use the RING_F_EXACT_SZ flag in the next version of the patchset. I did not know about that flag.
	rte_ring_create(... N_PKTS ... RING_F_EXACT_SZ)
... is equivalent to, and looks cleaner than ...
	rte_ring_create(... rte_align32pow2(N_PKTS + 1) ... 0)
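
Roughly, in bond_mode_8023ad_activate_slave() the tx-ring creation
would become something like the following (a sketch only, not the
final patch):

    /* TX ring holds BOND_MODE_8023AX_SLAVE_TX_PKTS plus one marker
     * packet; RING_F_EXACT_SZ makes the library round up internally. */
    port->tx_ring = rte_ring_create(mem_name,
            BOND_MODE_8023AX_SLAVE_TX_PKTS + 1,
            socket_id, RING_F_EXACT_SZ);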

I plan to create a separate patchset to update the comments in rte_ring.h,
re RING_F_EXACT_SZ and "free" vs "full".

--
Regards,
Robert Sanford
  
humin (Q) Dec. 22, 2021, 3:25 a.m. UTC | #7
On 2021/12/21 23:31, Sanford, Robert wrote:
> Hi Connor,
> 
> On 12/20/21, 9:03 PM, "Min Hu (Connor)" <humin29@huawei.com> wrote:
> 
>> Hi, Sanford,
> 
>>> There is *NO* benefit for the consumer thread (interrupt thread
>>> executing tx_machine()) to have caches on per-slave LACPDU pools.
>>> The interrupt thread is a control thread, i.e., a non-EAL thread.
>>> Its lcore_id is LCORE_ID_ANY, so it has no "default cache" in any
>>> mempool.
>> Well, sorry, I forgot that the interrupt thread is a non-EAL thread.
> 
> No problem. (I added a temporary rte_log statement in tx_machine
> to make sure lcore_id == LCORE_ID_ANY.)
> 
>>> There is little or no benefit for active data-plane threads to have
>>> caches on per-slave LACPDU pools, because on each pool, the producer
>>> thread puts back, at most, one mbuf per second. There is not much
>>> contention with the consumer (interrupt thread).
>>>
>>> I contend that caches are not necessary for these private LACPDU
>> I agree with you.
> 
> Thanks.
> 
>>> I believe there is a mistake in the ring comments (in 3 places).
>>> It would be better if they replace "free" with "full":
>>> "... to differentiate a *full* ring from an empty ring."
>>>
>> Well, I still cannot understand it. I think if the ring size is N, it
>> should store N items; why only N-1 items?
>> I look forward to your explanation, thanks.
> 
> Here is an excellent article that describes ring buffers, empty vs full, N-1, etc.
> https://embedjournal.com/implementing-circular-buffer-embedded-c/#the-full-vs-empty-problem
> 
Thanks Sanford, I see. It is a characteristic of ring queues that
differs from common queues, like buffers.
> 
>>>> To fix the bug, how about just setting the flag RING_F_EXACT_SZ?
>>>
>>> Yes, this is a good idea. I will look for examples or test code that
>>> use this flag.
>> Yes, if fixed, LGTM.
> 
> I will use RING_F_EXACT_SZ flag in the next version of the patchset. I did not know about that flag.
> 	rte_ring_create(... N_PKTS ... RING_F_EXACT_SZ)
> ... is equivalent to, and looks cleaner than ...
> 	rte_ring_create(... rte_align32pow2(N_PKTS + 1) ... 0)
> 
> I plan to create a separate patchset to update the comments in rte_ring.h,
> re RING_F_EXACT_SZ and "free" vs "full".
> 
> --
> Regards,
> Robert Sanford
> 
>
  

Patch

diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index 43231bc..83d3938 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -1101,9 +1101,7 @@  bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
 	}
 
 	snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
-	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc,
-		RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
-			32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
+	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc, 0,
 		0, element_size, socket_id);
 
 	/* Any memory allocation failure in initialization is critical because
@@ -1113,19 +1111,23 @@  bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
 			slave_id, mem_name, rte_strerror(rte_errno));
 	}
 
+	/* Add one extra because ring reserves one. */
 	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
 	port->rx_ring = rte_ring_create(mem_name,
-			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS), socket_id, 0);
+			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS + 1),
+			socket_id, 0);
 
 	if (port->rx_ring == NULL) {
 		rte_panic("Slave %u: Failed to create rx ring '%s': %s\n", slave_id,
 			mem_name, rte_strerror(rte_errno));
 	}
 
-	/* TX ring is at least one pkt longer to make room for marker packet. */
+	/* TX ring is at least one pkt longer to make room for marker packet.
+	 * Add one extra because ring reserves one. */
 	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
 	port->tx_ring = rte_ring_create(mem_name,
-			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 1), socket_id, 0);
+			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 2),
+			socket_id, 0);
 
 	if (port->tx_ring == NULL) {
 		rte_panic("Slave %u: Failed to create tx ring '%s': %s\n", slave_id,