[v1,5/5] examples/l3fwd: enable direct rearm mode

Message ID 20220420081650.2043183-6-feifei.wang2@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Andrew Rybchenko
Headers
Series: Direct re-arming of buffers on receive side

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-intel-Functional success Functional Testing PASS
ci/intel-Testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS

Commit Message

Feifei Wang April 20, 2022, 8:16 a.m. UTC
  Enable direct rearm mode. The mapping is decided in the data plane based
on the first packet received.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
  

Comments

Morten Brørup April 20, 2022, 10:10 a.m. UTC | #1
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 20 April 2022 10.17
> 
> Enable direct rearm mode. The mapping is decided in the data plane
> based
> on the first packet received.

I usually don't care much about l3fwd, but putting configuration changes in the fast path is just wrong!

Also, l3fwd is often used for benchmarking, and this small piece of code in the fast path will affect benchmark results (although only very little).

Please move it out of the fast path.
  
Honnappa Nagarahalli April 21, 2022, 2:35 a.m. UTC | #2
<snip>

> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 20 April 2022 10.17
> >
> > Enable direct rearm mode. The mapping is decided in the data plane
> > based on the first packet received.
> 
> I usually don't care much about l3fwd, but putting configuration changes in the
> fast path is just wrong!
I would say it depends. In this case the cycles consumed by the API are minimal, and the configuration data is very small and already in the cache, as the PMD has accessed the same data structure.

If the configuration needs more cycles than a typical (depending on the application) data plane packet processing needs, or brings an enormous amount of data into the cache, it should not be done on the data plane.

> 
> Also, l3fwd is often used for benchmarking, and this small piece of code in the
> fast path will affect benchmark results (although only very little).
We do not see any impact on the performance numbers. The reason for putting it in the data plane was that it covers a wider use case in this l3fwd application. If the app were simpler, the configuration could be done from the control plane. Unfortunately, the performance of the l3fwd application matters.

> 
> Please move it out of the fast path.
  
Morten Brørup April 21, 2022, 6:40 a.m. UTC | #3
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, 21 April 2022 04.35
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 20 April 2022 10.17
> > >
> > > Enable direct rearm mode. The mapping is decided in the data plane
> > > based on the first packet received.
> >
> > I usually don't care much about l3fwd, but putting configuration
> changes in the
> > fast path is just wrong!
> I would say it depends. In this case the cycles consumed by the API are
> very less and configuration data is very small and is already in the
> cache as PMD has accessed the same data structure.
> 
> If the configuration needs more cycles than a typical (depending on the
> application) data plane packet processing needs or brings in enormous
> amount of data in to the cache, it should not be done on the data
> plane.
> 

As a matter of principle, configuration changes should be done outside the fast path.

If we allow an exception for this feature, it will set a bad precedent about where to put configuration code.

> >
> > Also, l3fwd is often used for benchmarking, and this small piece of
> code in the
> > fast path will affect benchmark results (although only very little).
> We do not see any impact on the performance numbers. The reason for
> putting in the data plane was it covers wider use case in this L3fwd
> application. If the app were to be simple, the configuration could be
> done from the control plane. Unfortunately, the performance of L3fwd
> application matters.
> 

Let's proceed down that path for the sake of discussion... Then the fast path is missing runtime verification that all preconditions for using remapping are present at any time.

> >
> > Please move it out of the fast path.

BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to enable the feature.

And finally, this feature should be disabled by default, and only enabled by a command line parameter or similar. Otherwise, future l3fwd NIC performance reports will provide misleading performance results, if the feature is utilized. Application developers, when comparing NIC performance results, don't care about the performance for this unique use case; they care about the performance for the generic use case.
  
Honnappa Nagarahalli May 10, 2022, 10:01 p.m. UTC | #4
(apologies for the late response, this one slipped my mind)

I would appreciate it if others could weigh in with their opinions.

<snip>
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, 21 April 2022 04.35
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Wednesday, 20 April 2022 10.17
> > > >
> > > > Enable direct rearm mode. The mapping is decided in the data plane
> > > > based on the first packet received.
> > >
> > > I usually don't care much about l3fwd, but putting configuration
> > changes in the
> > > fast path is just wrong!
> > I would say it depends. In this case the cycles consumed by the API
> > are very less and configuration data is very small and is already in
> > the cache as PMD has accessed the same data structure.
> >
> > If the configuration needs more cycles than a typical (depending on
> > the
> > application) data plane packet processing needs or brings in enormous
> > amount of data in to the cache, it should not be done on the data
> > plane.
> >
> 
> As a matter of principle, configuration changes should be done outside the fast
> path.
> 
> If we allow an exception for this feature, it will set a bad precedent about
> where to put configuration code.
I think there are other examples, though not exactly the same. For example, the seqlock: we cannot have a scheduled-out writer while holding the lock. But it was mentioned that this can be overcome easily by running the writer on an isolated core (which, to me, breaks some principles).

> 
> > >
> > > Also, l3fwd is often used for benchmarking, and this small piece of
> > code in the
> > > fast path will affect benchmark results (although only very little).
> > We do not see any impact on the performance numbers. The reason for
> > putting in the data plane was it covers wider use case in this L3fwd
> > application. If the app were to be simple, the configuration could be
> > done from the control plane. Unfortunately, the performance of L3fwd
> > application matters.
> >
> 
> Let's proceed down that path for the sake of discussion... Then the fast path is
> missing runtime verification that all preconditions for using remapping are
> present at any time.
Agreed, a few checks (ensuring that TX and RX buffers are from the same pool, and that tx_rs_thresh is the same as the RX rearm threshold) are missing.
We will add these; it is possible to add these checks outside the packet processing loop.

> 
> > >
> > > Please move it out of the fast path.
> 
> BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to enable
> the feature.
> 
> And finally, this feature should be disabled by default, and only enabled by a
> command line parameter or similar. Otherwise, future l3fwd NIC performance
> reports will provide misleading performance results, if the feature is utilized.
> Application developers, when comparing NIC performance results, don't care
> about the performance for this unique use case; they care about the
> performance for the generic use case.
> 
I think this feature is similar to the fast free feature (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE), as you have mentioned in the other thread. It should be handled similarly to how the fast free feature is handled.
  
Morten Brørup May 11, 2022, 7:17 a.m. UTC | #5
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 11 May 2022 00.02
> 
> (apologies for the late response, this one slipped my mind)
> 
> Appreciate if others could weigh their opinions.
> 
> <snip>
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Thursday, 21 April 2022 04.35
> > > >
> > > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > > Sent: Wednesday, 20 April 2022 10.17
> > > > >
> > > > > Enable direct rearm mode. The mapping is decided in the data
> plane
> > > > > based on the first packet received.
> > > >
> > > > I usually don't care much about l3fwd, but putting configuration
> > > changes in the
> > > > fast path is just wrong!
> > > I would say it depends. In this case the cycles consumed by the API
> > > are very less and configuration data is very small and is already
> in
> > > the cache as PMD has accessed the same data structure.
> > >
> > > If the configuration needs more cycles than a typical (depending on
> > > the
> > > application) data plane packet processing needs or brings in
> enormous
> > > amount of data in to the cache, it should not be done on the data
> > > plane.
> > >
> >
> > As a matter of principle, configuration changes should be done
> outside the fast
> > path.
> >
> > If we allow an exception for this feature, it will set a bad
> precedent about
> > where to put configuration code.
> I think there are other examples though not exactly the same. For ex:
> the seqlock, we cannot have a scheduled out writer while holding the
> lock. But, it was mentioned that this can be over come easily by
> running the writer on an isolated core (which to me breaks some
> principles).

Referring to a bad example (which breaks some principles) does not change my opinion. ;-)

> 
> >
> > > >
> > > > Also, l3fwd is often used for benchmarking, and this small piece
> of
> > > code in the
> > > > fast path will affect benchmark results (although only very
> little).
> > > We do not see any impact on the performance numbers. The reason for
> > > putting in the data plane was it covers wider use case in this
> L3fwd
> > > application. If the app were to be simple, the configuration could
> be
> > > done from the control plane. Unfortunately, the performance of
> L3fwd
> > > application matters.
> > >
> >
> > Let's proceed down that path for the sake of discussion... Then the
> fast path is
> > missing runtime verification that all preconditions for using
> remapping are
> > present at any time.
> Agree, few checks (ensuring that TX and RX buffers are from the same
> pool, ensuring tx_rs_thresh is same as RX rearm threshold) are missing.
> We will add these, it is possible to add these checks outside the
> packet processing loop.
> 
> >
> > > >
> > > > Please move it out of the fast path.
> >
> > BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to
> enable
> > the feature.
> >
> > And finally, this feature should be disabled by default, and only
> enabled by a
> > command line parameter or similar. Otherwise, future l3fwd NIC
> performance
> > reports will provide misleading performance results, if the feature
> is utilized.
> > Application developers, when comparing NIC performance results, don't
> care
> > about the performance for this unique use case; they care about the
> > performance for the generic use case.
> >
> I think this feature is similar to fast free feature
> (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) as you have mentioned in the other
> thread. It should be handled similar to how fast free feature is
> handled.

I agree with this comparison.

Quickly skimming l3fwd/main.c reveals that RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is used without checking preconditions, and thus might be buggy. E.g. what happens when the NICs support RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE, and l3fwd is run with the "per-port-pool" command line option? Obviously, the "direct rearm" patch should not be punished because of bugs in similar features ("fast free"). But it is not a valid reason to allow similar bugs. You mentioned above that precondition checking will be added, so pardon me for ranting a bit here. ;-)

Furthermore, if using l3fwd for NIC performance reports, I find the results misleading if application-specific optimizations are used without mentioning it in the report. This applies to both the "fast free" and "direct rearm" optimizations: they only work in specific application scenarios, and thus the l3fwd NIC performance test should be run without these optimizations, or at least mention that the report only covers these specific applications. Which is why I prefer that such optimizations must be explicitly enabled through a command line parameter, and not used in testing for official NIC performance reports.

Taking one step back, the real problem here is that an *example* application is used for NIC performance testing, and this is the main reason for my performance-related objections. I should probably object to using l3fwd for NIC performance testing instead.

I don't feel strongly about l3fwd, so I will not object to the l3fwd patch. Just providing some feedback. :-)
  
Konstantin Ananyev May 11, 2022, 10:33 p.m. UTC | #6
20/04/2022 09:16, Feifei Wang wrote:
> Enable direct rearm mode. The mapping is decided in the data plane based
> on the first packet received.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index bec22c44cd..38ffdf4636 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>   	unsigned lcore_id;
>   	uint64_t prev_tsc, diff_tsc, cur_tsc;
>   	int i, nb_rx;
> -	uint16_t portid;
> +	uint16_t portid, tx_portid;
>   	uint8_t queueid;
>   	struct lcore_conf *qconf;
>   	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>   
>   	const uint16_t n_rx_q = qconf->n_rx_queue;
>   	const uint16_t n_tx_p = qconf->n_tx_port;
> +	int direct_rearm_map[n_rx_q];
> +
>   	if (n_rx_q == 0) {
>   		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
>   		return 0;
> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>   
>   		portid = qconf->rx_queue_list[i].port_id;
>   		queueid = qconf->rx_queue_list[i].queue_id;
> +		direct_rearm_map[i] = 0;
>   		RTE_LOG(INFO, L3FWD,
>   			" -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>   			lcore_id, portid, queueid);
> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
>   			if (nb_rx == 0)
>   				continue;
>   
> +			/* Determine the direct rearm mapping based on the first
> +			 * packet received on the rx queue
> +			 */
> +			if (direct_rearm_map[i] == 0) {
> +				tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> +							portid);
> +				rte_eth_direct_rxrearm_map(portid, queueid,
> +								tx_portid, queueid);
> +				direct_rearm_map[i] = 1;
> +			}
> +

That just doesn't look right to me: why make the decision based on the
first packet?
What would happen if the second and all other packets have to be routed
to different ports?
In fact, this direct-rearm mode seems suitable only for hard-coded
one-to-one mapped forwarding (examples/l2fwd, testpmd).
For l3fwd it can be used safely only when we have one port in use.
Also, I think it should be selected at init time, and
it shouldn't be on by default.
To summarize, my opinion:
a special cmd-line parameter to enable it,
allowable only when we run l3fwd over one port.


>   #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>   			 || defined RTE_ARCH_PPC_64
>   			l3fwd_lpm_send_packets(nb_rx, pkts_burst,
  
Konstantin Ananyev May 27, 2022, 11:28 a.m. UTC | #7
25/05/2022 01:24, Honnappa Nagarahalli wrote:
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> 
> 20/04/2022 09:16, Feifei Wang wrote:
>>> Enable direct rearm mode. The mapping is decided in the data plane based
>>> on the first packet received.
>>>
>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>> ---
>>>   examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>> index bec22c44cd..38ffdf4636 100644
>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>       unsigned lcore_id;
>>>       uint64_t prev_tsc, diff_tsc, cur_tsc;
>>>       int i, nb_rx;
>>> -    uint16_t portid;
>>> +    uint16_t portid, tx_portid;
>>>       uint8_t queueid;
>>>       struct lcore_conf *qconf;
>>>       const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>>       const uint16_t n_rx_q = qconf->n_rx_queue;
>>>       const uint16_t n_tx_p = qconf->n_tx_port;
>>> +    int direct_rearm_map[n_rx_q];
>>> +
>>>       if (n_rx_q == 0) {
>>>           RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", 
>>> lcore_id);
>>>           return 0;
>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>           portid = qconf->rx_queue_list[i].port_id;
>>>           queueid = qconf->rx_queue_list[i].queue_id;
>>> +        direct_rearm_map[i] = 0;
>>>           RTE_LOG(INFO, L3FWD,
>>>               " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>>               lcore_id, portid, queueid);
>>> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
>>>               if (nb_rx == 0)
>>>                   continue;
>>> +            /* Determine the direct rearm mapping based on the first
>>> +             * packet received on the rx queue
>>> +             */
>>> +            if (direct_rearm_map[i] == 0) {
>>> +                tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>> +                            portid);
>>> +                rte_eth_direct_rxrearm_map(portid, queueid,
>>> +                                tx_portid, queueid);
>>> +                direct_rearm_map[i] = 1;
>>> +            }
>>> +
> 
>> That just doesn't look right to me: why to make decision based on the 
>> first packet?
> The TX queue depends on the incoming packet. So, this method covers more 
> scenarios
> than doing it in the control plane where the outgoing queue is not known.
> 
> 
>> What would happen if second and all other packets have to be routed
>> to different ports?
> This is an example application and it should be fine to make this 
> assumption.
> Moreover, it does not cause any problems if packets change in between.
> When
> the packets change back, the feature works again.
> 
>> In fact, this direct-rearm mode seems suitable only for hard-coded
>> one to one mapped forwarding (examples/l2fwd, testpmd).
>> For l3fwd it can be used safely only when we have one port in use.
> Can you elaborate more on the safety issue when more than one port is used?
> 
>> Also I think it should be selected at init-time and
>> it shouldn't be on by default.
>> To summarize, my opinion:
>> special cmd-line parameter to enable it.
> Can you please elaborate why a command line parameter is required? Other 
> similar
> features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are enabled without a 
> command
> line parameter. IMO, this is how it should be. Essentially we are 
> trying to measure
> how different PMDs perform, the ones that have implemented performance 
> improvement
> features would show better performance (i.e. the PMDs implementing the 
> features
> should not be penalized by asking for additional user input).

From my perspective, the main purpose of the l3fwd application is to demonstrate
DPDK's ability to do packet routing based on input packet contents.
Making guesses about packet contents is a change in expected behavior.
For some cases it might improve performance; for many others it
will most likely cause a performance drop.
I think that a performance drop as the default behavior
(running with the same parameters as before) should not be allowed.
Plus, you did not provide the ability to switch off that behavior,
if undesired.

About the comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
default enablement - I don't think it is correct.
Within the l3fwd app we can safely guarantee that all
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
in each TX queue all mbufs will belong to the same mempool and
their refcnt will always be equal to one.
Here you are making guesses about the contents of input packets,
without any guarantee that your guess will always be valid.

BTW, what's wrong with using l2fwd to demonstrate that feature?
Seems like a natural choice to me.

>> allowable only when we run l3fwd over one port.
> 
> 
>>>   #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>>                || defined RTE_ARCH_PPC_64
>>>               l3fwd_lpm_send_packets(nb_rx, pkts_burst,
  
Honnappa Nagarahalli May 31, 2022, 5:14 p.m. UTC | #8
<snip>
> 
> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
> > From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> >
> > 20/04/2022 09:16, Feifei Wang wrote:
> >>> Enable direct rearm mode. The mapping is decided in the data plane
> >>> based on the first packet received.
> >>>
> >>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>> ---
> >>>   examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
> >>>   1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> >>> index bec22c44cd..38ffdf4636 100644
> >>> --- a/examples/l3fwd/l3fwd_lpm.c
> >>> +++ b/examples/l3fwd/l3fwd_lpm.c
> >>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>       unsigned lcore_id;
> >>>       uint64_t prev_tsc, diff_tsc, cur_tsc;
> >>>       int i, nb_rx;
> >>> -    uint16_t portid;
> >>> +    uint16_t portid, tx_portid;
> >>>       uint8_t queueid;
> >>>       struct lcore_conf *qconf;
> >>>       const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> >>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>       const uint16_t n_rx_q = qconf->n_rx_queue;
> >>>       const uint16_t n_tx_p = qconf->n_tx_port;
> >>> +    int direct_rearm_map[n_rx_q];
> >>> +
> >>>       if (n_rx_q == 0) {
> >>>           RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
> >>> lcore_id);
> >>>           return 0;
> >>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>           portid = qconf->rx_queue_list[i].port_id;
> >>>           queueid = qconf->rx_queue_list[i].queue_id;
> >>> +        direct_rearm_map[i] = 0;
> >>>           RTE_LOG(INFO, L3FWD,
> >>>               " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
> >>>               lcore_id, portid, queueid); @@ -209,6 +212,17 @@
> >>> lpm_main_loop(__rte_unused void *dummy)
> >>>               if (nb_rx == 0)
> >>>                   continue;
> >>> +            /* Determine the direct rearm mapping based on the
> >>> +first
> >>> +             * packet received on the rx queue
> >>> +             */
> >>> +            if (direct_rearm_map[i] == 0) {
> >>> +                tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> >>> +                            portid);
> >>> +                rte_eth_direct_rxrearm_map(portid, queueid,
> >>> +                                tx_portid, queueid);
> >>> +                direct_rearm_map[i] = 1;
> >>> +            }
> >>> +
> >
> >> That just doesn't look right to me: why to make decision based on the
> >> first packet?
> > The TX queue depends on the incoming packet. So, this method covers
> > more scenarios than doing it in the control plane where the outgoing
> > queue is not known.
> >
> >
> >> What would happen if second and all other packets have to be routed
> >> to different ports?
> > This is an example application and it should be fine to make this
> > assumption.
> > Moreover, it does not cause any problems if packets change in between.
> > When
> > the packets change back, the feature works again.
> >
> >> In fact, this direct-rearm mode seems suitable only for hard-coded
> >> one to one mapped forwarding (examples/l2fwd, testpmd).
> >> For l3fwd it can be used safely only when we have one port in use.
> > Can you elaborate more on the safety issue when more than one port is
> used?
> >
> >> Also I think it should be selected at init-time and it shouldn't be
> >> on by default.
> >> To summarize, my opinion:
> >> special cmd-line parameter to enable it.
> > Can you please elaborate why a command line parameter is required?
> > Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
> > enabled without a command line parameter. IMO, this is how it should
> > ber. Essentially we are trying to measure how different PMDs perform,
> > the ones that have implemented performance improvement features
> would
> > show better performance (i.e. the PMDs implementing the features
> > should not be penalized by asking for additional user input).
> 
>  From my perspective, main purpose of l3fwd application is to demonstrate
> DPDK ability to do packet routing based on input packet contents.
> Making guesses about packet contents is a change in expected behavior.
> For some cases it might improve performance, for many others - will most
> likely cause performance drop.
> I think that performance drop as default behavior (running the same
> parameters as before) should not be allowed.
> Plus you did not provided ability to switch off that behavior, if undesired.
There is no drop in L3fwd performance due to this patch.

> 
> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
> enablement - I don't think it is correct.
> Within l3fwd app we can safely guarantee that all
> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
> in each TX queue all mbufs will belong to the same mempool and their refcnt
> will always equal to one.
> Here you are making guesses about contents of input packets, without any
> guarantee that you guess will always be valid.
This is not a guess. The code inspects the incoming traffic and configures the mapping accordingly, so it should be correct. Since it is a sample application, we do not expect the traffic to be complex. If it is complex, the performance will be the same as before or better.

> 
> BTW, what's wrong with using l2fwd to demonstrate that feature?
> Seems like a natural choice to me.
The performance of the l3fwd application in DPDK has become an industry standard, hence we need to showcase the performance in the l3fwd application.

> 
> >> allowable only when we run l3fwd over one port.
> >
> >
> >>>   #if defined RTE_ARCH_X86 || defined __ARM_NEON \
> >>>                || defined RTE_ARCH_PPC_64
> >>>               l3fwd_lpm_send_packets(nb_rx, pkts_burst,
  
Andrew Rybchenko June 3, 2022, 10:32 a.m. UTC | #9
On 5/31/22 20:14, Honnappa Nagarahalli wrote:
> <snip>
>>
>> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
>>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>>>
>>> 20/04/2022 09:16, Feifei Wang wrote:
>>>>> Enable direct rearm mode. The mapping is decided in the data plane
>>>>> based on the first packet received.
>>>>>
>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> ---
>>>>>    examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>>>> index bec22c44cd..38ffdf4636 100644
>>>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>        unsigned lcore_id;
>>>>>        uint64_t prev_tsc, diff_tsc, cur_tsc;
>>>>>        int i, nb_rx;
>>>>> -    uint16_t portid;
>>>>> +    uint16_t portid, tx_portid;
>>>>>        uint8_t queueid;
>>>>>        struct lcore_conf *qconf;
>>>>>        const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>        const uint16_t n_rx_q = qconf->n_rx_queue;
>>>>>        const uint16_t n_tx_p = qconf->n_tx_port;
>>>>> +    int direct_rearm_map[n_rx_q];
>>>>> +
>>>>>        if (n_rx_q == 0) {
>>>>>            RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
>>>>> lcore_id);
>>>>>            return 0;
>>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>            portid = qconf->rx_queue_list[i].port_id;
>>>>>            queueid = qconf->rx_queue_list[i].queue_id;
>>>>> +        direct_rearm_map[i] = 0;
>>>>>            RTE_LOG(INFO, L3FWD,
>>>>>                " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>>>>                lcore_id, portid, queueid); @@ -209,6 +212,17 @@
>>>>> lpm_main_loop(__rte_unused void *dummy)
>>>>>                if (nb_rx == 0)
>>>>>                    continue;
>>>>> +            /* Determine the direct rearm mapping based on the
>>>>> +first
>>>>> +             * packet received on the rx queue
>>>>> +             */
>>>>> +            if (direct_rearm_map[i] == 0) {
>>>>> +                tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>>>> +                            portid);
>>>>> +                rte_eth_direct_rxrearm_map(portid, queueid,
>>>>> +                                tx_portid, queueid);
>>>>> +                direct_rearm_map[i] = 1;
>>>>> +            }
>>>>> +
>>>
>>>> That just doesn't look right to me: why to make decision based on the
>>>> first packet?
>>> The TX queue depends on the incoming packet. So, this method covers
>>> more scenarios than doing it in the control plane where the outgoing
>>> queue is not known.
>>>
>>>
>>>> What would happen if second and all other packets have to be routed
>>>> to different ports?
>>> This is an example application and it should be fine to make this
>>> assumption.
>>> Moreover, it does not cause any problems if packets change in between.
>>> When
>>> the packets change back, the feature works again.
>>>
>>>> In fact, this direct-rearm mode seems suitable only for hard-coded
>>>> one to one mapped forwarding (examples/l2fwd, testpmd).
>>>> For l3fwd it can be used safely only when we have one port in use.
>>> Can you elaborate more on the safety issue when more than one port is
>> used?
>>>
>>>> Also I think it should be selected at init-time and it shouldn't be
>>>> on by default.
>>>> To summarize, my opinion:
>>>> special cmd-line parameter to enable it.
>>> Can you please elaborate why a command line parameter is required?
>>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
>>> enabled without a command line parameter. IMO, this is how it should
>>> be. Essentially we are trying to measure how different PMDs perform,
>>> the ones that have implemented performance improvement features would
>>> show better performance (i.e. the PMDs implementing the features
>>> should not be penalized by asking for additional user input).
>>
>>   From my perspective, the main purpose of the l3fwd application is to
>> demonstrate DPDK's ability to do packet routing based on input packet contents.
>> Making guesses about packet contents is a change in expected behavior.
>> For some cases it might improve performance; for many others it will most
>> likely cause a performance drop.
>> I think that a performance drop as the default behavior (running with the
>> same parameters as before) should not be allowed.
>> Plus you did not provide the ability to switch off that behavior, if undesired.
> There is no drop in L3fwd performance due to this patch.

It is a questionable statement. Sorry. The patch adds code on the
data path, so I'm almost confident that it is possible to find
a case where performance slightly drops. Maybe it is always
minor and acceptable.

>>
>> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
>> enablement - I don't think it is correct.
>> Within l3fwd app we can safely guarantee that all
>> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE prerequisites are met:
>> in each TX queue all mbufs will belong to the same mempool and their refcnt
>> will always be equal to one.
>> Here you are making guesses about contents of input packets, without any
>> guarantee that your guess will always be valid.
> This is not a guess. The code understands the incoming traffic and configures accordingly. So, it should be correct. Since it is a sample application, we do not expect the traffic to be complex. If it is complex, the performance will be the same as before or better.
> 
>>
>> BTW, what's wrong with using l2fwd to demonstrate that feature?
>> Seems like a natural choice to me.
> The performance of the L3fwd application in DPDK has become an industry standard, hence we need to showcase the performance in the L3fwd application.
> 
>>
>>>> allowable only when we run l3fwd over one port.
>>>
>>>
>>>>>    #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>>>>                 || defined RTE_ARCH_PPC_64
>>>>>                l3fwd_lpm_send_packets(nb_rx, pkts_burst,
>
  
Konstantin Ananyev June 6, 2022, 11:27 a.m. UTC | #10
31/05/2022 18:14, Honnappa Nagarahalli writes:
> <snip>
>>
>> 25/05/2022 01:24, Honnappa Nagarahalli writes:
>>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>>>
>>> 20/04/2022 09:16, Feifei Wang writes:
>>>>> Enable direct rearm mode. The mapping is decided in the data plane
>>>>> based on the first packet received.
>>>>>
>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> ---
>>>>>    examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>>>> index bec22c44cd..38ffdf4636 100644
>>>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>        unsigned lcore_id;
>>>>>        uint64_t prev_tsc, diff_tsc, cur_tsc;
>>>>>        int i, nb_rx;
>>>>> -    uint16_t portid;
>>>>> +    uint16_t portid, tx_portid;
>>>>>        uint8_t queueid;
>>>>>        struct lcore_conf *qconf;
>>>>>        const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>        const uint16_t n_rx_q = qconf->n_rx_queue;
>>>>>        const uint16_t n_tx_p = qconf->n_tx_port;
>>>>> +    int direct_rearm_map[n_rx_q];
>>>>> +
>>>>>        if (n_rx_q == 0) {
>>>>>            RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
>>>>> lcore_id);
>>>>>            return 0;
>>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>            portid = qconf->rx_queue_list[i].port_id;
>>>>>            queueid = qconf->rx_queue_list[i].queue_id;
>>>>> +        direct_rearm_map[i] = 0;
>>>>>            RTE_LOG(INFO, L3FWD,
>>>>>                " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>>>>                lcore_id, portid, queueid);
>>>>> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>>                if (nb_rx == 0)
>>>>>                    continue;
>>>>> +            /* Determine the direct rearm mapping based on the first
>>>>> +             * packet received on the rx queue
>>>>> +             */
>>>>> +            if (direct_rearm_map[i] == 0) {
>>>>> +                tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>>>> +                            portid);
>>>>> +                rte_eth_direct_rxrearm_map(portid, queueid,
>>>>> +                                tx_portid, queueid);
>>>>> +                direct_rearm_map[i] = 1;
>>>>> +            }
>>>>> +
>>>
>>>> That just doesn't look right to me: why make the decision based on the
>>>> first packet?
>>> The TX queue depends on the incoming packet. So, this method covers
>>> more scenarios than doing it in the control plane where the outgoing
>>> queue is not known.
>>>
>>>
>>>> What would happen if second and all other packets have to be routed
>>>> to different ports?
>>> This is an example application and it should be fine to make this
>>> assumption.
>>> Moreover, it does not cause any problems if packets change in between.
>>> When the packets change back, the feature works again.
>>>
>>>> In fact, this direct-rearm mode seems suitable only for hard-coded
>>>> one to one mapped forwarding (examples/l2fwd, testpmd).
>>>> For l3fwd it can be used safely only when we have one port in use.
>>> Can you elaborate more on the safety issue when more than one port is
>> used?
>>>
>>>> Also I think it should be selected at init-time and it shouldn't be
>>>> on by default.
>>>> To summarize, my opinion:
>>>> special cmd-line parameter to enable it.
>>> Can you please elaborate why a command line parameter is required?
>>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
>>> enabled without a command line parameter. IMO, this is how it should
>>> be. Essentially we are trying to measure how different PMDs perform,
>>> the ones that have implemented performance improvement features would
>>> show better performance (i.e. the PMDs implementing the features
>>> should not be penalized by asking for additional user input).
>>
>>   From my perspective, the main purpose of the l3fwd application is to
>> demonstrate DPDK's ability to do packet routing based on input packet contents.
>> Making guesses about packet contents is a change in expected behavior.
>> For some cases it might improve performance; for many others it will most
>> likely cause a performance drop.
>> I think that a performance drop as the default behavior (running with the
>> same parameters as before) should not be allowed.
>> Plus you did not provide the ability to switch off that behavior, if undesired.
> There is no drop in L3fwd performance due to this patch.

Hmm..
Are you saying that even when your guess is wrong, and you are constantly
hitting the slow path (check tx_queue first - failure, then allocate from
the mempool), you didn't observe any performance drop?
There is more work to do, and if the workload is cpu-bound,
my guess is that it should be noticeable.
Also, from previous experience, quite often even after
tiny changes in the rx/tx code path some slowdown was reported.
Usually that happened on some low-end ARM cpus (Marvell, NXP).


>>
>> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
>> enablement - I don't think it is correct.
>> Within l3fwd app we can safely guarantee that all
>> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE prerequisites are met:
>> in each TX queue all mbufs will belong to the same mempool and their refcnt
>> will always be equal to one.
>> Here you are making guesses about contents of input packets, without any
>> guarantee that your guess will always be valid.
> This is not a guess. The code understands the incoming traffic and configures accordingly. So, it should be correct. 


No, it is not.
Your code makes a guess about all incoming traffic
based on just one (first) packet.

> Since it is a sample application, we do not expect the traffic to be
> complex. If it is complex, the performance will be the same as before or
> better.

I am not talking about anything 'complex' here.
Doing dynamic routing based on packet contents is
a basic l3fwd functionality.

> 
>>
>> BTW, what's wrong with using l2fwd to demonstrate that feature?
>> Seems like a natural choice to me.
> The performance of the L3fwd application in DPDK has become an industry standard, hence we need to showcase the performance in the L3fwd application.
> 
>>
>>>> allowable only when we run l3fwd over one port.
>>>
>>>
>>>>>    #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>>>>                 || defined RTE_ARCH_PPC_64
>>>>>                l3fwd_lpm_send_packets(nb_rx, pkts_burst,
>
  
Honnappa Nagarahalli June 29, 2022, 9:25 p.m. UTC | #11
(apologies for being late on replying, catching up after vacation)

<snip>
> >>
> >> 25/05/2022 01:24, Honnappa Nagarahalli writes:
> >>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> >>>
> >>> 20/04/2022 09:16, Feifei Wang writes:
> >>>>> Enable direct rearm mode. The mapping is decided in the data plane
> >>>>> based on the first packet received.
> >>>>>
> >>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>>> ---
> >>>>>    examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
> >>>>>    1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> >>>>> index bec22c44cd..38ffdf4636 100644
> >>>>> --- a/examples/l3fwd/l3fwd_lpm.c
> >>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
> >>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>>        unsigned lcore_id;
> >>>>>        uint64_t prev_tsc, diff_tsc, cur_tsc;
> >>>>>        int i, nb_rx;
> >>>>> -    uint16_t portid;
> >>>>> +    uint16_t portid, tx_portid;
> >>>>>        uint8_t queueid;
> >>>>>        struct lcore_conf *qconf;
> >>>>>        const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> >>>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>>        const uint16_t n_rx_q = qconf->n_rx_queue;
> >>>>>        const uint16_t n_tx_p = qconf->n_tx_port;
> >>>>> +    int direct_rearm_map[n_rx_q];
> >>>>> +
> >>>>>        if (n_rx_q == 0) {
> >>>>>            RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
> >>>>> lcore_id);
> >>>>>            return 0;
> >>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>>            portid = qconf->rx_queue_list[i].port_id;
> >>>>>            queueid = qconf->rx_queue_list[i].queue_id;
> >>>>> +        direct_rearm_map[i] = 0;
> >>>>>            RTE_LOG(INFO, L3FWD,
> >>>>>                " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
> >>>>>                lcore_id, portid, queueid);
> >>>>> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>>                if (nb_rx == 0)
> >>>>>                    continue;
> >>>>> +            /* Determine the direct rearm mapping based on the first
> >>>>> +             * packet received on the rx queue
> >>>>> +             */
> >>>>> +            if (direct_rearm_map[i] == 0) {
> >>>>> +                tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> >>>>> +                            portid);
> >>>>> +                rte_eth_direct_rxrearm_map(portid, queueid,
> >>>>> +                                tx_portid, queueid);
> >>>>> +                direct_rearm_map[i] = 1;
> >>>>> +            }
> >>>>> +
> >>>
> >>>> That just doesn't look right to me: why make the decision based on
> >>>> the first packet?
> >>> The TX queue depends on the incoming packet. So, this method covers
> >>> more scenarios than doing it in the control plane where the outgoing
> >>> queue is not known.
> >>>
> >>>
> >>>> What would happen if second and all other packets have to be routed
> >>>> to different ports?
> >>> This is an example application and it should be fine to make this
> >>> assumption.
> >>> Moreover, it does not cause any problems if packets change in between.
> >>> When the packets change back, the feature works again.
> >>>
> >>>> In fact, this direct-rearm mode seems suitable only for hard-coded
> >>>> one to one mapped forwarding (examples/l2fwd, testpmd).
> >>>> For l3fwd it can be used safely only when we have one port in use.
> >>> Can you elaborate more on the safety issue when more than one port is
> >> used?
> >>>
> >>>> Also I think it should be selected at init-time and it shouldn't be
> >>>> on by default.
> >>>> To summarize, my opinion:
> >>>> special cmd-line parameter to enable it.
> >>> Can you please elaborate why a command line parameter is required?
> >>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
> >>> enabled without a command line parameter. IMO, this is how it should
> >>> be. Essentially we are trying to measure how different PMDs
> >>> perform, the ones that have implemented performance improvement
> >>> features would
> >>> show better performance (i.e. the PMDs implementing the features
> >>> should not be penalized by asking for additional user input).
> >>
> >>   From my perspective, the main purpose of the l3fwd application is to
> >> demonstrate DPDK's ability to do packet routing based on input packet
> >> contents.
> >> Making guesses about packet contents is a change in expected behavior.
> >> For some cases it might improve performance; for many others it will
> >> most likely cause a performance drop.
> >> I think that a performance drop as the default behavior (running with
> >> the same parameters as before) should not be allowed.
> >> Plus you did not provide the ability to switch off that behavior, if undesired.
> > There is no drop in L3fwd performance due to this patch.
> 
> Hmm..
> Are you saying that even when your guess is wrong, and you are constantly
> hitting the slow path (check tx_queue first - failure, then allocate from
> the mempool), you didn't observe any performance drop?
> There is more work to do, and if the workload is cpu-bound, my guess is that
> it should be noticeable.
Well, you do not have to take our word for it. The platform used for testing is mentioned in the cover letter. We would appreciate it if you could test on your platform and provide the results.

> Also, from previous experience, quite often even after tiny changes in rx/tx
> code-path some slowdown was reported.
> Usually that happened on some low-end ARM cpus (Marvell, NXP).
> 
> 
> >>
> >> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
> >> enablement - I don't think it is correct.
> >> Within l3fwd app we can safely guarantee that all
> >> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE prerequisites are met:
> >> in each TX queue all mbufs will belong to the same mempool and their
> >> refcnt will always be equal to one.
> >> Here you are making guesses about contents of input packets, without
> >> any guarantee that your guess will always be valid.
> > This is not a guess. The code understands the incoming traffic and
> > configures accordingly. So, it should be correct.
> 
> 
> No, it is not.
> Your code makes a guess about all incoming traffic based on just one (first)
> packet.
> 
> > Since it is a sample application, we do not expect the traffic to be complex.
> > If it is complex, the performance will be the same as before or better.
> 
> I am not talking about anything 'complex' here.
> Doing dynamic routing based on packet contents is a basic l3fwd functionality.
> 
> >
> >>
> >> BTW, what's wrong with using l2fwd to demonstrate that feature?
> >> Seems like a natural choice to me.
> > The performance of the L3fwd application in DPDK has become an industry
> > standard, hence we need to showcase the performance in the L3fwd application.
> >
> >>
> >>>> allowable only when we run l3fwd over one port.
> >>>
> >>>
> >>>>>    #if defined RTE_ARCH_X86 || defined __ARM_NEON \
> >>>>>                 || defined RTE_ARCH_PPC_64
> >>>>>                l3fwd_lpm_send_packets(nb_rx, pkts_burst,
> >
  

Patch

diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index bec22c44cd..38ffdf4636 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -147,7 +147,7 @@  lpm_main_loop(__rte_unused void *dummy)
 	unsigned lcore_id;
 	uint64_t prev_tsc, diff_tsc, cur_tsc;
 	int i, nb_rx;
-	uint16_t portid;
+	uint16_t portid, tx_portid;
 	uint8_t queueid;
 	struct lcore_conf *qconf;
 	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
@@ -158,6 +158,8 @@  lpm_main_loop(__rte_unused void *dummy)
 
 	const uint16_t n_rx_q = qconf->n_rx_queue;
 	const uint16_t n_tx_p = qconf->n_tx_port;
+	int direct_rearm_map[n_rx_q];
+
 	if (n_rx_q == 0) {
 		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
 		return 0;
@@ -169,6 +171,7 @@  lpm_main_loop(__rte_unused void *dummy)
 
 		portid = qconf->rx_queue_list[i].port_id;
 		queueid = qconf->rx_queue_list[i].queue_id;
+		direct_rearm_map[i] = 0;
 		RTE_LOG(INFO, L3FWD,
 			" -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
 			lcore_id, portid, queueid);
@@ -209,6 +212,17 @@  lpm_main_loop(__rte_unused void *dummy)
 			if (nb_rx == 0)
 				continue;
 
+			/* Determine the direct rearm mapping based on the first
+			 * packet received on the rx queue
+			 */
+			if (direct_rearm_map[i] == 0) {
+				tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
+							portid);
+				rte_eth_direct_rxrearm_map(portid, queueid,
+								tx_portid, queueid);
+				direct_rearm_map[i] = 1;
+			}
+
 #if defined RTE_ARCH_X86 || defined __ARM_NEON \
 			 || defined RTE_ARCH_PPC_64
 			l3fwd_lpm_send_packets(nb_rx, pkts_burst,