[v3] eventdev: avoid non-burst shortcut for variable-size bursts

Message ID 20230511082415.6720-1-mattias.ronnblom@ericsson.com (mailing list archive)
State Rejected, archived
Delegated to: Jerin Jacob
Headers
Series [v3] eventdev: avoid non-burst shortcut for variable-size bursts |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing fail Testing issues
ci/intel-Functional success Functional PASS

Commit Message

Mattias Rönnblom May 11, 2023, 8:24 a.m. UTC
  Use non-burst event enqueue and dequeue calls from burst enqueue and
dequeue only when the burst size is compile-time constant (and equal
to one).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

---

v3: Actually include the change v2 claimed to contain.
v2: Wrap builtin call in __extension__, to avoid compiler warnings if
    application is compiled with -pedantic. (Morten Brørup)
---
 lib/eventdev/rte_eventdev.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
  

Comments

Morten Brørup May 11, 2023, 9:01 a.m. UTC | #1
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 11 May 2023 10.24
> 
> Use non-burst event enqueue and dequeue calls from burst enqueue and
> dequeue only when the burst size is compile-time constant (and equal
> to one).
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Jerin Jacob May 12, 2023, 11:59 a.m. UTC | #2
On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Use non-burst event enqueue and dequeue calls from burst enqueue and
> dequeue only when the burst size is compile-time constant (and equal
> to one).
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>
> ---
>
> v3: Actually include the change v2 claimed to contain.
> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>     application is compiled with -pedantic. (Morten Brørup)
> ---
>  lib/eventdev/rte_eventdev.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index a90e23ac8b..a471caeb6d 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>          * Allow zero cost non burst mode routine invocation if application
>          * requests nb_events as const one
>          */
> -       if (nb_events == 1)
> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)

"Why" part is not clear from the commit message. Is this to avoid
nb_events read if it is built-in const.
If so, check should be following. Right?

if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
|| nb_events  == 1)

At least, It was my original intention in the code.



>                 return (fp_ops->enqueue)(port, ev);
>         else
>                 return fn(port, ev, nb_events);
> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>          * Allow zero cost non burst mode routine invocation if application
>          * requests nb_events as const one
>          */
> -       if (nb_events == 1)
> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>                 return (fp_ops->dequeue)(port, ev, timeout_ticks);
>         else
>                 return (fp_ops->dequeue_burst)(port, ev, nb_events,
> --
> 2.34.1
>
  
Mattias Rönnblom May 12, 2023, 1:15 p.m. UTC | #3
On 2023-05-12 13:59, Jerin Jacob wrote:
> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>> dequeue only when the burst size is compile-time constant (and equal
>> to one).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>
>> ---
>>
>> v3: Actually include the change v2 claimed to contain.
>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>      application is compiled with -pedantic. (Morten Brørup)
>> ---
>>   lib/eventdev/rte_eventdev.h | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>> index a90e23ac8b..a471caeb6d 100644
>> --- a/lib/eventdev/rte_eventdev.h
>> +++ b/lib/eventdev/rte_eventdev.h
>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>>           * Allow zero cost non burst mode routine invocation if application
>>           * requests nb_events as const one
>>           */
>> -       if (nb_events == 1)
>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> 
> "Why" part is not clear from the commit message. Is this to avoid
> nb_events read if it is built-in const.

The __builtin_constant_p() is introduced to avoid having the compiler 
generate a conditional branch and two different code paths in case 
nb_elem is a run-time variable.

In particular, this matters if nb_elems is run-time variable and varies 
between 1 and some larger value.

I should have mention this in the commit message.

A very slight performance improvement. It also makes the code better 
match the comment, imo. Zero cost for const one enqueues, but no impact 
non-compile-time-constant-length enqueues.

Feel free to ignore.

> If so, check should be following. Right?
> 
> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> || nb_events  == 1)
> 
> At least, It was my original intention in the code.
> 
> 
> 
>>                  return (fp_ops->enqueue)(port, ev);
>>          else
>>                  return fn(port, ev, nb_events);
>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>>           * Allow zero cost non burst mode routine invocation if application
>>           * requests nb_events as const one
>>           */
>> -       if (nb_events == 1)
>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>                  return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>          else
>>                  return (fp_ops->dequeue_burst)(port, ev, nb_events,
>> --
>> 2.34.1
>>
  
Morten Brørup May 12, 2023, 1:56 p.m. UTC | #4
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Friday, 12 May 2023 15.15
> 
> On 2023-05-12 13:59, Jerin Jacob wrote:
> > On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >> dequeue only when the burst size is compile-time constant (and equal
> >> to one).
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>
> >> ---
> >>
> >> v3: Actually include the change v2 claimed to contain.
> >> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>      application is compiled with -pedantic. (Morten Brørup)
> >> ---
> >>   lib/eventdev/rte_eventdev.h | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >> index a90e23ac8b..a471caeb6d 100644
> >> --- a/lib/eventdev/rte_eventdev.h
> >> +++ b/lib/eventdev/rte_eventdev.h
> >> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t
> port_id,
> >>           * Allow zero cost non burst mode routine invocation if
> application
> >>           * requests nb_events as const one
> >>           */
> >> -       if (nb_events == 1)
> >> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events ==
> 1)
> >
> > "Why" part is not clear from the commit message. Is this to avoid
> > nb_events read if it is built-in const.
> 
> The __builtin_constant_p() is introduced to avoid having the compiler
> generate a conditional branch and two different code paths in case
> nb_elem is a run-time variable.
> 
> In particular, this matters if nb_elems is run-time variable and varies
> between 1 and some larger value.
> 
> I should have mention this in the commit message.
> 
> A very slight performance improvement. It also makes the code better
> match the comment, imo. Zero cost for const one enqueues, but no impact
> non-compile-time-constant-length enqueues.
> 
> Feel free to ignore.
> 
> > If so, check should be following. Right?
> >
> > if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> > || nb_events  == 1)

@Mattias: You missed the second part of this comparison, also catching nb_events == 1 with non-constant nb_events.

@Jerin: Such a change has no effect, compared to the original code.

> >
> > At least, It was my original intention in the code.

@Jerin: Mattias implemented exactly what the comment says.

Perhaps only the comment should be updated, not the code.

Is nb_events likely to be non-constant 1, and are there benefits to calling either of the non-burst functions in those cases, vs. the branch cost of this comparison (which Mattias' patch gets rid of)?

> >
> >
> >
> >>                  return (fp_ops->enqueue)(port, ev);
> >>          else
> >>                  return fn(port, ev, nb_events);
> >> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t
> port_id, struct rte_event ev[],
> >>           * Allow zero cost non burst mode routine invocation if
> application
> >>           * requests nb_events as const one
> >>           */
> >> -       if (nb_events == 1)
> >> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events ==
> 1)
> >>                  return (fp_ops->dequeue)(port, ev, timeout_ticks);
> >>          else
> >>                  return (fp_ops->dequeue_burst)(port, ev, nb_events,
> >> --
> >> 2.34.1
> >>
  
Mattias Rönnblom May 12, 2023, 2:52 p.m. UTC | #5
On 2023-05-12 15:56, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Friday, 12 May 2023 15.15
>>
>> On 2023-05-12 13:59, Jerin Jacob wrote:
>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>>>> dequeue only when the burst size is compile-time constant (and equal
>>>> to one).
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>
>>>> ---
>>>>
>>>> v3: Actually include the change v2 claimed to contain.
>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>>>       application is compiled with -pedantic. (Morten Brørup)
>>>> ---
>>>>    lib/eventdev/rte_eventdev.h | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>>>> index a90e23ac8b..a471caeb6d 100644
>>>> --- a/lib/eventdev/rte_eventdev.h
>>>> +++ b/lib/eventdev/rte_eventdev.h
>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t
>> port_id,
>>>>            * Allow zero cost non burst mode routine invocation if
>> application
>>>>            * requests nb_events as const one
>>>>            */
>>>> -       if (nb_events == 1)
>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events ==
>> 1)
>>>
>>> "Why" part is not clear from the commit message. Is this to avoid
>>> nb_events read if it is built-in const.
>>
>> The __builtin_constant_p() is introduced to avoid having the compiler
>> generate a conditional branch and two different code paths in case
>> nb_elem is a run-time variable.
>>
>> In particular, this matters if nb_elems is run-time variable and varies
>> between 1 and some larger value.
>>
>> I should have mention this in the commit message.
>>
>> A very slight performance improvement. It also makes the code better
>> match the comment, imo. Zero cost for const one enqueues, but no impact
>> non-compile-time-constant-length enqueues.
>>
>> Feel free to ignore.
>>
>>> If so, check should be following. Right?
>>>
>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
>>> || nb_events  == 1)
> 
> @Mattias: You missed the second part of this comparison, also catching nb_events == 1 with non-constant nb_events.
> 

I didn't comment on that code snippet since it was based on a 
misconception of the intention of my patch.

> @Jerin: Such a change has no effect, compared to the original code.
> 
>>>
>>> At least, It was my original intention in the code.
> 
> @Jerin: Mattias implemented exactly what the comment says.
> 
> Perhaps only the comment should be updated, not the code.
> 
> Is nb_events likely to be non-constant 1, and are there benefits to calling either of the non-burst functions in those cases, vs. the branch cost of this comparison (which Mattias' patch gets rid of)?
> 

I think the main worry would be the cost of branch mispredictions in
case of alternating enqueue sizes (between 1 and some other size).

If there is a performance upside to calling single-event enqueue in a 
scenario where all enqueues are *run-time variable* and 1 (which I find 
unlikely, but well inside the realms of the possibility), the next 
question would be: OK, but how about for two events? Three? Four. Etc.

>>>
>>>
>>>
>>>>                   return (fp_ops->enqueue)(port, ev);
>>>>           else
>>>>                   return fn(port, ev, nb_events);
>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t
>> port_id, struct rte_event ev[],
>>>>            * Allow zero cost non burst mode routine invocation if
>> application
>>>>            * requests nb_events as const one
>>>>            */
>>>> -       if (nb_events == 1)
>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events ==
>> 1)
>>>>                   return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>>>           else
>>>>                   return (fp_ops->dequeue_burst)(port, ev, nb_events,
>>>> --
>>>> 2.34.1
>>>>
>
  
Jerin Jacob May 15, 2023, 12:33 p.m. UTC | #6
On Fri, May 12, 2023 at 7:26 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Friday, 12 May 2023 15.15
> >
> > On 2023-05-12 13:59, Jerin Jacob wrote:
> > > On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> > > <mattias.ronnblom@ericsson.com> wrote:
> > >>
> > >> Use non-burst event enqueue and dequeue calls from burst enqueue and
> > >> dequeue only when the burst size is compile-time constant (and equal
> > >> to one).
> > >>
> > >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > >>
> > >> ---
> > >>
> > >> v3: Actually include the change v2 claimed to contain.
> > >> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> > >>      application is compiled with -pedantic. (Morten Brørup)
> > >> ---
> > >>   lib/eventdev/rte_eventdev.h | 4 ++--
> > >>   1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > >> index a90e23ac8b..a471caeb6d 100644
> > >> --- a/lib/eventdev/rte_eventdev.h
> > >> +++ b/lib/eventdev/rte_eventdev.h
> > >> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t
> > port_id,
> > >>           * Allow zero cost non burst mode routine invocation if
> > application
> > >>           * requests nb_events as const one
> > >>           */
> > >> -       if (nb_events == 1)
> > >> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events ==
> > 1)
> > >
> > > "Why" part is not clear from the commit message. Is this to avoid
> > > nb_events read if it is built-in const.
> >
> > The __builtin_constant_p() is introduced to avoid having the compiler
> > generate a conditional branch and two different code paths in case
> > nb_elem is a run-time variable.
> >
> > In particular, this matters if nb_elems is run-time variable and varies
> > between 1 and some larger value.
> >
> > I should have mention this in the commit message.
> >
> > A very slight performance improvement. It also makes the code better
> > match the comment, imo. Zero cost for const one enqueues, but no impact
> > non-compile-time-constant-length enqueues.
> >
> > Feel free to ignore.
> >
> > > If so, check should be following. Right?
> > >
> > > if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> > > || nb_events  == 1)
>
> @Mattias: You missed the second part of this comparison, also catching nb_events == 1 with non-constant nb_events.
>
> @Jerin: Such a change has no effect, compared to the original code.


Yes. That was the reason for I was skipping __builtin_constant_p in
the initial version.
  
Jerin Jacob May 15, 2023, 12:38 p.m. UTC | #7
On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> On 2023-05-12 13:59, Jerin Jacob wrote:
> > On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >> dequeue only when the burst size is compile-time constant (and equal
> >> to one).
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>
> >> ---
> >>
> >> v3: Actually include the change v2 claimed to contain.
> >> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>      application is compiled with -pedantic. (Morten Brørup)
> >> ---
> >>   lib/eventdev/rte_eventdev.h | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >> index a90e23ac8b..a471caeb6d 100644
> >> --- a/lib/eventdev/rte_eventdev.h
> >> +++ b/lib/eventdev/rte_eventdev.h
> >> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
> >>           * Allow zero cost non burst mode routine invocation if application
> >>           * requests nb_events as const one
> >>           */
> >> -       if (nb_events == 1)
> >> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >
> > "Why" part is not clear from the commit message. Is this to avoid
> > nb_events read if it is built-in const.
>
> The __builtin_constant_p() is introduced to avoid having the compiler
> generate a conditional branch and two different code paths in case
> nb_elem is a run-time variable.
>
> In particular, this matters if nb_elems is run-time variable and varies
> between 1 and some larger value.
>
> I should have mention this in the commit message.
>
> A very slight performance improvement. It also makes the code better
> match the comment, imo. Zero cost for const one enqueues, but no impact
> non-compile-time-constant-length enqueues.
>
> Feel free to ignore.


I did some performance comparison of the patch.
A low-end ARM machines shows 0.7%  drop with single event case. No
regression see with high-end ARM cores with single event case.

IMO, optimizing the check for burst mode(the new patch) may not show
any real improvement as the cost is divided by number of event.
Whereas optimizing the check for single event case(The current code)
shows better performance with single event case and no regression
with burst mode as cost is divided by number of events.
If you agree, then we can skip this patch.


>
> > If so, check should be following. Right?
> >
> > if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> > || nb_events  == 1)
> >
> > At least, It was my original intention in the code.
> >
> >
> >
> >>                  return (fp_ops->enqueue)(port, ev);
> >>          else
> >>                  return fn(port, ev, nb_events);
> >> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
> >>           * Allow zero cost non burst mode routine invocation if application
> >>           * requests nb_events as const one
> >>           */
> >> -       if (nb_events == 1)
> >> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>                  return (fp_ops->dequeue)(port, ev, timeout_ticks);
> >>          else
> >>                  return (fp_ops->dequeue_burst)(port, ev, nb_events,
> >> --
> >> 2.34.1
> >>
>
  
Mattias Rönnblom May 15, 2023, 8:52 p.m. UTC | #8
On 2023-05-15 14:38, Jerin Jacob wrote:
> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> On 2023-05-12 13:59, Jerin Jacob wrote:
>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>>>> dequeue only when the burst size is compile-time constant (and equal
>>>> to one).
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>
>>>> ---
>>>>
>>>> v3: Actually include the change v2 claimed to contain.
>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>>>       application is compiled with -pedantic. (Morten Brørup)
>>>> ---
>>>>    lib/eventdev/rte_eventdev.h | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>>>> index a90e23ac8b..a471caeb6d 100644
>>>> --- a/lib/eventdev/rte_eventdev.h
>>>> +++ b/lib/eventdev/rte_eventdev.h
>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>>>>            * Allow zero cost non burst mode routine invocation if application
>>>>            * requests nb_events as const one
>>>>            */
>>>> -       if (nb_events == 1)
>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>
>>> "Why" part is not clear from the commit message. Is this to avoid
>>> nb_events read if it is built-in const.
>>
>> The __builtin_constant_p() is introduced to avoid having the compiler
>> generate a conditional branch and two different code paths in case
>> nb_elem is a run-time variable.
>>
>> In particular, this matters if nb_elems is run-time variable and varies
>> between 1 and some larger value.
>>
>> I should have mention this in the commit message.
>>
>> A very slight performance improvement. It also makes the code better
>> match the comment, imo. Zero cost for const one enqueues, but no impact
>> non-compile-time-constant-length enqueues.
>>
>> Feel free to ignore.
> 
> 
> I did some performance comparison of the patch.
> A low-end ARM machines shows 0.7%  drop with single event case. No
> regression see with high-end ARM cores with single event case.
> 
> IMO, optimizing the check for burst mode(the new patch) may not show
> any real improvement as the cost is divided by number of event.
> Whereas optimizing the check for single event case(The current code)
> shows better performance with single event case and no regression
> with burst mode as cost is divided by number of events.

I ran some tests on an AMD Zen 3 with DSW.

In the below tests the enqueue burst size is not compile-time constant.

Enqueue burst size      Performance improvement
Run-time constant 1     ~5%
Run-time constant 2     ~0%
Run-time variable 1-2   ~9%
Run-time variable 1-16  ~0%

The run-time variable enqueue sizes randomly (uniformly) distributed in 
the specified range.

The first result may come as a surprise. The benchmark is using 
RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type 
in most apps). The single-event enqueue function only exists in a 
generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent). 
I suspect that is the reason for the performance improvement.

This effect is large-enough to make it somewhat beneficial (+~1%) to use 
run-time variable single-event enqueue compared to keeping the burst 
size compile-time constant.

The performance gain is counted toward both enqueue and dequeue costs 
(+benchmark app overhead), so an under-estimation if see this as an 
enqueue performance improvement.

> If you agree, then we can skip this patch.
>

I have no strong opinion if this should be included or not.

It was up to me, I would drop the single-enqueue special case handling 
altogether in the next ABI update.

> 
>>
>>> If so, check should be following. Right?
>>>
>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
>>> || nb_events  == 1)
>>>
>>> At least, It was my original intention in the code.
>>>
>>>
>>>
>>>>                   return (fp_ops->enqueue)(port, ev);
>>>>           else
>>>>                   return fn(port, ev, nb_events);
>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>>>>            * Allow zero cost non burst mode routine invocation if application
>>>>            * requests nb_events as const one
>>>>            */
>>>> -       if (nb_events == 1)
>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>                   return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>>>           else
>>>>                   return (fp_ops->dequeue_burst)(port, ev, nb_events,
>>>> --
>>>> 2.34.1
>>>>
>>
  
Jerin Jacob May 16, 2023, 1:08 p.m. UTC | #9
On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2023-05-15 14:38, Jerin Jacob wrote:
> > On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> On 2023-05-12 13:59, Jerin Jacob wrote:
> >>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> >>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>
> >>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >>>> dequeue only when the burst size is compile-time constant (and equal
> >>>> to one).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>
> >>>> ---
> >>>>
> >>>> v3: Actually include the change v2 claimed to contain.
> >>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>>>       application is compiled with -pedantic. (Morten Brørup)
> >>>> ---
> >>>>    lib/eventdev/rte_eventdev.h | 4 ++--
> >>>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >>>> index a90e23ac8b..a471caeb6d 100644
> >>>> --- a/lib/eventdev/rte_eventdev.h
> >>>> +++ b/lib/eventdev/rte_eventdev.h
> >>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
> >>>>            * Allow zero cost non burst mode routine invocation if application
> >>>>            * requests nb_events as const one
> >>>>            */
> >>>> -       if (nb_events == 1)
> >>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>>
> >>> "Why" part is not clear from the commit message. Is this to avoid
> >>> nb_events read if it is built-in const.
> >>
> >> The __builtin_constant_p() is introduced to avoid having the compiler
> >> generate a conditional branch and two different code paths in case
> >> nb_elem is a run-time variable.
> >>
> >> In particular, this matters if nb_elems is run-time variable and varies
> >> between 1 and some larger value.
> >>
> >> I should have mention this in the commit message.
> >>
> >> A very slight performance improvement. It also makes the code better
> >> match the comment, imo. Zero cost for const one enqueues, but no impact
> >> non-compile-time-constant-length enqueues.
> >>
> >> Feel free to ignore.
> >
> >
> > I did some performance comparison of the patch.
> > A low-end ARM machines shows 0.7%  drop with single event case. No
> > regression see with high-end ARM cores with single event case.
> >
> > IMO, optimizing the check for burst mode(the new patch) may not show
> > any real improvement as the cost is divided by number of event.
> > Whereas optimizing the check for single event case(The current code)
> > shows better performance with single event case and no regression
> > with burst mode as cost is divided by number of events.
>
> I ran some tests on an AMD Zen 3 with DSW.
> In the below tests the enqueue burst size is not compile-time constant.
>
> Enqueue burst size      Performance improvement
> Run-time constant 1     ~5%
> Run-time constant 2     ~0%
> Run-time variable 1-2   ~9%
> Run-time variable 1-16  ~0%
>
> The run-time variable enqueue sizes randomly (uniformly) distributed in
> the specified range.
>
> The first result may come as a surprise. The benchmark is using
> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
> in most apps). The single-event enqueue function only exists in a
> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
> I suspect that is the reason for the performance improvement.
>
> This effect is large-enough to make it somewhat beneficial (+~1%) to use
> run-time variable single-event enqueue compared to keeping the burst
> size compile-time constant.

# Interesting, Could you share your testeventdev command to test it.
# By having quick glance on DSW code, following change can be added(or
 similar cases).
Not sure such change in DSW driver is making a difference or nor?


diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
index e84b65d99f..455470997b 100644
--- a/drivers/event/dsw/dsw_event.c
+++ b/drivers/event/dsw/dsw_event.c
@@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
*dsw, struct dsw_port *source_port)
 uint16_t
 dsw_event_enqueue(void *port, const struct rte_event *ev)
 {
-       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
+       return dsw_event_enqueue_burst(port, ev, 1);
 }

 static __rte_always_inline uint16_t
@@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
*source_port,
        return (num_non_release + num_release);
 }

-uint16_t
+inline uint16_t
 dsw_event_enqueue_burst(void *port, const struct rte_event events[],
                        uint16_t events_len)
 {

# I am testing with command like this "/build/app/dpdk-test-eventdev
-l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
--stlist p --nb_pkts=10000000000"

>
> The performance gain is counted toward both enqueue and dequeue costs
> (+benchmark app overhead), so an under-estimation if see this as an
> enqueue performance improvement.
>
> > If you agree, then we can skip this patch.
> >
>
> I have no strong opinion if this should be included or not.
>
> It was up to me, I would drop the single-enqueue special case handling
> altogether in the next ABI update.

That's a reasonable path. If we are willing to push a patch, we can
test it and give feedback.
Or in our spare time, We can do that as well.

>
> >
> >>
> >>> If so, check should be following. Right?
> >>>
> >>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
> >>> || nb_events  == 1)
> >>>
> >>> At least, It was my original intention in the code.
> >>>
> >>>
> >>>
> >>>>                   return (fp_ops->enqueue)(port, ev);
> >>>>           else
> >>>>                   return fn(port, ev, nb_events);
> >>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
> >>>>            * Allow zero cost non burst mode routine invocation if application
> >>>>            * requests nb_events as const one
> >>>>            */
> >>>> -       if (nb_events == 1)
> >>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>>>                   return (fp_ops->dequeue)(port, ev, timeout_ticks);
> >>>>           else
> >>>>                   return (fp_ops->dequeue_burst)(port, ev, nb_events,
> >>>> --
> >>>> 2.34.1
> >>>>
> >>
  
Mattias Rönnblom May 17, 2023, 7:16 a.m. UTC | #10
On 2023-05-16 15:08, Jerin Jacob wrote:
> On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2023-05-15 14:38, Jerin Jacob wrote:
>>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> On 2023-05-12 13:59, Jerin Jacob wrote:
>>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
>>>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>>>
>>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
>>>>>> dequeue only when the burst size is compile-time constant (and equal
>>>>>> to one).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> v3: Actually include the change v2 claimed to contain.
>>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
>>>>>>        application is compiled with -pedantic. (Morten Brørup)
>>>>>> ---
>>>>>>     lib/eventdev/rte_eventdev.h | 4 ++--
>>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
>>>>>> index a90e23ac8b..a471caeb6d 100644
>>>>>> --- a/lib/eventdev/rte_eventdev.h
>>>>>> +++ b/lib/eventdev/rte_eventdev.h
>>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
>>>>>>             * Allow zero cost non burst mode routine invocation if application
>>>>>>             * requests nb_events as const one
>>>>>>             */
>>>>>> -       if (nb_events == 1)
>>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>
>>>>> "Why" part is not clear from the commit message. Is this to avoid
>>>>> nb_events read if it is built-in const.
>>>>
>>>> The __builtin_constant_p() is introduced to avoid having the compiler
>>>> generate a conditional branch and two different code paths in case
>>>> nb_elem is a run-time variable.
>>>>
>>>> In particular, this matters if nb_elems is run-time variable and varies
>>>> between 1 and some larger value.
>>>>
>>>> I should have mention this in the commit message.
>>>>
>>>> A very slight performance improvement. It also makes the code better
>>>> match the comment, imo. Zero cost for const one enqueues, but no impact
>>>> non-compile-time-constant-length enqueues.
>>>>
>>>> Feel free to ignore.
>>>
>>>
>>> I did some performance comparison of the patch.
>>> A low-end ARM machines shows 0.7%  drop with single event case. No
>>> regression see with high-end ARM cores with single event case.
>>>
>>> IMO, optimizing the check for burst mode(the new patch) may not show
>>> any real improvement as the cost is divided by number of event.
>>> Whereas optimizing the check for single event case(The current code)
>>> shows better performance with single event case and no regression
>>> with burst mode as cost is divided by number of events.
>>
>> I ran some tests on an AMD Zen 3 with DSW.
>> In the below tests the enqueue burst size is not compile-time constant.
>>
>> Enqueue burst size      Performance improvement
>> Run-time constant 1     ~5%
>> Run-time constant 2     ~0%
>> Run-time variable 1-2   ~9%
>> Run-time variable 1-16  ~0%
>>
>> The run-time variable enqueue sizes randomly (uniformly) distributed in
>> the specified range.
>>
>> The first result may come as a surprise. The benchmark is using
>> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
>> in most apps). The single-event enqueue function only exists in a
>> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
>> I suspect that is the reason for the performance improvement.
>>
>> This effect is large-enough to make it somewhat beneficial (+~1%) to use
>> run-time variable single-event enqueue compared to keeping the burst
>> size compile-time constant.
> 
> # Interesting, Could you share your testeventdev command to test it.

I'm using a proprietary benchmark to evaluate the effect of these 
changes. There's certainly nothing secret about that program, and also 
nothing very DSW-specific either. I hope to at some point both extend 
DPDK eventdev tests to include DSW, and also to contribute 
benchmarks/characteristics tests (perf unit tests or as a separate 
program), if there seems to be a value in this.

> # By having quick glance on DSW code, following change can be added(or
>   similar cases).
> Not sure such change in DSW driver is making a difference or nor?
> 
> 
> diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> index e84b65d99f..455470997b 100644
> --- a/drivers/event/dsw/dsw_event.c
> +++ b/drivers/event/dsw/dsw_event.c
> @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
> *dsw, struct dsw_port *source_port)
>   uint16_t
>   dsw_event_enqueue(void *port, const struct rte_event *ev)
>   {
> -       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> +       return dsw_event_enqueue_burst(port, ev, 1);

Good point.

Historical note: I think that comparison is old cruft borne out of a 
misconception, that the single-event enqueue could be called directly 
from application code, combined with the fact that producer-only ports 
needed some way to "maintain" a port, prior to the introduction of 
rte_event_maintain().

>   }
> 
>   static __rte_always_inline uint16_t
> @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
> *source_port,
>          return (num_non_release + num_release);
>   }
> 
> -uint16_t
> +inline uint16_t

 From what it seems, this does not have the desired effect, at least not 
on GCC 11.3 (w/ the default DPDK compiler configuration).

I reached this conclusion when I noticed that if I reshuffle the code so 
to force (not hint) the inlining of the burst (and generic burst) 
enqueue function into dsw_event_enqueue(), your change performs better.

>   dsw_event_enqueue_burst(void *port, const struct rte_event events[],
>                          uint16_t events_len)
>   {
> 
> # I am testing with command like this "/build/app/dpdk-test-eventdev
> -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> --stlist p --nb_pkts=10000000000"
> 


I re-ran the compile-time variable, run-time constant enqueue size of 1, 
and I got the following:

Jerin's change: +4%
Jerin's change + ensure inlining: +6%
RFC v3: +7%

(Here I use a more different setup that produces more deterministic 
results, hence the different numbers compared to the previous runs. They 
were using a pipeline spread over two chiplets, and these runs are using 
only a single chiplet.)

It seems like with your suggested changes you eliminate most of the 
single-enqueue-special case performance degradation (for DSW), but not 
all of it. The remaining degradation is very small (for the above case, 
larger for small by run-time variable enqueue sizes), but it's a little 
sad that a supposedly performance-enhancing special case (that drives 
complexity in the code, although not much) actually degrades performance.

>>
>> The performance gain is counted toward both enqueue and dequeue costs
>> (+benchmark app overhead), so an under-estimation if see this as an
>> enqueue performance improvement.
>>
>>> If you agree, then we can skip this patch.
>>>
>>
>> I have no strong opinion if this should be included or not.
>>
>> It was up to me, I would drop the single-enqueue special case handling
>> altogether in the next ABI update.
> 
> That's a reasonable path. If we are willing to push a patch, we can
> test it and give feedback.
> Or in our spare time, We can do that as well.
> 

Sure, I'll give it a try.

The next release is an ABI-breaking one?

>>
>>>
>>>>
>>>>> If so, check should be following. Right?
>>>>>
>>>>> if (__extension__((__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>> || nb_events  == 1)
>>>>>
>>>>> At least, It was my original intention in the code.
>>>>>
>>>>>
>>>>>
>>>>>>                    return (fp_ops->enqueue)(port, ev);
>>>>>>            else
>>>>>>                    return fn(port, ev, nb_events);
>>>>>> @@ -2200,7 +2200,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>>>>>>             * Allow zero cost non burst mode routine invocation if application
>>>>>>             * requests nb_events as const one
>>>>>>             */
>>>>>> -       if (nb_events == 1)
>>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
>>>>>>                    return (fp_ops->dequeue)(port, ev, timeout_ticks);
>>>>>>            else
>>>>>>                    return (fp_ops->dequeue_burst)(port, ev, nb_events,
>>>>>> --
>>>>>> 2.34.1
>>>>>>
>>>>
  
Jerin Jacob May 17, 2023, 12:28 p.m. UTC | #11
On Wed, May 17, 2023 at 12:46 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2023-05-16 15:08, Jerin Jacob wrote:
> > On Tue, May 16, 2023 at 2:22 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>
> >> On 2023-05-15 14:38, Jerin Jacob wrote:
> >>> On Fri, May 12, 2023 at 6:45 PM Mattias Rönnblom
> >>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>
> >>>> On 2023-05-12 13:59, Jerin Jacob wrote:
> >>>>> On Thu, May 11, 2023 at 2:00 PM Mattias Rönnblom
> >>>>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>>>
> >>>>>> Use non-burst event enqueue and dequeue calls from burst enqueue and
> >>>>>> dequeue only when the burst size is compile-time constant (and equal
> >>>>>> to one).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> v3: Actually include the change v2 claimed to contain.
> >>>>>> v2: Wrap builtin call in __extension__, to avoid compiler warnings if
> >>>>>>        application is compiled with -pedantic. (Morten Brørup)
> >>>>>> ---
> >>>>>>     lib/eventdev/rte_eventdev.h | 4 ++--
> >>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> >>>>>> index a90e23ac8b..a471caeb6d 100644
> >>>>>> --- a/lib/eventdev/rte_eventdev.h
> >>>>>> +++ b/lib/eventdev/rte_eventdev.h
> >>>>>> @@ -1944,7 +1944,7 @@ __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
> >>>>>>             * Allow zero cost non burst mode routine invocation if application
> >>>>>>             * requests nb_events as const one
> >>>>>>             */
> >>>>>> -       if (nb_events == 1)
> >>>>>> +       if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
> >>>>>
> >>>>> "Why" part is not clear from the commit message. Is this to avoid
> >>>>> nb_events read if it is built-in const.
> >>>>
> >>>> The __builtin_constant_p() is introduced to avoid having the compiler
> >>>> generate a conditional branch and two different code paths in case
> >>>> nb_elem is a run-time variable.
> >>>>
> >>>> In particular, this matters if nb_elems is run-time variable and varies
> >>>> between 1 and some larger value.
> >>>>
> >>>> I should have mention this in the commit message.
> >>>>
> >>>> A very slight performance improvement. It also makes the code better
> >>>> match the comment, imo. Zero cost for const one enqueues, but no impact
> >>>> non-compile-time-constant-length enqueues.
> >>>>
> >>>> Feel free to ignore.
> >>>
> >>>
> >>> I did some performance comparison of the patch.
> >>> A low-end ARM machines shows 0.7%  drop with single event case. No
> >>> regression see with high-end ARM cores with single event case.
> >>>
> >>> IMO, optimizing the check for burst mode(the new patch) may not show
> >>> any real improvement as the cost is divided by number of event.
> >>> Whereas optimizing the check for single event case(The current code)
> >>> shows better performance with single event case and no regression
> >>> with burst mode as cost is divided by number of events.
> >>
> >> I ran some tests on an AMD Zen 3 with DSW.
> >> In the below tests the enqueue burst size is not compile-time constant.
> >>
> >> Enqueue burst size      Performance improvement
> >> Run-time constant 1     ~5%
> >> Run-time constant 2     ~0%
> >> Run-time variable 1-2   ~9%
> >> Run-time variable 1-16  ~0%
> >>
> >> The run-time variable enqueue sizes randomly (uniformly) distributed in
> >> the specified range.
> >>
> >> The first result may come as a surprise. The benchmark is using
> >> RTE_EVENT_OP_FORWARD type events (which likely is the dominating op type
> >> in most apps). The single-event enqueue function only exists in a
> >> generic variant (i.e., no rte_event_enqueue_forward_burst() equivalent).
> >> I suspect that is the reason for the performance improvement.
> >>
> >> This effect is large-enough to make it somewhat beneficial (+~1%) to use
> >> run-time variable single-event enqueue compared to keeping the burst
> >> size compile-time constant.
> >
> > # Interesting, Could you share your testeventdev command to test it.
>
> I'm using a proprietary benchmark to evaluate the effect of these
> changes. There's certainly nothing secret about that program, and also
> nothing very DSW-specific either. I hope to at some point both extend
> DPDK eventdev tests to include DSW, and also to contribute
> benchmarks/characteristics tests (perf unit tests or as a separate
> program), if there seems to be a value in this.

Yes. Please extend the testeventdev for your use case so that all
drivers can test
and help to optimize _real world_ cases. Testeventdev already has
plugin kind of interface,
it should pretty easy to add new MODES.


>
> > # By having quick glance on DSW code, following change can be added(or
> >   similar cases).
> > Not sure such change in DSW driver is making a difference or nor?
> >
> >
> > diff --git a/drivers/event/dsw/dsw_event.c b/drivers/event/dsw/dsw_event.c
> > index e84b65d99f..455470997b 100644
> > --- a/drivers/event/dsw/dsw_event.c
> > +++ b/drivers/event/dsw/dsw_event.c
> > @@ -1251,7 +1251,7 @@ dsw_port_flush_out_buffers(struct dsw_evdev
> > *dsw, struct dsw_port *source_port)
> >   uint16_t
> >   dsw_event_enqueue(void *port, const struct rte_event *ev)
> >   {
> > -       return dsw_event_enqueue_burst(port, ev, unlikely(ev == NULL) ? 0 : 1);
> > +       return dsw_event_enqueue_burst(port, ev, 1);
>
> Good point.
>
> Historical note: I think that comparison is old cruft borne out of a
> misconception, that the single-event enqueue could be called directly
> from application code, combined with the fact that producer-only ports
> needed some way to "maintain" a port, prior to the introduction of
> rte_event_maintain().
>
> >   }
> >
> >   static __rte_always_inline uint16_t
> > @@ -1340,7 +1340,7 @@ dsw_event_enqueue_burst_generic(struct dsw_port
> > *source_port,
> >          return (num_non_release + num_release);
> >   }
> >
> > -uint16_t
> > +inline uint16_t
>
>  From what it seems, this does not have the desired effect, at least not
> on GCC 11.3 (w/ the default DPDK compiler configuration).
>
> I reached this conclusion when I noticed that if I reshuffle the code so
> to force (not hint) the inlining of the burst (and generic burst)
> enqueue function into dsw_event_enqueue(), your change performs better.
>
> >   dsw_event_enqueue_burst(void *port, const struct rte_event events[],
> >                          uint16_t events_len)
> >   {
> >
> > # I am testing with command like this "/build/app/dpdk-test-eventdev
> > -l 0-23 -a 0002:0e:00.0 -- --test=perf_atq --plcores 1 --wlcores 8
> > --stlist p --nb_pkts=10000000000"
> >
>
>
> I re-ran the compile-time variable, run-time constant enqueue size of 1,
> and I got the following:
>
> Jerin's change: +4%
> Jerin's change + ensure inlining: +6%
> RFC v3: +7%
>
> (Here I use a more different setup that produces more deterministic
> results, hence the different numbers compared to the previous runs. They
> were using a pipeline spread over two chiplets, and these runs are using
> only a single chiplet.)
>
> It seems like with your suggested changes you eliminate most of the
> single-enqueue-special case performance degradation (for DSW), but not
> all of it. The remaining degradation is very small (for the above case,


Cores like AMD Zen 3, I was not expecting 1% diff with such check.
e.s.p if it has proper branch predictors. Even pretty low-end arm
cores, had around
0.7% diff and new arm cores shows no difference.

> larger for small by run-time variable enqueue sizes), but it's a little
> sad that a supposedly performance-enhancing special case (that drives
> complexity in the code, although not much) actually degrades performance.

OK. Let's get rid of fp_ops->dequeue callback. Initial RFC of eventdev
has public non burst API,
that was the reason for that callback.

>
> >>
> >> The performance gain is counted toward both enqueue and dequeue costs
> >> (+benchmark app overhead), so an under-estimation if see this as an
> >> enqueue performance improvement.
> >>
> >>> If you agree, then we can skip this patch.
> >>>
> >>
> >> I have no strong opinion if this should be included or not.
> >>
> >> It was up to me, I would drop the single-enqueue special case handling
> >> altogether in the next ABI update.
> >
> > That's a reasonable path. If we are willing to push a patch, we can
> > test it and give feedback.
> > Or in our spare time, We can do that as well.
> >
>
> Sure, I'll give it a try.
>
> The next release is an ABI-breaking one?

Yes  (23.11). Please plan to send deprecation notice before 23.07 release.

I will mark this patch as rejected in patchwork.

Thanks for your time.
  

Patch

diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index a90e23ac8b..a471caeb6d 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -1944,7 +1944,7 @@  __rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id,
 	 * Allow zero cost non burst mode routine invocation if application
 	 * requests nb_events as const one
 	 */
-	if (nb_events == 1)
+	if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
 		return (fp_ops->enqueue)(port, ev);
 	else
 		return fn(port, ev, nb_events);
@@ -2200,7 +2200,7 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * Allow zero cost non burst mode routine invocation if application
 	 * requests nb_events as const one
 	 */
-	if (nb_events == 1)
+	if (__extension__(__builtin_constant_p(nb_events)) && nb_events == 1)
 		return (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
 		return (fp_ops->dequeue_burst)(port, ev, nb_events,