diff mbox series

[RFC] eventdev: ensure 16-byte alignment for events

Message ID	20231005115101.12276-1-bruce.richardson@intel.com (mailing list archive)
State	Superseded, archived
Delegated to:	Jerin Jacob
Headers	From: Bruce Richardson <bruce.richardson@intel.com> To: dev@dpdk.org Cc: Bruce Richardson <bruce.richardson@intel.com>, Jerin Jacob <jerinj@marvell.com> Subject: [RFC PATCH] eventdev: ensure 16-byte alignment for events Date: Thu, 5 Oct 2023 12:51:00 +0100 Message-Id: <20231005115101.12276-1-bruce.richardson@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dev-bounces@dpdk.org
Series	[RFC] eventdev: ensure 16-byte alignment for events \| [RFC] eventdev: ensure 16-byte alignment for events

Checks

Context	Check	Description
ci/checkpatch	success	coding style OK
ci/loongarch-compilation	success	Compilation OK
ci/loongarch-unit-testing	success	Unit Testing PASS

Commit Message

Bruce Richardson Oct. 5, 2023, 11:51 a.m. UTC

  The event structure in DPDK is 16-bytes in size, and events are
regularly passed as parameters directly rather than being passed as
pointers. To help compiler optimize correctly, we can explicitly request
16-byte alignment for events, which means that we should be able
to do aligned vector loads/stores (e.g. with SSE or Neon) when working
with those events.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/eventdev/rte_eventdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Bruce Richardson Oct. 5, 2023, 12:06 p.m. UTC | #1

On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers. To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  lib/eventdev/rte_eventdev.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 2ba8a7b090..bb0d59b059 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -1344,7 +1344,7 @@ struct rte_event {
>  		struct rte_event_vector *vec;
>  		/**< Event vector pointer. */
>  	};
> -};
> +} __rte_aligned(16);
>  

Looking for feedback on this idea - hence the fact this is going as an RFC.
It seems to me something that should be done for performance reasons, but
I'm not sure if there are any negative consequences of doing this.

Since this is an ABI-affecting change, a decision on this needs to be made
for 23.11, or else it will be locked in for at least another year. Hence me
sending it now as an RFC late in the release cycle, rather than deferring
to next release.

/Bruce

Morten Brørup Oct. 5, 2023, 12:12 p.m. UTC | #2

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 5 October 2023 13.51
ure 16-byte alignment for events
> 
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers. To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---

AFAICS it is not specified anywhere that the size of this struct must be 16 byte.

Consider adding this requirement to the struct documentation, and possibly a static_assert to verify.

Acked-by: Morten Brørup <mb@smartsharesystems.com>

Bruce Richardson Oct. 5, 2023, 1:02 p.m. UTC | #3

On Thu, Oct 05, 2023 at 02:12:10PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, 5 October 2023 13.51
> ure 16-byte alignment for events
> > 
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers. To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> > 
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> 
> AFAICS it is not specified anywhere that the size of this struct must be 16 byte.
> 
> Consider adding this requirement to the struct documentation, and possibly a static_assert to verify.
> 
Yes, good point to document and verify.

Jerin Jacob Oct. 5, 2023, 1:11 p.m. UTC | #4

On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers. To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/eventdev/rte_eventdev.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > index 2ba8a7b090..bb0d59b059 100644
> > --- a/lib/eventdev/rte_eventdev.h
> > +++ b/lib/eventdev/rte_eventdev.h
> > @@ -1344,7 +1344,7 @@ struct rte_event {
> >               struct rte_event_vector *vec;
> >               /**< Event vector pointer. */
> >       };
> > -};
> > +} __rte_aligned(16);
> >
>

+ Eventdev driver maintainers for review and for performance testing.

> Looking for feedback on this idea - hence the fact this is going as an RFC.

Are you seeing any performance improvement ? Look like only DLB2
driver only using SEE or AVX512 instructions.

> It seems to me something that should be done for performance reasons, but
> I'm not sure if there are any negative consequences of doing this.

In general, it looks OK, However, We may need more testing.
I can only speculate the following as of now, Since event memory is
allocated from stack most case,
there may stack pointer fix up in code for desired alignment ie. some
add/sub instructions which comes for free most likely due to pipeline.

We will test on Marvel HW. Request others to test on their HWs.


>
> Since this is an ABI-affecting change, a decision on this needs to be made
> for 23.11, or else it will be locked in for at least another year. Hence me
> sending it now as an RFC late in the release cycle, rather than deferring
> to next release.
>
> /Bruce

Bruce Richardson Oct. 5, 2023, 1:15 p.m. UTC | #5

On Thu, Oct 05, 2023 at 06:41:34PM +0530, Jerin Jacob wrote:
> On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > > The event structure in DPDK is 16-bytes in size, and events are
> > > regularly passed as parameters directly rather than being passed as
> > > pointers. To help compiler optimize correctly, we can explicitly request
> > > 16-byte alignment for events, which means that we should be able
> > > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > > with those events.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  lib/eventdev/rte_eventdev.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > > index 2ba8a7b090..bb0d59b059 100644
> > > --- a/lib/eventdev/rte_eventdev.h
> > > +++ b/lib/eventdev/rte_eventdev.h
> > > @@ -1344,7 +1344,7 @@ struct rte_event {
> > >               struct rte_event_vector *vec;
> > >               /**< Event vector pointer. */
> > >       };
> > > -};
> > > +} __rte_aligned(16);
> > >
> >
> 
> + Eventdev driver maintainers for review and for performance testing.
> 
> > Looking for feedback on this idea - hence the fact this is going as an RFC.
> 
> Are you seeing any performance improvement ? Look like only DLB2
> driver only using SEE or AVX512 instructions.
> 

The idea would be that the driver code (and eventdev code) should not need
to use SSE directly. If we mark the event struct as aligned, it should help
encourage the compiler to use these instructions under-the-hood. For
example, when copying an event, the compiler should be emitting 128-bit
loads and stores for most platforms.

/Bruce

Jerin Jacob Oct. 6, 2023, 7:19 a.m. UTC | #6

On Thu, Oct 5, 2023 at 6:52 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Oct 05, 2023 at 06:41:34PM +0530, Jerin Jacob wrote:
> > On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > > > The event structure in DPDK is 16-bytes in size, and events are
> > > > regularly passed as parameters directly rather than being passed as
> > > > pointers. To help compiler optimize correctly, we can explicitly request
> > > > 16-byte alignment for events, which means that we should be able
> > > > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > > > with those events.
> > > >
> > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > ---
> > > >  lib/eventdev/rte_eventdev.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > > > index 2ba8a7b090..bb0d59b059 100644
> > > > --- a/lib/eventdev/rte_eventdev.h
> > > > +++ b/lib/eventdev/rte_eventdev.h
> > > > @@ -1344,7 +1344,7 @@ struct rte_event {
> > > >               struct rte_event_vector *vec;
> > > >               /**< Event vector pointer. */
> > > >       };
> > > > -};
> > > > +} __rte_aligned(16);
> > > >
> > >
> >
> > + Eventdev driver maintainers for review and for performance testing.
> >
> > > Looking for feedback on this idea - hence the fact this is going as an RFC.
> >
> > Are you seeing any performance improvement ? Look like only DLB2
> > driver only using SEE or AVX512 instructions.
> >
>
> The idea would be that the driver code (and eventdev code) should not need
> to use SSE directly. If we mark the event struct as aligned, it should help
> encourage the compiler to use these instructions under-the-hood. For
> example, when copying an event, the compiler should be emitting 128-bit
> loads and stores for most platforms.

With limited testing, there is no performance regression is seen. In
fact, the compiler is generating the
same instruction stream on both cases.

If there are no objections from others, Please send v1 with "ABI
change" update in doc/guides/rel_notes/release_23_11.rst.

With above change,
Acked-by: Jerin Jacob <jerinj@marvell.com>

NOTE: I already made PR to Thomas for rc1. Since is needs to part of
rc1, I need to sync with @Thomas Monjalon  as well.







>
> /Bruce

Mattias Rönnblom Oct. 6, 2023, 12:15 p.m. UTC | #7

On 2023-10-05 13:51, Bruce Richardson wrote:
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers.

When are events passed by-value, rather than by-reference? There are no 
such examples in the public eventdev API.

To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
> 

That change is both helping and sabotaging the optimizer's work. Now 
every stack allocation needs to be 2-byte aligned - in DPDK code, and in 
the application.

The effect this change has on an eventdev app using DSW is a ~3 
cycle/event performance degradation on an AMD Zen 3 system, and a ~4 
cycle/event performance degradation on a Skylake-generation Intel CPU.

What scenarios do you have in mind, where this change would improve the 
generated code? Something where there are no unaligned loads available 
in the ISA, or they are much slower than their aligned counterparts?

When I looked into the same issue for the DPDK IP checksumming routines, 
there basically were no such. Not that I could find.

> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>   lib/eventdev/rte_eventdev.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 2ba8a7b090..bb0d59b059 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -1344,7 +1344,7 @@ struct rte_event {
>   		struct rte_event_vector *vec;
>   		/**< Event vector pointer. */
>   	};
> -};
> +} __rte_aligned(16);
>   
>   /* Ethdev Rx adapter capability bitmap flags */
>   #define RTE_EVENT_ETH_RX_ADAPTER_CAP_INTERNAL_PORT	0x1

Bruce Richardson Oct. 6, 2023, 12:19 p.m. UTC | #8

On Fri, Oct 06, 2023 at 02:15:00PM +0200, Mattias Rönnblom wrote:
> On 2023-10-05 13:51, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers.
> 
> When are events passed by-value, rather than by-reference? There are no such
> examples in the public eventdev API.
> 
> To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> > 
> 
> That change is both helping and sabotaging the optimizer's work. Now every
> stack allocation needs to be 2-byte aligned - in DPDK code, and in the
> application.
> 
> The effect this change has on an eventdev app using DSW is a ~3 cycle/event
> performance degradation on an AMD Zen 3 system, and a ~4 cycle/event
> performance degradation on a Skylake-generation Intel CPU.
> 

Thanks for checking - this is the sort of feedback needed alright. In SW
eventdev we copy events around alot without using pointers, so I felt that
alignment would be helpful to avoid issues with events spanning cachelines.

However, since it has negative impacts, I'm quite happy to drop the idea,
and keep things as they are. I'll mark the change as rejected in patchwork.

/Bruce

Mattias Rönnblom Oct. 6, 2023, 12:29 p.m. UTC | #9

On 2023-10-06 14:19, Bruce Richardson wrote:
> On Fri, Oct 06, 2023 at 02:15:00PM +0200, Mattias Rönnblom wrote:
>> On 2023-10-05 13:51, Bruce Richardson wrote:
>>> The event structure in DPDK is 16-bytes in size, and events are
>>> regularly passed as parameters directly rather than being passed as
>>> pointers.
>>
>> When are events passed by-value, rather than by-reference? There are no such
>> examples in the public eventdev API.
>>
>> To help compiler optimize correctly, we can explicitly request
>>> 16-byte alignment for events, which means that we should be able
>>> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
>>> with those events.
>>>
>>
>> That change is both helping and sabotaging the optimizer's work. Now every
>> stack allocation needs to be 2-byte aligned - in DPDK code, and in the
>> application. >>
>> The effect this change has on an eventdev app using DSW is a ~3 cycle/event
>> performance degradation on an AMD Zen 3 system, and a ~4 cycle/event
>> performance degradation on a Skylake-generation Intel CPU.
>>
> 
> Thanks for checking - this is the sort of feedback needed alright. In SW
> eventdev we copy events around alot without using pointers, so I felt that
> alignment would be helpful to avoid issues with events spanning cachelines.
> 
> However, since it has negative impacts, I'm quite happy to drop the idea,
> and keep things as they are. I'll mark the change as rejected in patchwork.
> 

I think this was an excellent idea, it just didn't happen to have the 
desired effect. At least not in the particular cases where I evaluated it.

Given the complexity of compilers and CPUs, it's almost impossible to 
predict the outcome, performance-wise, of a particular change.

Stephen Hemminger Jan. 19, 2024, 10:30 p.m. UTC | #10

On Fri, 6 Oct 2023 14:15:00 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2023-10-05 13:51, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers.  
> 
> When are events passed by-value, rather than by-reference? There are no 
> such examples in the public eventdev API.
> 
> To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >   
> 
> That change is both helping and sabotaging the optimizer's work. Now 
> every stack allocation needs to be 2-byte aligned - in DPDK code, and in 
> the application.
> 
> The effect this change has on an eventdev app using DSW is a ~3 
> cycle/event performance degradation on an AMD Zen 3 system, and a ~4 
> cycle/event performance degradation on a Skylake-generation Intel CPU.
> 
> What scenarios do you have in mind, where this change would improve the 
> generated code? Something where there are no unaligned loads available 
> in the ISA, or they are much slower than their aligned counterparts?
> 
> When I looked into the same issue for the DPDK IP checksumming routines, 
> there basically were no such. Not that I could find.
> 
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

Don't understand what the issue is. Isn't the event structure typically
on stack. Adding padding there would not help.

diff mbox series

Patch

diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 2ba8a7b090..bb0d59b059 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -1344,7 +1344,7 @@  struct rte_event {
 		struct rte_event_vector *vec;
 		/**< Event vector pointer. */
 	};
-};
+} __rte_aligned(16);
 
 /* Ethdev Rx adapter capability bitmap flags */
 #define RTE_EVENT_ETH_RX_ADAPTER_CAP_INTERNAL_PORT	0x1