[v5,0/8] Introduce event vectorization

Message ID 20210324050525.4489-1-pbhagavatula@marvell.com (mailing list archive)
Message

Pavan Nikhilesh Bhagavatula March 24, 2021, 5:05 a.m. UTC
From: Pavan Nikhilesh <pbhagavatula@marvell.com>

In the traditional event programming model, events are identified by a
flow-id and a uintptr_t. The flow-id uniquely identifies a given event
and determines the order of scheduling based on the schedule type; the
uintptr_t holds a single object.

Event devices also support burst mode with configurable dequeue depth,
i.e. each dequeue call would return multiple events and each event
might be at a different stage of the pipeline.
Having a burst of events belonging to different stages in a dequeue
burst is not only difficult to vectorize but also increases the scheduler
overhead and application overhead of pipelining events further.
Using event vectors we see a performance gain of ~628% as shown in [1].

By introducing event vectorization, each event will be capable of holding
multiple uintptr_t of the same flow thereby allowing applications
to vectorize their pipeline and reduce the complexity of pipelining
events across multiple stages. This also reduces the complexity of handling
enqueue and dequeue on an event device.
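
As a rough illustration of the paragraph above, the difference between a
single-object event and a vector event can be modelled as below. This is only
a sketch: `demo_event`, `demo_event_vector` and `demo_event_nb_objs` are
hypothetical names and are not the structures added by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: several objects of one flow carried together. */
struct demo_event_vector {
	uint16_t nb_elem;   /* number of valid entries in ptrs[] */
	uintptr_t ptrs[8];  /* objects belonging to the same flow */
};

/* Hypothetical event: either one object or a pointer to a vector. */
struct demo_event {
	uint32_t flow_id;
	uint8_t is_vector;  /* 0: payload is 'obj', 1: payload is 'vec' */
	union {
		uintptr_t obj;
		struct demo_event_vector *vec;
	} u;
};

/* Number of objects a single dequeued event delivers to the worker. */
static inline size_t
demo_event_nb_objs(const struct demo_event *ev)
{
	assert(ev != NULL);
	return ev->is_vector ? ev->u.vec->nb_elem : 1;
}
```

With this model, one dequeue of a vector event hands the worker `nb_elem`
objects of the same flow instead of one.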

Since event devices are transparent to the events they are scheduling,
event producers such as eth_rx_adapter, crypto_adapter, etc. are
responsible for vectorizing the buffers of the same flow into a single
event.

The series also breaks ABI in patch [8/8], which is targeted at the
v21.11 release.

The dpdk-test-eventdev application has been updated with options to test
multiple vector sizes and timeouts.

[1]
As for performance improvement, measured with an ARM Cortex-A72 equivalent
processor, software event device (--vdev=event_sw0), single worker core,
single stage and using one service core for Rx adapter, Tx adapter and
scheduling:

Without event vectorization:
    ./build/app/dpdk-test-eventdev -l 7-23 -s 0x700 --vdev="event_sw0" --
         --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
         --stlist=a --wlcores=20
    Port[0] using Rx adapter[0] configured
    Port[0] using Tx adapter[0] Configured
    4.728 mpps avg 4.728 mpps

With event vectorization:
    ./build/app/dpdk-test-eventdev -l 7-23 -s 0x700 --vdev="event_sw0" --
        --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
        --stlist=a --wlcores=20 --enable_vector --nb_eth_queues 1
        --vector_size 256
    Port[0] using Rx adapter[0] configured
    Port[0] using Tx adapter[0] Configured
    34.383 mpps avg 34.383 mpps

Having a dedicated service core for each Rx queue and tuning the vector
size and dequeue burst size would further improve performance.
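
For reference, the quoted gain can be reproduced from the two throughput
figures above. The helper below is only a worked-arithmetic sketch and
`demo_gain_percent` is a hypothetical name.

```c
#include <assert.h>

/* Relative gain in percent when throughput goes from 'base' to 'vec'. */
static double
demo_gain_percent(double base_mpps, double vec_mpps)
{
	assert(base_mpps > 0.0);
	return (vec_mpps - base_mpps) / base_mpps * 100.0;
}
```

Calling `demo_gain_percent(4.728, 34.383)` gives roughly 627%, i.e. about
7.3 times the non-vector throughput, which is in line with the ~628% figure.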

API usage is shown below:

Configuration:

	struct rte_event_eth_rx_adapter_event_vector_config vec_conf;

	vector_pool = rte_event_vector_pool_create("vector_pool",
			nb_elem, 0, vector_size, socket_id);

	rte_event_eth_rx_adapter_create(id, event_id, &adptr_conf);
	rte_event_eth_rx_adapter_queue_add(id, eth_id, -1, &queue_conf);
	if (cap & RTE_EVENT_ETH_RX_ADAPTER_CAP_EVENT_VECTOR) {
		vec_conf.vector_sz = vector_size;
		vec_conf.vector_timeout_ns = vector_tmo_nsec;
		vec_conf.vector_mp = vector_pool;
		rte_event_eth_rx_adapter_queue_event_vector_config(id,
				eth_id, -1, &vec_conf);
	}

Fastpath:

	num = rte_event_dequeue_burst(event_id, port_id, &ev, 1, 0);
	if (!num)
		continue;

	if (ev.event_type & RTE_EVENT_TYPE_VECTOR) {
		switch (ev.event_type) {
		case RTE_EVENT_TYPE_ETHDEV_VECTOR:
		case RTE_EVENT_TYPE_ETH_RX_ADAPTER_VECTOR: {
			struct rte_mbuf **mbufs = ev.vector_ev->mbufs;

			for (i = 0; i < ev.vector_ev->nb_elem; i++) {
				/* Process mbufs[i]. */
			}
			break;
		}
		case ...
		}
	}
	...
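
The fastpath above first tests a flag bit and then switches on the full event
type. A self-contained sketch of that dispatch follows, assuming each vector
type is a base type OR'ed with a vector flag. The `DEMO_*` values are
illustrative only and may differ from the real RTE_EVENT_TYPE_* constants;
`demo_is_mbuf_vector` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only; the real RTE_EVENT_TYPE_* values may differ. */
#define DEMO_TYPE_ETHDEV          0x0
#define DEMO_TYPE_ETH_RX_ADAPTER  0x4
#define DEMO_TYPE_VECTOR          0x8  /* flag bit: payload is a vector */
#define DEMO_TYPE_ETHDEV_VECTOR   (DEMO_TYPE_VECTOR | DEMO_TYPE_ETHDEV)
#define DEMO_TYPE_ETH_RX_ADAPTER_VECTOR \
	(DEMO_TYPE_VECTOR | DEMO_TYPE_ETH_RX_ADAPTER)

/* Return 1 if the event carries a vector of mbufs, 0 otherwise. */
static int
demo_is_mbuf_vector(uint8_t event_type)
{
	assert(event_type <= 0xF);  /* the demo uses a 4-bit type field */
	if (!(event_type & DEMO_TYPE_VECTOR))
		return 0;  /* cheap early exit for non-vector events */
	switch (event_type) {
	case DEMO_TYPE_ETHDEV_VECTOR:
	case DEMO_TYPE_ETH_RX_ADAPTER_VECTOR:
		return 1;
	default:
		return 0;  /* some other kind of vector */
	}
}
```

Testing the flag first keeps the common non-vector path to a single branch.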

v5 Changes:
- Make `rte_event_vector_pool_create` non-inline to ease ABI stability. (Ray)
- Move `rte_event_eth_rx_adapter_queue_event_vector_config` and
  `rte_event_eth_rx_adapter_vector_limits_get` implementation to the patch
  where they are initially defined. (Ray)
- Multiple grammatical and style fixes. (Jerin)
- Add missing release notes. (Jerin)

v4 Changes:
- Fix missing event vector structure in event structure.(Jay)

v3 Changes:
- Fix unintended formatting changes.

v2 Changes:
- Multiple grammatical and style fixes. (Jerin)
- Add parameter to define vector size in power of 2. (Jerin)
- Redo patch series w/o breaking ABI till the last patch. (David)
- Add deprecation notice to announce ABI break in 21.11. (David)
- Add vector limits validation to app/test-eventdev.

Pavan Nikhilesh (8):
  eventdev: introduce event vector capability
  eventdev: introduce event vector Rx capability
  eventdev: introduce event vector Tx capability
  eventdev: add Rx adapter event vector support
  eventdev: add Tx adapter event vector support
  app/eventdev: add event vector mode in pipeline test
  doc: announce event Rx adapter config changes
  eventdev: simplify Rx adapter event vector config

 app/test-eventdev/evt_common.h                |   4 +
 app/test-eventdev/evt_options.c               |  52 +++
 app/test-eventdev/evt_options.h               |   4 +
 app/test-eventdev/test_pipeline_atq.c         | 310 +++++++++++++++--
 app/test-eventdev/test_pipeline_common.c      | 105 +++++-
 app/test-eventdev/test_pipeline_common.h      |  18 +
 app/test-eventdev/test_pipeline_queue.c       | 320 ++++++++++++++++--
 .../prog_guide/event_ethernet_rx_adapter.rst  |  38 +++
 .../prog_guide/event_ethernet_tx_adapter.rst  |  12 +
 doc/guides/prog_guide/eventdev.rst            |  36 +-
 doc/guides/rel_notes/deprecation.rst          |   9 +
 doc/guides/rel_notes/release_21_05.rst        |   8 +
 doc/guides/tools/testeventdev.rst             |  45 ++-
 lib/librte_eventdev/eventdev_pmd.h            |  31 +-
 .../rte_event_eth_rx_adapter.c                | 305 ++++++++++++++++-
 .../rte_event_eth_rx_adapter.h                |  78 +++++
 .../rte_event_eth_tx_adapter.c                |  66 +++-
 lib/librte_eventdev/rte_eventdev.c            |  53 ++-
 lib/librte_eventdev/rte_eventdev.h            | 113 ++++++-
 lib/librte_eventdev/version.map               |   4 +
 20 files changed, 1524 insertions(+), 87 deletions(-)

--
2.17.1

Comments

Jayatheerthan, Jay March 24, 2021, 5:39 a.m. UTC | #1
> Using event vectors we see a performance gain of ~628% as shown in [1].
This is a very impressive performance boost. Thanks so much for putting this patchset together! Just curious, was any performance measurement done for existing (non-vector) applications?
> 
> Without event vectorization:
>     ./build/app/dpdk-test-eventdev -l 7-23 -s 0x700 --vdev="event_sw0" --
>          --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
>          --stlist=a --wlcores=20
>     Port[0] using Rx adapter[0] configured
>     Port[0] using Tx adapter[0] Configured
>     4.728 mpps avg 4.728 mpps
Is this number from before the patchset? If so, it would help to add a similar number with the patchset applied but without using the vectorization feature.

Just a heads up: the v5 patchset doesn't apply cleanly on HEAD (5f0849c1155849dfdbf950c91c52cdf9cd301f59). It does, however, apply cleanly on "app/eventdev: fix timeout accuracy" (c33d48387dc8ccf1b432820f6e0cd4992ab486df).

Pavan Nikhilesh Bhagavatula March 24, 2021, 6:44 a.m. UTC | #2
>Is this number before the patchset? If so, it would help put similar
>number with the patchset but not using vectorization feature.

I don't remember the exact clock frequency I was using when I ran
the above test, but with equal clocks:
1. Without the patchset applied
	5.071 mpps
2. With the patchset applied, without enabling vector
	5.123 mpps
3. With the patchset applied, with vector enabled
	vector_sz@256 42.715 mpps
	vector_sz@512 45.335 mpps

>
>Just a heads up. v5 patchset doesn't apply cleanly on HEAD
>(5f0849c1155849dfdbf950c91c52cdf9cd301f59). Although, it applies
>cleanly on app/eventdev: fix timeout accuracy
>(c33d48387dc8ccf1b432820f6e0cd4992ab486df).

This patchset is currently rebased on the main branch; I will rebase it on
dpdk-next-event in the next version.

Jayatheerthan, Jay March 24, 2021, 8:10 a.m. UTC | #3
> I don’t remember the exact clock frequency I was using when I ran
> the above test but with equal clocks:
> 1. Without the patchset applied
> 	5.071 mpps
> 2. With patchset applied w/o enabling vector
> 	5.123 mpps
> 3. With patchset applied with enabling vector
> 	vector_sz@256 42.715 mpps
> 	vector_sz@512 45.335 mpps
> 

Thanks Pavan for the details. It may be useful to include this info in the patchset.
