[RFC] Accelerator API to chain packet processing functions
Introduction
============
This RFC introduces a new DPDK library, rte_accelerator.
The main aim of this library is to provide a flexible and extensible way of combining one or more packet-processing functions into a single operation, thereby allowing these to be performed in parallel in optimized software libraries or in a hardware accelerator. These functions can include cryptography, compression and CRC/checksum calculation, while others can potentially be added in the future. Performing these functions in parallel as a single operation can enable a significant performance improvement.
Background
==========
There are a number of byte-wise operations which are present and common across many access network data-plane pipelines, such as Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some prototyping has been done at Intel in relation to the 01.org access-network-dataplanes project to prove that a significant performance improvement is possible when such byte-wise operations are combined into a single pass of packet data processing. This performance boost has been prototyped for both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
The prototypes used some protocol-specific modifications to the DPDK cryptodev library. In order to make this performance improvement consumable by network access equipment vendors, a more extensible and correct solution is required that can be upstreamed into DPDK.
Hence, the introduction of rte_accelerator.
Use Cases
=========
The primary use cases for this new library have already been mentioned. These are:
- DOCSIS MAC: Crypto-CRC
- Order:
- Downstream: CRC, Encrypt
- Upstream: Decrypt, CRC
- Specifications:
- Crypto: 128-bit AES-CFB encryption variant for DOCSIS as described in section 11.1 of DOCSIS 3.1 Security Specification (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
- CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
- XGS-PON MAC: Crypto-CRC-BIP
- Order:
- Downstream: CRC, Encrypt, BIP
- Upstream: BIP, Decrypt, CRC
- Specifications:
- Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode (AES-CTR), as described in [NIST SP800-38A].
- CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
- BIP: 4-byte bit-interleaved even parity (BIP) field computed over the entire FS frame, refer to ITU-T G.989.3, sections 8.1.1.5 and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-G.989.3-201510-I!!PDF-E)
Note that support for both these chained operations is already available in the Intel IPSec Multi-Buffer library.
However, it is not limited to these. The following are some of the other possible use-cases, which rte_accelerator will allow for:
- Storage:
- Compression followed by Encryption
- IPSec over UDP:
- UDP Checksum calculation followed by Encryption
While DPDK's rte_cryptodev and rte_compressdev allow many cryptographic and compression algorithms to be chained together in one operation, there is no way to chain these with any error detection or checksum algorithms, nor to chain crypto and compression algorithms together. rte_accelerator will allow these chains to be created, and will also allow any future type of operation to be easily added.
Architecture
============
The following diagram shows where rte_accelerator fits in an overall application architecture.
As can be seen from the diagram, the rte_accelerator API will depend on existing DPDK device libraries (i.e. rte_cryptodev and rte_compressdev) for existing features like crypto and compression. However, any new services/functions not covered in the existing libraries, such as CRC and BIP, will be provided directly by the rte_accelerator API:
+--------------------------------------------------------------------------------------+
|                                                                                      |
|                                     Application                                     |
|                     (e.g. vCMTS (DOCSIS), vOLT (XGS-PON), etc.)                     |
|                                                                                      |
+--------------------------------------------------------------------------------------+
                  |                         |                          |
+-----------------|-------------------------|--------------------------|---------------+
|                 |                         |             DPDK         |               |
|                 |                         |                          |               |
|                 |              Session creation/deletion             |               |
|                 |            ~ Operation creation/deletion           |               |
| Device          |                         | Operation enqueue/dequeue| Device        |
| initialization, |                         |                          |initialization,|
| configuration,  ~     +--------------------------------------+       ~ configuration,|
| reset           |     |           rte_accelerator            |       | reset         |
|                 |     | |-- err_detect xform/op              |       |               |
|                 |     | |-- Session creation/deletion fns    |       |               |
|                 |     | `-- Operation enqueue/dequeue fns    |       |               |
|                 |     +--------------------------------------+       |               |
|                 |          /          /      \          \            |               |
|                 |         /          /        \          \           |               |
|                 |        /          |          |          \          |               |
|  +------------------------+         |          |       +------------------------+    |
|  |     rte_cryptodev      |         |          |       |    rte_compressdev     |    |
|  | |-- sym xforms/ops     |         |          |       | `-- comp xforms/ops    | ...|
|  | `-- asym xforms/ops    |         |          |       |                        |    |
|  +------------------------+         |          |       +------------------------+    |
|                      \     \       /           |         /                           |
|                       \     \_____/_______     |        /                            |
|                        \         /        \    |       /                             |
|                    +------------+        +------------+                              |
|                    |            |        |            |                              |
|                    |  AESNI-MB  |        |    QAT     | ...                          |
|                    |    PMD     |        |    PMD     |                              |
|                    |            |        |            |                              |
|                    +------------+        +------------+                              |
|                          |                     |                                     |
+--------------------------|---------------------|-------------------------------------+
                           |                     |
                     +------------+        +------------+
                     |            |        |            |
                     |  AESNI-MB  |        |   QAT HW   |
                     |   SW LIB   |        |            |
                     |            |        |            |
                     +------------+        +------------+
Some key points are:
1) rte_accelerator will use xform and operation related definitions (i.e. structs, enums, defines) from the existing device libraries, such as rte_cryptodev and rte_compressdev, to create accelerator xform chains, sessions and operation chains
a) this allows as much re-use of existing definitions as possible
2) The application code will use the existing device library functions to initialize, configure, start, stop and reset the device
3) The application code will call rte_accelerator to create/delete accelerator sessions and enqueue/dequeue accelerator operations
a) Each device PMD will register functions with rte_accelerator to perform these tasks
4) rte_accelerator will provide definitions/structs/etc. for the error detection xform and operation
a) the error detection 'algorithms' initially supported will include CRC32 and BIP32, but others can be added in the future as needed (an illustrative sketch of this xform is shown after this list)
b) other xform and operation types which do not fit in any existing device libraries, such as rte_cryptodev or rte_compressdev, but need to be accelerated can also be added under rte_accelerator in the future
c) note, however, that error detection and other xform and operation types under rte_accelerator could be moved to an independent library in the future, if required
5) rte_accelerator will not provide a capability check feature. Instead, rte_accelerator will return an error code on session creation if the xform chain specified is not supported by the underlying device
6) rte_accelerator does not support session-less mode of operation
a) sessions MUST always be created and attached to the operations being enqueued
7) Initially, support will be added to existing AESNI-MB and QAT PMDs for rte_accelerator functionality. Support can be added to other PMDs in the future as needed
a) To add support for rte_accelerator functionality, the PMD must implement and register the required callback functions mentioned in 3a) above
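To make point 4a) above more concrete, the error detection xform might look something like the following. This is an illustrative sketch only - the exact enum and field names are given in the full API at the end of the RFC.
/* Illustrative sketch only - names are assumptions based on point 4a) above */
enum rte_err_detect_algorithm {
        RTE_ERR_DETECT_CRC32_ETH, /**< Ethernet 32-bit CRC */
        RTE_ERR_DETECT_BIP32      /**< 32-bit bit-interleaved even parity */
};

enum rte_err_detect_operation {
        RTE_ERR_DETECT_OP_GENERATE, /**< Generate and insert the checksum/parity */
        RTE_ERR_DETECT_OP_VERIFY    /**< Verify the received checksum/parity */
};

struct rte_err_detect_xform {
        enum rte_err_detect_algorithm algo; /**< CRC32 or BIP32 */
        enum rte_err_detect_operation op;   /**< Generate or verify */
};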
Key API Definitions
===================
The full proposed rte_accelerator API is provided at the end of the RFC.
Here, some of the key structures and functions are described.
The following structure defines an accelerator xform
- Accelerator xforms are chained together through the next field
- The accelerator xform can contain a crypto (symmetric or asymmetric), compression or error detection xform. Others can be added in the future as needed
- The order of xforms in the chain specifies the order in which those operations should be performed on the packet data
/**
 * Accelerator transform setup data
 *
 * This structure is used to specify the accelerator transforms required.
 * Multiple transforms can be chained together to specify a chain of transforms
 * such as symmetric crypto followed by error detection, or compression followed
 * by symmetric crypto. Each transform structure holds a single transform, with
 * the type field specifying which transform is contained within the union.
 */
struct rte_accelerator_xform {
        struct rte_accelerator_xform *next;
        /**<
         * Next transform in the chain
         * - the last transform in the chain MUST set this to NULL
         */
        enum rte_accelerator_xform_type type;
        /**< Transform type */
        RTE_STD_C11
        union {
                struct rte_crypto_sym_xform crypto_sym;
                /**< Symmetric crypto transform */
                struct rte_crypto_asym_xform crypto_asym;
                /**< Asymmetric crypto transform */
                struct rte_comp_xform comp;
                /**< Compression transform */
                struct rte_err_detect_xform err_detect;
                /**< Error detection transform */
        };
};
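To illustrate the chaining, the following is a minimal sketch of building the DOCSIS downstream xform chain (CRC followed by encryption). The crypto fields use the existing rte_cryptodev definitions; the RTE_ACCELERATOR_XFORM_TYPE_* names and the error detection fields reuse the illustrative names sketched earlier and are assumptions rather than the final API.
/* Illustrative sketch: DOCSIS downstream xform chain (CRC, then Encrypt).
 * RTE_ACCELERATOR_XFORM_TYPE_* and the err_detect fields are assumed names;
 * the key and IV settings are application-specific examples. */
static uint8_t cipher_key[16]; /* DOCSIS BPI key, filled in by the application */

struct rte_accelerator_xform cipher_xform = {
        .next = NULL, /* last xform in the chain */
        .type = RTE_ACCELERATOR_XFORM_TYPE_CRYPTO_SYM, /* assumed name */
        .crypto_sym = {
                .type = RTE_CRYPTO_SYM_XFORM_CIPHER,
                .cipher = {
                        .op = RTE_CRYPTO_CIPHER_OP_ENCRYPT,
                        .algo = RTE_CRYPTO_CIPHER_AES_DOCSISBPI,
                        .key = { .data = cipher_key, .length = sizeof(cipher_key) },
                        .iv = { .offset = 0, .length = 16 }, /* IV location is application-specific */
                },
        },
};

struct rte_accelerator_xform crc_xform = {
        .next = &cipher_xform, /* CRC is performed first, then the cipher */
        .type = RTE_ACCELERATOR_XFORM_TYPE_ERR_DETECT, /* assumed name */
        .err_detect = {
                .algo = RTE_ERR_DETECT_CRC32_ETH, /* assumed name */
                .op = RTE_ERR_DETECT_OP_GENERATE, /* assumed name */
        },
};
This chain is then passed to session creation (described below); the upstream direction would simply use a decrypt xform followed by a CRC verify xform.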
The following structure defines an accelerator operation
- Accelerator operations are chained together through the next field
- The order of operations in this chain MUST match the order of xforms in the session's xform chain
- Any additional operation data (e.g. IV data) must follow immediately after this struct in memory
- The following fields MUST be set in the FIRST operation of the chain before enqueuing. These fields are ignored in the inner op structures and any subsequent rte_accelerator_op chain elements:
- sess
- m_src
- m_dst
- The following fields MUST be set in ALL operations in a chain before enqueuing:
- next
- mempool
- type
- After dequeuing, only the first operation in the chain will contain the overall status in the overall_status field, while each chain element will contain its individual status in the op_status field
/**
 * Accelerator operation data
 *
 * This structure is used to specify the operations for a particular session.
 * This includes specifying the source and, if required, destination mbufs and
 * the lengths and offsets of the data within these mbufs on which the
 * operations should be done. Multiple operations are chained together to
 * specify the full set of operations to be performed
 */
struct rte_accelerator_op {
        struct rte_accelerator_op *next;
        /**<
         * Next operation in the chain
         * - the last operation in the chain MUST set this to NULL
         */
        struct rte_accelerator_session *sess;
        /**< Handle for the associated accelerator session */
        struct rte_mempool *mempool;
        /**< Mempool from which the operation is allocated */
        struct rte_mbuf *m_src; /**< Source mbuf */
        struct rte_mbuf *m_dst; /**< Destination mbuf */
        enum rte_accelerator_op_status overall_status;
        /**<
         * Overall operation status
         * - indicates if all the operations in the chain succeeded or if any
         *   one of them failed
         */
        uint8_t op_status;
        /**<
         * Individual operation status
         * - indicates the status of the individual operation in the chain
         */
        enum rte_accelerator_op_type type;
        /**< Operation type */
        RTE_STD_C11
        union {
                struct rte_crypto_sym_op crypto_sym;
                /**< Symmetric crypto operation */
                struct rte_crypto_asym_op crypto_asym;
                /**< Asymmetric crypto operation */
                struct rte_comp_op comp;
                /**< Compression operation */
                struct rte_err_detect_op err_detect;
                /**< Error detection operation */
        };
};
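As an illustration of the rules above, the following is a sketch of building a two-element operation chain for a CRC + cipher session. The RTE_ACCELERATOR_OP_TYPE_* names, the use of a plain mempool for allocation and the meaning of m_dst == NULL (in-place processing, as in cryptodev) are assumptions.
/* Illustrative sketch of building a two-element operation chain (CRC + cipher)
 * matching the session's xform chain - op type names are assumed. */
static int
build_op_chain(struct rte_mempool *op_pool, struct rte_accelerator_session *sess,
                struct rte_mbuf *mbuf, struct rte_accelerator_op **chain)
{
        struct rte_accelerator_op *crc_op, *cipher_op;

        if (rte_mempool_get(op_pool, (void **)&crc_op) < 0)
                return -ENOMEM;
        if (rte_mempool_get(op_pool, (void **)&cipher_op) < 0) {
                rte_mempool_put(op_pool, crc_op);
                return -ENOMEM;
        }

        /* Fields that MUST be set in ALL operations of the chain */
        crc_op->next = cipher_op;
        crc_op->mempool = op_pool;
        crc_op->type = RTE_ACCELERATOR_OP_TYPE_ERR_DETECT;    /* assumed name */

        cipher_op->next = NULL;                               /* last op in the chain */
        cipher_op->mempool = op_pool;
        cipher_op->type = RTE_ACCELERATOR_OP_TYPE_CRYPTO_SYM; /* assumed name */

        /* Fields that MUST be set in the FIRST operation of the chain only */
        crc_op->sess = sess;
        crc_op->m_src = mbuf;
        crc_op->m_dst = NULL; /* assumed to mean in-place processing */

        /*
         * Per-stage offsets/lengths use the inner op structures, e.g.
         * cipher_op->crypto_sym.cipher.data.offset / .length and the
         * equivalent fields of crc_op->err_detect.
         */
        *chain = crc_op;
        return 0;
}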
The accelerator API is mainly a device-based API, with existing device PMDs extended to provide support for the API. When a device that supports the accelerator API initialises, it must set up its accelerator context in its private device data. The following structure defines an accelerator context - it contains a pointer to the device itself, a pointer to a structure containing the device's callback functions for session creation/deletion, operation enqueue/dequeue etc., and a count of the number of sessions attached to this context.
/**
 * Accelerator context for a device
 *
 * Accelerator instance for each driver to register their accelerator
 * operations. The application can get the accelerator context from the
 * underlying device using the API functions provided by the device's main API
 * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
 * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
 */
struct rte_accelerator_ctx {
        void *device;
        /**< Pointer to the device */
        const struct rte_accelerator_ops *ops;
        /**< Pointer to accelerator ops for the device */
        uint16_t sess_cnt;
        /**< Number of sessions attached to this context */
};
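For reference, the callback table referenced by this context might look roughly as follows. This is a sketch inferred from the session and enqueue/dequeue functions described below, not the final definition.
/* Sketch of the per-device callback table referenced by rte_accelerator_ctx.
 * Member names and prototypes are assumptions mirroring the public API. */
struct rte_accelerator_ops {
        /** Create an accelerator session from an xform chain */
        struct rte_accelerator_session *(*session_create)(void *device,
                        struct rte_accelerator_xform *xform, int socket_id);
        /** Destroy a previously created accelerator session */
        int (*session_destroy)(void *device, struct rte_accelerator_session *sess);
        /** Enqueue a burst of operation chains on a device queue pair */
        uint16_t (*ops_enqueue)(void *device, uint16_t qp_id,
                        struct rte_accelerator_op **ops, uint16_t nb_ops);
        /** Dequeue a burst of completed operation chains from a queue pair */
        uint16_t (*ops_dequeue)(void *device, uint16_t qp_id,
                        struct rte_accelerator_op **ops, uint16_t nb_ops);
};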
As the comment on struct rte_accelerator_ctx mentions, rte_cryptodev, rte_compressdev and any other device library which supports the rte_accelerator API in the future must provide API functions to get the accelerator context for a device based on a device id. In the cases of rte_cryptodev and rte_compressdev, these functions will have the following prototypes:
void *
rte_cryptodev_get_accelerator_ctx(uint8_t dev_id);

void *
rte_compressdev_get_accelerator_ctx(uint8_t dev_id);
An application will call these functions to get the device accelerator context, and then use this context when it wants to interact with the device for accelerator functionality. The following are some of the main accelerator API functions provided. For full details of the parameters and return values, see the full API at the end of the RFC.
- The following functions are used to create and destroy an accelerator session on a device:
struct rte_accelerator_session *
rte_accelerator_session_create(struct rte_accelerator_ctx *ctx,
                struct rte_accelerator_xform *xform,
                int socket_id);

int
rte_accelerator_session_destroy(struct rte_accelerator_ctx *ctx,
                struct rte_accelerator_session *sess);
- The following functions are used to enqueue and dequeue accelerator operations to/from a device
- The qp_id parameter specifies the queue pair of the device on which to enqueue/dequeue
- It is the responsibility of the application to manage the queue pair assignments within the application (e.g. the same queue pair should not be used for accelerator enqueue/dequeue and cryptodev enqueue/dequeue)
uint16_t
rte_accelerator_ops_enqueue(struct rte_accelerator_ctx *ctx,
                uint16_t qp_id,
                struct rte_accelerator_op **ops,
                uint16_t nb_ops);

uint16_t
rte_accelerator_ops_dequeue(struct rte_accelerator_ctx *ctx,
                uint16_t qp_id,
                struct rte_accelerator_op **ops,
                uint16_t nb_ops);
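Putting these pieces together, an application's accelerator data path might look roughly like the sketch below. Error handling is trimmed, the burst size, device/queue pair ids and op chain construction are application-specific, and RTE_ACCELERATOR_OP_STATUS_SUCCESS is an assumed status name.
/* Illustrative end-to-end flow - a sketch of how the pieces above fit together,
 * not the final API. */
#define BURST_SIZE 32

static int
accel_data_path(uint8_t cdev_id, uint16_t qp_id,
                struct rte_accelerator_xform *xform_chain)
{
        struct rte_accelerator_ctx *ctx;
        struct rte_accelerator_session *sess;
        struct rte_accelerator_op *ops[BURST_SIZE];
        struct rte_accelerator_op *deq_ops[BURST_SIZE];
        uint16_t nb_enq, nb_deq, i;

        /* Get the accelerator context from the underlying device (a cryptodev here) */
        ctx = rte_cryptodev_get_accelerator_ctx(cdev_id);

        /* Create a session from an xform chain such as the CRC + cipher one above */
        sess = rte_accelerator_session_create(ctx, xform_chain, rte_socket_id());
        if (sess == NULL)
                return -ENOTSUP; /* xform chain not supported by this device */

        /* ... build one op chain per ops[] entry, as sketched earlier ... */

        nb_enq = rte_accelerator_ops_enqueue(ctx, qp_id, ops, BURST_SIZE);

        /* In practice, poll until all enqueued chains have been dequeued */
        nb_deq = rte_accelerator_ops_dequeue(ctx, qp_id, deq_ops, nb_enq);

        for (i = 0; i < nb_deq; i++) {
                if (deq_ops[i]->overall_status != RTE_ACCELERATOR_OP_STATUS_SUCCESS) {
                        /* walk the chain and check each element's op_status */
                }
        }

        return rte_accelerator_session_destroy(ctx, sess);
}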
Full API
========
The following is the full proposed API for the rte_accelerator library. There are some minor updates to the rte_cryptodev and rte_compressdev API which are also listed.
Comments
On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
>
> Introduction
> ============
>
> This RFC introduces a new DPDK library, rte_accelerator.
>
> The main aim of this library is to provide a flexible and extensible way of combining one or more packet-processing functions into a single operation, thereby allowing these to be performed in parallel in optimized software libraries or in a hardware accelerator. These functions can include cryptography, compression and CRC/checksum calculation, while others can potentially be added in the future. Performing these functions in parallel as a single operation can enable a significant performance improvement.
>
>
> Background
> ==========
>
> There are a number of byte-wise operations which are present and common across many access network data-plane pipelines, such as Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some prototyping has been done at Intel in relation to the 01.org access-network-dataplanes project to prove that a significant performance improvement is possible when such byte-wise operations are combined into a single pass of packet data processing. This performance boost has been prototyped for both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
Could you share the relative performance numbers to show the gain?
>
> The prototypes used some protocol-specific modifications to the DPDK cryptodev library. In order to make this performance improvement consumable by network access equipment vendors, a more extensible and correct solution is required that can be upstreamed into DPDK.
>
> Hence, the introduction of rte_accelerator.
>
>
> Use Cases
> =========
>
> The primary use cases for this new library have already been mentioned. These are:
>
> - DOCSIS MAC: Crypto-CRC
> - Order:
> - Downstream: CRC, Encrypt
> - Upstream: Decrypt, CRC
> - Specifications:
> - Crypto: 128-bit AES-CFB encryption variant for DOCSIS as described in section 11.1 of DOCSIS 3.1 Security Specification (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
> - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
>
> - XGS-PON MAC: Crypto-CRC-BIP
> - Order:
> - Downstream: CRC, Encrypt, BIP
I understand if the chain has two operations then it may possible to
have handcrafted SW code to do both operations in one pass.
I understand the spec is agnostic on a number of passes it does
require to enable the xfrom but To understand the SW/HW capability,
In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
or three passes in SW or one pass using HW?
> - Upstream: BIP, Decrypt, CRC
> - Specifications:
> - Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode (AES-CTR), as described in [NIST SP800-38A].
> - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
> - BIP: 4-byte bit-interleaved even parity (BIP) field computed over the entire FS frame, refer to ITU-T G.989.3, sections 8.1.1.5 and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-G.989.3-201510-I!!PDF-E)
>
Hi Jerin,
Thanks for the comments. Please see replies below.
Kind Regards,
David
> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> >
> > Introduction
> > ============
> >
> > This RFC introduces a new DPDK library, rte_accelerator.
> >
> > The main aim of this library is to provide a flexible and extensible way of
> combining one or more packet-processing functions into a single operation,
> thereby allowing these to be performed in parallel in optimized software
> libraries or in a hardware accelerator. These functions can include
> cryptography, compression and CRC/checksum calculation, while others can
> potentially be added in the future. Performing these functions in parallel as a
> single operation can enable a significant performance improvement.
> >
> >
> > Background
> > ==========
> >
> > There are a number of byte-wise operations which are present and
> common across many access network data-plane pipelines, such as Cipher,
> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> prototyping has been done at Intel in relation to the 01.org access-network-
> dataplanes project to prove that a significant performance improvement is
> possible when such byte-wise operations are combined into a single pass of
> packet data processing. This performance boost has been prototyped for
> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
>
>
> Could you share the relative performance numbers to show the gain?
[DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
- XGS-PON MAC (Crypto-CRC-BIP):
- 3 independent stages: 1429 cycles/buf (13.75Gbps)
- 1 single-pass stage: 896 cycles/buf (21.9Gbps)
37% cycle reduction
- DOCSIS MAC (Crypto-CRC):
- 2 independent stages: 1421 cycles/buf (13.84Gbps)
- 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
20% cycle reduction
Adding the accelerator API will allow vendors to gain the benefits of these cycle savings.
>
> >
> > The prototypes used some protocol-specific modifications to the DPDK
> cryptodev library. In order to make this performance improvement
> consumable by network access equipment vendors, a more extensible and
> correct solution is required that can be upstreamed into DPDK.
> >
> > Hence, the introduction of rte_accelerator.
> >
> >
> > Use Cases
> > =========
> >
> > The primary use cases for this new library have already been mentioned.
> These are:
> >
> > - DOCSIS MAC: Crypto-CRC
> > - Order:
> > - Downstream: CRC, Encrypt
> > - Upstream: Decrypt, CRC
> > - Specifications:
> > - Crypto: 128-bit AES-CFB encryption variant for DOCSIS as
> described in section 11.1 of DOCSIS 3.1 Security Specification
> (https://apps.cablelabs.com/specification/CM-SP-SECv3.1)
> > - CRC: Ethernet 32-bit CRC as defined in
> > Ethernet/[ISO/IEC 8802-3]
> >
> > - XGS-PON MAC: Crypto-CRC-BIP
> > - Order:
> > - Downstream: CRC, Encrypt, BIP
>
> I understand if the chain has two operations then it may possible to have
> handcrafted SW code to do both operations in one pass.
> I understand the spec is agnostic on a number of passes it does require to
> enable the xfrom but To understand the SW/HW capability, In the above
> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> or one pass using HW?
[DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
However, this could also be performed as a single pass in a HW accelerator
>
>
>
> > - Upstream: BIP, Decrypt, CRC
> > - Specifications:
> > - Crypto: AES-128 [NIST FIPS-197] cipher, used in counter mode
> (AES-CTR), as described in [NIST SP800-38A].
> > - CRC: Ethernet 32-bit CRC as defined in Ethernet/[ISO/IEC 8802-3]
> > - BIP: 4-byte bit-interleaved even parity (BIP) field
> > computed over the entire FS frame, refer to ITU-T G.989.3, sections 8.1.1.5
> and 8.1.2.3 (https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-
> G.989.3-201510-I!!PDF-E)
> >
On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Hi Jerin,
Hi David,
> Thanks for the comments. Please see replies below.
>
> Kind Regards,
> David
>
> > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> > >
> > > Introduction
> > > ============
> > >
> > > This RFC introduces a new DPDK library, rte_accelerator.
> > >
> > > The main aim of this library is to provide a flexible and extensible way of
> > combining one or more packet-processing functions into a single operation,
> > thereby allowing these to be performed in parallel in optimized software
> > libraries or in a hardware accelerator. These functions can include
> > cryptography, compression and CRC/checksum calculation, while others can
> > potentially be added in the future. Performing these functions in parallel as a
> > single operation can enable a significant performance improvement.
> > >
> > >
> > > Background
> > > ==========
> > >
> > > There are a number of byte-wise operations which are present and
> > common across many access network data-plane pipelines, such as Cipher,
> > Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> > prototyping has been done at Intel in relation to the 01.org access-network-
> > dataplanes project to prove that a significant performance improvement is
> > possible when such byte-wise operations are combined into a single pass of
> > packet data processing. This performance boost has been prototyped for
> > both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >
> >
> > Could you share the relative performance numbers to show the gain?
>
> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>
> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
>
> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
> - XGS-PON MAC (Crypto-CRC-BIP):
> - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> 37% cycle reduction
>
> - DOCSIS MAC (Crypto-CRC):
> - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> 20% cycle reduction
>
> Adding the accelerator API will allow vendors gain the benefits of these cycle savings
Numbers make sense. I have seen a similar performance improvement
doing in one pass with CPU instructions.
> > > - XGS-PON MAC: Crypto-CRC-BIP
> > > - Order:
> > > - Downstream: CRC, Encrypt, BIP
> >
> > I understand if the chain has two operations then it may possible to have
> > handcrafted SW code to do both operations in one pass.
> > I understand the spec is agnostic on a number of passes it does require to
> > enable the xfrom but To understand the SW/HW capability, In the above
> > case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> > or one pass using HW?
>
> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
> However, this could also be performed as a single pass in a HW accelerator
As a specification, cascading the xform chains make sense.
Do we have any HW that does support chaining the xforms more than
"two" in one pass?
i.e real chaining function where two blocks of HWs work hand in hand
for chaining.
If none, it may be better to abstract as synonymous API(No dequeue, no
enqueue) for the CPU use case.
Hi Jerin, see reply below
> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi Jerin,
>
> Hi David,
>
> > Thanks for the comments. Please see replies below.
> >
> > Kind Regards,
> > David
> >
> > > On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com>
> wrote:
> > > >
> > > > Introduction
> > > > ============
> > > >
> > > > This RFC introduces a new DPDK library, rte_accelerator.
> > > >
> > > > The main aim of this library is to provide a flexible and
> > > > extensible way of
> > > combining one or more packet-processing functions into a single
> > > operation, thereby allowing these to be performed in parallel in
> > > optimized software libraries or in a hardware accelerator. These
> > > functions can include cryptography, compression and CRC/checksum
> > > calculation, while others can potentially be added in the future.
> > > Performing these functions in parallel as a single operation can enable a
> significant performance improvement.
> > > >
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > There are a number of byte-wise operations which are present and
> > > common across many access network data-plane pipelines, such as
> > > Cipher, Authentication, CRC, Bit-Interleaved-Parity (BIP), other
> > > checksums etc. Some prototyping has been done at Intel in relation
> > > to the 01.org access-network- dataplanes project to prove that a
> > > significant performance improvement is possible when such byte-wise
> > > operations are combined into a single pass of packet data
> > > processing. This performance boost has been prototyped for both XGS-
> PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> > >
> > >
> > > Could you share the relative performance numbers to show the gain?
> >
> > [DC] As mentioned above, the main performance gains are when the
> packet processing operations can be combined into a single pass of the
> packet.
> > Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS
> MAC) have been implemented in the AESNI MB library as single pass
> operation chains.
> >
> > We have modified the dpdk-crypto-perf-tester as part of our prototyping
> to test the cases where:
> > 1) each packet processing function is done as an independent stage
> > (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for
> > cipher, and a C function to calculate the BIP)
> > 2) all packet processing functions done as a single-pass operation in
> > AESNI MB through rte_cryptodev
> >
> > We see the following results for 1024 byte input frames from dpdk-crypto-
> perf-tester:
> > - XGS-PON MAC (Crypto-CRC-BIP):
> > - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> > - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> > 37% cycle reduction
> >
> > - DOCSIS MAC (Crypto-CRC):
> > - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> > - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> > 20% cycle reduction
> >
> > Adding the accelerator API will allow vendors gain the benefits of
> > these cycle savings
>
> Numbers make sense. I have seen a similar performance improvement doing
> in one pass with CPU instructions.
>
>
> > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > > - Order:
> > > > - Downstream: CRC, Encrypt, BIP
> > >
> > > I understand if the chain has two operations then it may possible to
> > > have handcrafted SW code to do both operations in one pass.
> > > I understand the spec is agnostic on a number of passes it does
> > > require to enable the xfrom but To understand the SW/HW capability,
> > > In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> > > or three passes in SW or one pass using HW?
> >
> > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> library SW.
> > However, this could also be performed as a single pass in a HW
> > accelerator
>
> As a specification, cascading the xform chains make sense.
> Do we have any HW that does support chaining the xforms more than "two"
> in one pass?
> i.e real chaining function where two blocks of HWs work hand in hand for
> chaining.
> If none, it may be better to abstract as synonymous API(No dequeue, no
> enqueue) for the CPU use case.
[DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:
Hi David,
> >
> >
> > > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > > > - Order:
> > > > > - Downstream: CRC, Encrypt, BIP
> > > >
> > > > I understand if the chain has two operations then it may possible to
> > > > have handcrafted SW code to do both operations in one pass.
> > > > I understand the spec is agnostic on a number of passes it does
> > > > require to enable the xfrom but To understand the SW/HW capability,
> > > > In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> > > > or three passes in SW or one pass using HW?
> > >
> > > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> > library SW.
> > > However, this could also be performed as a single pass in a HW
> > > accelerator
> >
> > As a specification, cascading the xform chains make sense.
> > Do we have any HW that does support chaining the xforms more than "two"
> > in one pass?
> > i.e real chaining function where two blocks of HWs work hand in hand for
> > chaining.
> > If none, it may be better to abstract as synonymous API(No dequeue, no
> > enqueue) for the CPU use case.
>
> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
There is a risk in drafting API that meant for HW without any HW
exists. Because there could be inefficiency on the metadata and fast
path API for both models.
For example, In the case of CPU based scheme, it will be pure overhead
emulate the "queue"(the enqueue and dequeue) for the sake of
abstraction where
CPU works better in the synchronous model and I have doubt that the
session-based scheme will work for HW or not as both difference HW
needs to work hand in hand(IOMMU aspects for two PCI device)
Having said that, I agree with the need for use case and API for CPU
case. Till we find a HW spec, we need to make the solution as CPU
specific and latter extend based on HW metadata required.
Accelerator API sounds like HW accelerator and there is no HW support
then it may not good. We can change the API that works for the use
cases that we know how it works efficiently.
> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
>
>
Hi Jerin, see below
>
> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> wrote:
>
> Hi David,
>
> > >
> > >
> > > > > > - XGS-PON MAC: Crypto-CRC-BIP
> > > > > > - Order:
> > > > > > - Downstream: CRC, Encrypt, BIP
> > > > >
> > > > > I understand if the chain has two operations then it may
> > > > > possible to have handcrafted SW code to do both operations in one
> pass.
> > > > > I understand the spec is agnostic on a number of passes it does
> > > > > require to enable the xfrom but To understand the SW/HW
> > > > > capability, In the above case, "CRC, Encrypt, BIP", It is done
> > > > > in one pass in SW or three passes in SW or one pass using HW?
> > > >
> > > > [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in
> > > > AESNI MB
> > > library SW.
> > > > However, this could also be performed as a single pass in a HW
> > > > accelerator
> > >
> > > As a specification, cascading the xform chains make sense.
> > > Do we have any HW that does support chaining the xforms more than
> "two"
> > > in one pass?
> > > i.e real chaining function where two blocks of HWs work hand in hand
> > > for chaining.
> > > If none, it may be better to abstract as synonymous API(No dequeue,
> > > no
> > > enqueue) for the CPU use case.
> >
> > [DC] I'm not aware of any HW that supports this at the moment, but that's
> not to say it couldn't in the future - if anyone else has any examples though,
> please feel free to share.
> > Regardless, I don't see why we would introduce a different API for SW
> devices and HW devices.
>
> There is a risk in drafting API that meant for HW without any HW exists.
> Because there could be inefficiency on the metadata and fast path API for
> both models.
> For example, In the case of CPU based scheme, it will be pure overhead
> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> where CPU works better in the synchronous model and I have doubt that the
> session-based scheme will work for HW or not as both difference HW needs
> to work hand in hand(IOMMU aspects for two PCI device)
[DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
the device type as such and therefore doesn't need to make different API calls.
The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
Finally, I'm unsure what you mean by this line:
"I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
What do you mean by different HW working "hand in hand" and "two PCI device"?
The intention is that 1 HW device (or its PMD) would have to support the accel xform chain
>
> Having said that, I agree with the need for use case and API for CPU case. Till
> we find a HW spec, we need to make the solution as CPU specific and latter
> extend based on HW metadata required.
> Accelerator API sounds like HW accelerator and there is no HW support then
> it may not good. We can change the API that works for the use cases that we
> know how it works efficiently.
>
>
>
>
>
>
>
> > It would be up to each underlying PMD to decide if/how it supports a
> > particular accelerator xform chain, but from an application's point of
> > view, the accelerator API is always the same
> >
> >
On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Hi Jerin, see below
Hi David,
> >
> > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > wrote:
> >
> >
> > There is a risk in drafting API that meant for HW without any HW exists.
> > Because there could be inefficiency on the metadata and fast path API for
> > both models.
> > For example, In the case of CPU based scheme, it will be pure overhead
> > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > where CPU works better in the synchronous model and I have doubt that the
> > session-based scheme will work for HW or not as both difference HW needs
> > to work hand in hand(IOMMU aspects for two PCI device)
>
> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
In this release, we added CPU crypto support in cryptodev to support
the synchronous model to fix the overhead.
>
> From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
> the device type as such and therefore doesn't need to make different API calls.
>
> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
>
> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
>
> Finally, I'm unsure what you mean by this line:
>
> "I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
>
> What do mean by different HW working "hand in hand" and "two PCI device"?
> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
I was thinking, it will be N PCIe devices that create the chain. Each
distinct PCI device does the fixed-function and chains them together.
I do understand the usage of QAT HW and CRC in SW.
So If I understand it correctly, in rte_security, we are combining
rte_ethdev and rte_cryptodev. With this spec, we are trying to
combine,
rte_cryptodev and rte_compressdev. So it looks good to me. My only
remaining concern is the name of this API, accelerator too generic
name. IMO, like rte_security, we may need to give more meaningful name
for the use case where cryptodev and compressdev can work together.
On Fri, 7 Feb 2020 19:48:17 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:
> On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi Jerin, see below
>
> Hi David,
>
> > >
> > > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > > wrote:
> > >
>
> > >
> > > There is a risk in drafting API that meant for HW without any HW exists.
> > > Because there could be inefficiency on the metadata and fast path API for
> > > both models.
> > > For example, In the case of CPU based scheme, it will be pure overhead
> > > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > > where CPU works better in the synchronous model and I have doubt that the
> > > session-based scheme will work for HW or not as both difference HW needs
> > > to work hand in hand(IOMMU aspects for two PCI device)
> >
> > [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> > In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> > The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
>
> In this release, we added CPU crypto support in cryptodev to support
> the synchronous model to fix the overhead.
>
> >
> > From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
> > the device type as such and therefore doesn't need to make different API calls.
> >
> > The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> > As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> > is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> > pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> >
> > We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> >
> > Finally, I'm unsure what you mean by this line:
> >
> > "I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> >
> > What do mean by different HW working "hand in hand" and "two PCI device"?
> > The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
>
> I was thinking, it will be N PCIe devices that create the chain. Each
> distinct PCI device does the fixed-function and chains them together.
>
> I do understand the usage of QAT HW and CRC in SW.
> So If I understand it correctly, in rte_security, we are combining
> rte_ethdev and rte_cryptodev. With this spec, we are trying to
> combine,
> rte_cryptodev and rte_compressdev. So it looks good to me. My only
> remaining concern is the name of this API, accelerator too generic
> name. IMO, like rte_security, we may need to give more meaningful name
> for the use case where crytodev and compressdev can work together.
Having an API that could be used by parallel hardware does make sense,
but the DPDK already has multiple packet processing infrastructure pieces.
I would rather the DPDK converge on one widely used, robust and tested packet
method. Rather than the current "choose your poison or roll your own" which is
what we have now. The proposed graph seems to be the best so far.
On Sat, Feb 8, 2020 at 2:04 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Fri, 7 Feb 2020 19:48:17 +0530
> Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> > On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> > >
> > > Hi Jerin, see below
> >
> > Hi David,
> >
> > > >
> > > > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> > > > wrote:
> > > >
> >
> > > >
> > > > There is a risk in drafting API that meant for HW without any HW exists.
> > > > Because there could be inefficiency on the metadata and fast path API for
> > > > both models.
> > > > For example, In the case of CPU based scheme, it will be pure overhead
> > > > emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> > > > where CPU works better in the synchronous model and I have doubt that the
> > > > session-based scheme will work for HW or not as both difference HW needs
> > > > to work hand in hand(IOMMU aspects for two PCI device)
> > >
> > > [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> > > In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> > > The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
> >
> > In this release, we added CPU crypto support in cryptodev to support
> > the synchronous model to fix the overhead.
> >
> > >
> > > From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
> > > the device type as such and therefore doesn't need to make different API calls.
> > >
> > > The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> > > As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> > > is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> > > pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> > >
> > > We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> > >
> > > Finally, I'm unsure what you mean by this line:
> > >
> > > "I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> > >
> > > What do mean by different HW working "hand in hand" and "two PCI device"?
> > > The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
> >
> > I was thinking, it will be N PCIe devices that create the chain. Each
> > distinct PCI device does the fixed-function and chains them together.
> >
> > I do understand the usage of QAT HW and CRC in SW.
> > So If I understand it correctly, in rte_security, we are combining
> > rte_ethdev and rte_cryptodev. With this spec, we are trying to
> > combine,
> > rte_cryptodev and rte_compressdev. So it looks good to me. My only
> > remaining concern is the name of this API, accelerator too generic
> > name. IMO, like rte_security, we may need to give more meaningful name
> > for the use case where crytodev and compressdev can work together.
>
> Having an API that could be used by parallel hardware does make sense,
> but the DPDK already has multiple packet processing infrastructure pieces.
>
> I would rather the DPDK converge on one widely used, robust and tested packet
> method. Rather than the current "choose your poison or roll your own" which is
> what we have now. The proposed graph seems to be the best so far.
I agree. Even I thought of saying graph can do this, as it has higher
abstraction and runtime chaining support, but then I thought it would
be self-marketing.
David could you check https://www.mail-archive.com/dev@dpdk.org/msg156318.html
If this one only focusing crypto dev + compressdev, What if we have
ethdev + compressdev + security device in the future.
graph has higher abstraction so it can accommodate ANY chaining
requirements. i.e AESNI-MB + QAT will go as a separate node
On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
>>
>> Hi Jerin,
>
> Hi David,
>
>> Thanks for the comments. Please see replies below.
>>
>> Kind Regards,
>> David
>>
>>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This RFC introduces a new DPDK library, rte_accelerator.
>>>>
>>>> The main aim of this library is to provide a flexible and extensible way of
>>> combining one or more packet-processing functions into a single operation,
>>> thereby allowing these to be performed in parallel in optimized software
>>> libraries or in a hardware accelerator. These functions can include
>>> cryptography, compression and CRC/checksum calculation, while others can
>>> potentially be added in the future. Performing these functions in parallel as a
>>> single operation can enable a significant performance improvement.
>>>>
>>>>
>>>> Background
>>>> ==========
>>>>
>>>> There are a number of byte-wise operations which are present and
>>> common across many access network data-plane pipelines, such as Cipher,
>>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
>>> prototyping has been done at Intel in relation to the 01.org access-network-
>>> dataplanes project to prove that a significant performance improvement is
>>> possible when such byte-wise operations are combined into a single pass of
>>> packet data processing. This performance boost has been prototyped for
>>> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
>>>
>>>
>>> Could you share the relative performance numbers to show the gain?
>>
>> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
>> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
>>
>> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
>> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
>> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
>>
>> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
>> - XGS-PON MAC (Crypto-CRC-BIP):
>> - 3 independent stages: 1429 cycles/buf (13.75Gbps)
>> - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
>> 37% cycle reduction
>>
>> - DOCSIS MAC (Crypto-CRC):
>> - 2 independent stages: 1421 cycles/buf (13.84Gbps)
>> - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
>> 20% cycle reduction
>>
>> Adding the accelerator API will allow vendors gain the benefits of these cycle savings
>
> Numbers make sense. I have seen a similar performance improvement
> doing in one pass with CPU instructions.
>
>
>>>> - XGS-PON MAC: Crypto-CRC-BIP
>>>> - Order:
>>>> - Downstream: CRC, Encrypt, BIP
>>>
>>> I understand if the chain has two operations then it may possible to have
>>> handcrafted SW code to do both operations in one pass.
>>> I understand the spec is agnostic on a number of passes it does require to
>>> enable the xfrom but To understand the SW/HW capability, In the above
>>> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
>>> or one pass using HW?
>>
>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
>> However, this could also be performed as a single pass in a HW accelerator
>
> As a specification, cascading the xform chains make sense.
> Do we have any HW that does support chaining the xforms more than
> "two" in one pass?
> i.e real chaining function where two blocks of HWs work hand in hand
> for chaining.
> If none, it may be better to abstract as synonymous API(No dequeue, no
> enqueue) for the CPU use case.
>
Were you thinking along the lines of a synchronous API option like that
just introduced to cryptodev? i.e. something like

uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
                struct rte_accelerator_op ops[],
                uint16_t nb_ops);
On 06/02/2020 5:13 PM, Jerin Jacob wrote:
> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:
>
> Hi David,
>
>>>
>>>
>>>>>> - XGS-PON MAC: Crypto-CRC-BIP
>>>>>> - Order:
>>>>>> - Downstream: CRC, Encrypt, BIP
>>>>>
>>>>> I understand if the chain has two operations then it may possible to
>>>>> have handcrafted SW code to do both operations in one pass.
>>>>> I understand the spec is agnostic on a number of passes it does
>>>>> require to enable the xfrom but To understand the SW/HW capability,
>>>>> In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
>>>>> or three passes in SW or one pass using HW?
>>>>
>>>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
>>> library SW.
>>>> However, this could also be performed as a single pass in a HW
>>>> accelerator
>>>
>>> As a specification, cascading the xform chains make sense.
>>> Do we have any HW that does support chaining the xforms more than "two"
>>> in one pass?
>>> i.e real chaining function where two blocks of HWs work hand in hand for
>>> chaining.
>>> If none, it may be better to abstract as synonymous API(No dequeue, no
>>> enqueue) for the CPU use case.
>>
>> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
>> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
>
> There is a risk in drafting API that meant for HW without any HW
> exists. Because there could be inefficiency on the metadata and fast
> path API for both models.
> For example, In the case of CPU based scheme, it will be pure overhead
> emulate the "queue"(the enqueue and dequeue) for the sake of
> abstraction where
> CPU works better in the synchronous model and I have doubt that the
> session-based scheme will work for HW or not as both difference HW
> needs to work hand in hand(IOMMU aspects for two PCI device)
We do have some prototypes in hardware which can do operation chaining,
but in the case we have looked at it is a single multi-function accelerator
device, which means the orchestration (order, passing of data etc.) of the
chained operations is handled within the device itself. This meant we didn't
see issues with shared session data, and moving data along the discrete
independent stages of a hardware pipeline wasn't an issue.
Although if you wanted to offer this type of chained offload, I think we
would need the driver to handle this for the user, rather than the
application needing to understand how the hardware pipeline is interacting.
>
> Having said that, I agree with the need for use case and API for CPU
> case. Till we find a HW spec, we need to make the solution as CPU
> specific and latter extend based on HW metadata required.
> Accelerator API sounds like HW accelerator and there is no HW support
> then it may not good. We can change the API that works for the use
> cases that we know how it works efficiently.
>
>> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
>>
>>
On 07/02/2020 2:18 PM, Jerin Jacob wrote:
> On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
>>
>> Hi Jerin, see below
>
> Hi David,
>
>>>
>>> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
>>> wrote:
>>>
>
>>>
>>> There is a risk in drafting API that meant for HW without any HW exists.
>>> Because there could be inefficiency on the metadata and fast path API for
>>> both models.
>>> For example, In the case of CPU based scheme, it will be pure overhead
>>> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
>>> where CPU works better in the synchronous model and I have doubt that the
>>> session-based scheme will work for HW or not as both difference HW needs
>>> to work hand in hand(IOMMU aspects for two PCI device)
>>
>> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
>> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
>> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
>
> In this release, we added CPU crypto support in cryptodev to support
> the synchronous model to fix the overhead.
>
>>
>> From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
>> the device type as such and therefore doesn't need to make different API calls.
>>
>> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
>> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
>> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
>> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
>>
>> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
>>
>> Finally, I'm unsure what you mean by this line:
>>
>> "I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
>>
>> What do mean by different HW working "hand in hand" and "two PCI device"?
>> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
>
> I was thinking, it will be N PCIe devices that create the chain. Each
> distinct PCI device does the fixed-function and chains them together.
>
The case we were looking at is more focused on a single discrete
(multi-function) device (from the perspective of the host) providing a
number of transforms (operations) in a single pass, rather than the case
of N discrete hardware devices (from the perspective of the host)
chained together to achieve the same set of transforms.
> I do understand the usage of QAT HW and CRC in SW.
> So If I understand it correctly, in rte_security, we are combining
> rte_ethdev and rte_cryptodev. With this spec, we are trying to
> combine,
> rte_cryptodev and rte_compressdev. So it looks good to me. My only
> remaining concern is the name of this API, accelerator too generic
> name. IMO, like rte_security, we may need to give more meaningful name
> for the use case where crytodev and compressdev can work together.
>
On Thu, Feb 13, 2020 at 5:01 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 06/02/2020 10:54 AM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 3:35 PM Coyle, David <david.coyle@intel.com> wrote:
> >>
> >> Hi Jerin,
> >
> > Hi David,
> >
> >> Thanks for the comments. Please see replies below.
> >>
> >> Kind Regards,
> >> David
> >>
> >>> On Tue, Feb 4, 2020 at 8:15 PM David Coyle <david.coyle@intel.com> wrote:
> >>>>
> >>>> Introduction
> >>>> ============
> >>>>
> >>>> This RFC introduces a new DPDK library, rte_accelerator.
> >>>>
> >>>> The main aim of this library is to provide a flexible and extensible way of
> >>> combining one or more packet-processing functions into a single operation,
> >>> thereby allowing these to be performed in parallel in optimized software
> >>> libraries or in a hardware accelerator. These functions can include
> >>> cryptography, compression and CRC/checksum calculation, while others can
> >>> potentially be added in the future. Performing these functions in parallel as a
> >>> single operation can enable a significant performance improvement.
> >>>>
> >>>>
> >>>> Background
> >>>> ==========
> >>>>
> >>>> There are a number of byte-wise operations which are present and
> >>> common across many access network data-plane pipelines, such as Cipher,
> >>> Authentication, CRC, Bit-Interleaved-Parity (BIP), other checksums etc. Some
> >>> prototyping has been done at Intel in relation to the 01.org access-network-
> >>> dataplanes project to prove that a significant performance improvement is
> >>> possible when such byte-wise operations are combined into a single pass of
> >>> packet data processing. This performance boost has been prototyped for
> >>> both XGS-PON MAC data-plane and DOCSIS MAC data-plane pipelines.
> >>>
> >>>
> >>> Could you share the relative performance numbers to show the gain?
> >>
> >> [DC] As mentioned above, the main performance gains are when the packet processing operations can be combined into a single pass of the packet.
> >> Both Crypto-CRC-BIP (for XGS-PON MAC) and Crypto-CRC (for DOCSIS MAC) have been implemented in the AESNI MB library as single pass operation chains.
> >>
> >> We have modified the dpdk-crypto-perf-tester as part of our prototyping to test the cases where:
> >> 1) each packet processing function is done as an independent stage (e.g. calling rte_net_crc for CRC, AESNI MB through rte_cryptodev for cipher, and a C function to calculate the BIP)
> >> 2) all packet processing functions done as a single-pass operation in AESNI MB through rte_cryptodev
> >>
> >> We see the following results for 1024 byte input frames from dpdk-crypto-perf-tester:
> >> - XGS-PON MAC (Crypto-CRC-BIP):
> >> - 3 independent stages: 1429 cycles/buf (13.75Gbps)
> >> - 1 single-pass stage: 896 cycles/buf (21.9Gbps)
> >> 37% cycle reduction
> >>
> >> - DOCSIS MAC (Crypto-CRC):
> >> - 2 independent stages: 1421 cycles/buf (13.84Gbps)
> >> - 1 single-pass stage: 1133 cycles/buf (17.34Gbps)
> >> 20% cycle reduction
> >>
> >> Adding the accelerator API will allow vendors gain the benefits of these cycle savings
> >
> > Numbers make sense. I have seen a similar performance improvement
> > doing in one pass with CPU instructions.
> >
> >
> >>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>> - Order:
> >>>> - Downstream: CRC, Encrypt, BIP
> >>>
> >>> I understand if the chain has two operations then it may possible to have
> >>> handcrafted SW code to do both operations in one pass.
> >>> I understand the spec is agnostic on a number of passes it does require to
> >>> enable the xfrom but To understand the SW/HW capability, In the above
> >>> case, "CRC, Encrypt, BIP", It is done in one pass in SW or three passes in SW
> >>> or one pass using HW?
> >>
> >> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB library SW.
> >> However, this could also be performed as a single pass in a HW accelerator
> >
> > As a specification, cascading the xform chains make sense.
> > Do we have any HW that does support chaining the xforms more than
> > "two" in one pass?
> > i.e real chaining function where two blocks of HWs work hand in hand
> > for chaining.
> > If none, it may be better to abstract as synonymous API(No dequeue, no
> > enqueue) for the CPU use case.
> >
>
> Where you thinking along the lines of a synchronous API option like that
> just introduced to crytodev? i.e something like
>
> uint16_t rte_accelerator_process(struct rte_accelerator_ctx *ctx,
> struct rte_accelerator_op ops[],
> uint16_t nb_ops);
Yes. Maybe with a capability or preference flag to indicate the
preferred usage model to the application.
>
>
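As a rough illustration of that synchronous option, a CPU-only caller could process a burst of chained operations directly, with no queue pair involved. The sketch below is only indicative: rte_accelerator_process() is the signature floated above (it is not part of the RFC patch), and process_chain_sync() and the pre-built ops[] array are hypothetical placeholders.

/* Sketch only: rte_accelerator_process() is the synchronous call suggested
 * above, not something defined in the RFC patch. ctx and the ops[] array
 * are assumed to be set up exactly as for the enqueue/dequeue path.
 */
static int
process_chain_sync(struct rte_accelerator_ctx *ctx,
                   struct rte_accelerator_op ops[], uint16_t nb_ops)
{
        uint16_t i;
        uint16_t nb_done = rte_accelerator_process(ctx, ops, nb_ops);

        for (i = 0; i < nb_done; i++)
                if (ops[i].overall_status != RTE_ACCELERATOR_OP_STATUS_SUCCESS)
                        return -1;      /* a stage in the op chain failed */

        return nb_done == nb_ops ? 0 : -1;
}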
On Thu, Feb 13, 2020 at 5:20 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 07/02/2020 2:18 PM, Jerin Jacob wrote:
> > On Fri, Feb 7, 2020 at 6:08 PM Coyle, David <david.coyle@intel.com> wrote:
> >>
> >> Hi Jerin, see below
> >
> > Hi David,
> >
> >>>
> >>> On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com>
> >>> wrote:
> >>>
> >
> >>>
> >>> There is a risk in drafting API that meant for HW without any HW exists.
> >>> Because there could be inefficiency on the metadata and fast path API for
> >>> both models.
> >>> For example, In the case of CPU based scheme, it will be pure overhead
> >>> emulate the "queue"(the enqueue and dequeue) for the sake of abstraction
> >>> where CPU works better in the synchronous model and I have doubt that the
> >>> session-based scheme will work for HW or not as both difference HW needs
> >>> to work hand in hand(IOMMU aspects for two PCI device)
> >>
> >> [DC] I understand what you are saying about the overhead of emulating the "sw queue" but this same model is already used in many of the existing device PMDs.
> >> In the case of SW devices, such as AESNI-MB or NULL for crypto or zlib for compression, the enqueue/dequeue in the PMD is emulated through an rte_ring which is very efficient.
> >> The accelerator API will use the existing device PMDs so keeping the same model seems like a sensible approach.
> >
> > In this release, we added CPU crypto support in cryptodev to support
> > the synchronous model to fix the overhead.
> >
> >>
> >> From an application's point of view, this abstraction of the underlying device type is important for usability and maintainability - the application doesn't need to know
> >> the device type as such and therefore doesn't need to make different API calls.
> >>
> >> The enqueue/dequeue type API was also used with QAT in mind. While QAT HW doesn't support these xform chains at the moment, it could potentially do so in the future.
> >> As a side note, as part of the work of adding the accelerator API, the QAT PMD will be updated to support the DOCSIS Crypto-CRC accelerator xform chain, where the Crypto
> >> is done on QAT HW and the CRC will be done in SW, most likely through a call to the optimized rte_net_crc library. This will give a consistent API for the DOCSIS-MAC data-plane
> >> pipeline prototype we have developed, which uses both AESNI-MB and QAT for benchmarks.
> >>
> >> We will take your feedback on the enqueue/dequeue approach for SW devices into consideration though during development.
> >>
> >> Finally, I'm unsure what you mean by this line:
> >>
> >> "I have doubt that the session-based scheme will work for HW or not as both difference HW needs to work hand in hand(IOMMU aspects for two PCI device)"
> >>
> >> What do mean by different HW working "hand in hand" and "two PCI device"?
> >> The intention is that 1 HW device (or it's PMD) would have to support the accel xform chain
> >
> > I was thinking, it will be N PCIe devices that create the chain. Each
> > distinct PCI device does the fixed-function and chains them together.
> >
>
> The case we were looking at is more focused on a single discrete
> (multi-function) device (from the perspective of the host) providing a
> number of transforms (operations) in a single pass rather than the case
> of N discrete hardware devices (from the perspective of the host)
> chained together to achieve the same transforms set.
Yes. For a multi-function PCI device it will be easy, as it has a single
DMA stream, i.e. a single IOMMU context.
DMA between two discrete hardware devices may need additional semantics.
>
>
> > I do understand the usage of QAT HW and CRC in SW.
> > So If I understand it correctly, in rte_security, we are combining
> > rte_ethdev and rte_cryptodev. With this spec, we are trying to
> > combine,
> > rte_cryptodev and rte_compressdev. So it looks good to me. My only
> > remaining concern is the name of this API, accelerator too generic
> > name. IMO, like rte_security, we may need to give more meaningful name
> > for the use case where crytodev and compressdev can work together.
> >
>
On Thu, Feb 13, 2020 at 5:14 PM Doherty, Declan
<declan.doherty@intel.com> wrote:
>
> On 06/02/2020 5:13 PM, Jerin Jacob wrote:
> > On Thu, Feb 6, 2020 at 10:01 PM Coyle, David <david.coyle@intel.com> wrote:
> >
> > Hi David,
> >
> >>>
> >>>
> >>>>>> - XGS-PON MAC: Crypto-CRC-BIP
> >>>>>> - Order:
> >>>>>> - Downstream: CRC, Encrypt, BIP
> >>>>>
> >>>>> I understand if the chain has two operations then it may possible to
> >>>>> have handcrafted SW code to do both operations in one pass.
> >>>>> I understand the spec is agnostic on a number of passes it does
> >>>>> require to enable the xfrom but To understand the SW/HW capability,
> >>>>> In the above case, "CRC, Encrypt, BIP", It is done in one pass in SW
> >>>>> or three passes in SW or one pass using HW?
> >>>>
> >>>> [DC] The CRC, Encrypt, BIP is also currently done as 1 pass in AESNI MB
> >>> library SW.
> >>>> However, this could also be performed as a single pass in a HW
> >>>> accelerator
> >>>
> >>> As a specification, cascading the xform chains make sense.
> >>> Do we have any HW that does support chaining the xforms more than "two"
> >>> in one pass?
> >>> i.e real chaining function where two blocks of HWs work hand in hand for
> >>> chaining.
> >>> If none, it may be better to abstract as synonymous API(No dequeue, no
> >>> enqueue) for the CPU use case.
> >>
> >> [DC] I'm not aware of any HW that supports this at the moment, but that's not to say it couldn't in the future - if anyone else has any examples though, please feel free to share.
> >> Regardless, I don't see why we would introduce a different API for SW devices and HW devices.
> >
> > There is a risk in drafting API that meant for HW without any HW
> > exists. Because there could be inefficiency on the metadata and fast
> > path API for both models.
> > For example, In the case of CPU based scheme, it will be pure overhead
> > emulate the "queue"(the enqueue and dequeue) for the sake of
> > abstraction where
> > CPU works better in the synchronous model and I have doubt that the
> > session-based scheme will work for HW or not as both difference HW
> > needs to work hand in hand(IOMMU aspects for two PCI device)
>
> We do have some proto-types in hardware which can do operation chaining
> but in the case we have looked at, it is a single accelerator device
> with multi-function which means the orchestration (order, passing of
> data etc) of the chained operations is handled within the device itself,
> meaning that we didn't see issues with shared session data or handling
> moving data along discrete independent stage of a hardware pipeline
> wasn't an issue.
>
> Although if you wanted to offer this type of chained offload, I think we
> would need the driver to handle this for the user, rather than the
> application needing to understand how the hardware pipeline is interacting.
Yes. The application should not need to understand the specifics.
The only question is how to make this generic so that any hardware/SW
pipeline can work.
Currently, we have rte_security, which works on ethdev and cryptodev.
This new spec is going to work on rte_cryptodev and rte_compressdev. If
we later need a pipeline that works across rte_cryptodev, rte_compressdev
and ethdev, then we need to invent yet another library.
I agree with the need for the hardware/SW pipeline.
As Stephen suggested, why not look for a general abstraction for HW/SW
based pipelines?
Marvell had a similar problem in abstracting various HW/SW pipelines.
Here is a proposal for a generic HW/SW pipeline:
http://mails.dpdk.org/archives/dev/2020-January/156765.html
If the focus is only a specific case, say "CRC + something else", it is
better to have an API for that, and better not to call it the
accelerator for the packet processing pipeline, as that name has a much
bigger scope.
Just my 2c.
>
> >
> > Having said that, I agree with the need for use case and API for CPU
> > case. Till we find a HW spec, we need to make the solution as CPU
> > specific and latter extend based on HW metadata required.
> > Accelerator API sounds like HW accelerator and there is no HW support
> > then it may not good. We can change the API that works for the use
> > cases that we know how it works efficiently.
> >
> >> It would be up to each underlying PMD to decide if/how it supports a particular accelerator xform chain, but from an application's point of view, the accelerator API is always the same
> >>
> >>
>
> >
> > Having an API that could be used by parallel hardware does make sense,
> > but the DPDK already has multiple packet processing infrastructure pieces.
> >
> > I would rather the DPDK converge on one widely used, robust and tested
> > packet method. Rather than the current "choose your poison or roll
> > your own" which is what we have now. The proposed graph seems to be
> the best so far.
>
> I agree. Even I thought of saying graph can do this, as, it has higher
> abstraction and runtime chaining support, but then I thought it will be self
> markering.
> David could you check https://www.mail-
> archive.com/dev@dpdk.org/msg156318.html
> If this one only focusing crypto dev + compressdev, What if we have ethdev
> + compressdev + security device in the future.
> graph has higher abstraction so it can accommodate ANY chaining
> requirements. i.e AESNI-MB + QAT will go as a separate node
[DC] We have looked at the graph node library and we don't feel that using graph is the correct solution for what we are trying to solve here.
We want to combine 2 or more packet processing functions on a packet into a single operation on a single device, be that an optimized software library such as AESNI MB or a hardware accelerator such as QAT.
So yes, these 2 packet processing functions could be a node (or nodes) within a graph.
However, they would still need to be combined together at some point to be processed on the device as a single operation.
Our new proposal is to use rte_rawdev to access the devices, and we propose to add a "multi-function" interface which the application and rawdev PMDs will use to create the xform chains, sessions and op chains.
The full details of this new proposal have been sent to you in a separate post, and we feel it addresses the concerns raised about the original rte_accelerator API.
In the future, rawdev enqueue/dequeue calls using this multi-function interface could potentially be configured as a node within a packet processing graph.
On Thu, Mar 5, 2020 at 10:32 PM Coyle, David <david.coyle@intel.com> wrote:
>
> > >
> > > Having an API that could be used by parallel hardware does make sense,
> > > but the DPDK already has multiple packet processing infrastructure pieces.
> > >
> > > I would rather the DPDK converge on one widely used, robust and tested
> > > packet method. Rather than the current "choose your poison or roll
> > > your own" which is what we have now. The proposed graph seems to be
> > the best so far.
> >
> > I agree. Even I thought of saying graph can do this, as, it has higher
> > abstraction and runtime chaining support, but then I thought it will be self
> > markering.
> > David could you check https://www.mail-
> > archive.com/dev@dpdk.org/msg156318.html
> > If this one only focusing crypto dev + compressdev, What if we have ethdev
> > + compressdev + security device in the future.
> > graph has higher abstraction so it can accommodate ANY chaining
> > requirements. i.e AESNI-MB + QAT will go as a separate node
>
> [DC] We have looked at the graph node library and we don't feel that using graph is the correct solution for what we are trying to solve here.
> We want to combine 2 or more packet processing functions on a packet into a single operation on a single device, be that an optimized software library such as AESNI MB or a hardware accelerator such as QAT
> So yes, these 2 packet processing functions could be a node (or nodes) within a graph.
> However they would still need to be combined together at some point to be processed on the device as a single operation.
It is possible with the graph architecture: when the application gives
two nodes and they are connected, it would be possible to replace them
with an optimized node at runtime.
Having said that, it is NOT good to dedicate the graph as THE MODEL
for the pipeline, i.e. the user should get both options rather than being
stuck with one model.
In that way, it makes sense not to abstract this as a graph.
>
> Our new proposal is to use rte_rawdev to access the devices and we propose to add a "multi-function" interface which the application and rawdev PMDs will use to create the xform chains, sessions and op chains.
> The full details on this new proposal have been sent to you in a separate post and we feel it addresses the concerns of the original rte_accelerator API
>
> In the future, rawdev enqueue/dequeue calls using this multi-function interface could potentially be configured as a node within a packet processing graph
Yes. Makes sense.
>
new file mode 100644
@@ -0,0 +1,336 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ACCELERATOR_H_
+#define _RTE_ACCELERATOR_H_
+
+/**
+ * @file rte_accelerator.h
+ *
+ * RTE Accelerator Common Definitions
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_compat.h>
+#include <rte_common.h>
+#include <rte_mbuf.h>
+#include <rte_memory.h>
+#include <rte_mempool.h>
+#include <rte_comp.h>
+#include <rte_crypto.h>
+
+#include "rte_err_detect.h"
+
+/**
+ * Accelerator transform types
+ */
+enum rte_accelerator_xform_type {
+ RTE_ACC_XFORM_TYPE_CRYPTO_SYM,
+ /**< Symmetric crypto transform type */
+ RTE_ACC_XFORM_TYPE_CRYPTO_ASYM,
+ /**< Asymmetric crypto transform type */
+ RTE_ACC_XFORM_TYPE_COMP,
+ /**< Compression transform type */
+ RTE_ACC_XFORM_TYPE_ERR_DETECT,
+ /**< Error detection transform type */
+};
+
+/**
+ * Accelerator transform setup data
+ *
+ * This structure is used to specify the accelerator transforms required.
+ * Multiple transforms can be chained together to specify a chain of transforms
+ * such as symmetric crypto followed by error detection, or compression followed
+ * by symmetric crypto. Each transform structure holds a single transform, with
+ * the type field specifying which transform is contained within the union.
+ */
+struct rte_accelerator_xform {
+ struct rte_accelerator_xform *next;
+ /**<
+ * Next transform in the chain
+ * - the last transform in the chain MUST set this to NULL
+ */
+ enum rte_accelerator_xform_type type;
+ /**< Transform type */
+
+ RTE_STD_C11
+ union {
+ struct rte_crypto_sym_xform crypto_sym;
+ /**< Symmetric crypto transform */
+ struct rte_crypto_asym_xform crypto_asym;
+ /**< Asymmetric crypto transform */
+ struct rte_comp_xform comp;
+ /**< Compression transform */
+ struct rte_err_detect_xform err_detect;
+ /**< Error detection transform */
+ };
+};
+
+/**
+ * Accelerator operation types
+ *
+ * Each value must be a power of 2 so that the operations can be combined into
+ * a bitmask (see rte_accelerator_op_pool_create())
+ */
+enum rte_accelerator_op_type {
+ RTE_ACCELERATOR_OP_TYPE_CRYPTO_SYM = (0x1 << 0),
+ /**< Symmetric crypto operation type */
+ RTE_ACCELERATOR_OP_TYPE_CRYPTO_ASYM = (0x1 << 1),
+ /**< Asymmetric crypto operation type */
+ RTE_ACCELERATOR_OP_TYPE_COMP = (0x1 << 2),
+ /**< Compression operation type */
+ RTE_ACCELERATOR_OP_TYPE_ERR_DETECT = (0x1 << 3),
+ /**< Error detection operation type */
+};
+
+/**
+ * Accelerator operation status
+ */
+enum rte_accelerator_op_status {
+ RTE_ACCELERATOR_OP_STATUS_NOT_PROCESSED = 0,
+ /**< Operation has not yet been processed by a device */
+ RTE_ACCELERATOR_OP_STATUS_SUCCESS,
+ /**< Operation completed successfully */
+ RTE_ACCELERATOR_OP_STATUS_FAILURE,
+ /**< Operation completed with failure */
+};
+
+/**
+ * Accelerator operation data
+ *
+ * This structure is used to specify the operations for a particular session.
+ * This includes specifying the source and, if required, destination mbufs and
+ * the lengths and offsets of the data within these mbufs on which the
+ * operations should be done. Multiple operations are chained together to
+ * specify the full set of operations to be performed
+ *
+ * @note The rte_accelerator_op chain MUST match the session's xform
+ * chain exactly
+ * @note The first rte_accelerator_op element in the chain is the parent
+ * operation. The following fields MUST be set in this first operation before
+ * enqueuing and are ignored in the inner operations and any subsequent
+ * rte_accelerator_op chain elements:
+ * - *sess*
+ * - *m_src*
+ * - *m_dst* (if required)
+ * @note If *sess* or *m_src* is not set in the first rte_accelerator_op, this
+ * operation is invalid and will cause an error when attempting to enqueue.
+ * @note The following fields MUST be set in ALL rte_accelerator_op chain
+ * elements:
+ * - *next*
+ * - *mempool*
+ * - *type*
+ * @note After the operation has been dequeued, only the FIRST (i.e. the parent)
+ * rte_accelerator_op in the chain will contain the *overall_status*. Each
+ * chain element will contain its individual *op_status*, the value of which is
+ * specific to the operation type (e.g. an ::rte_crypto_op_status,
+ * ::rte_comp_op_status or ::rte_err_detect_op_status)
+ */
+struct rte_accelerator_op {
+ struct rte_accelerator_op *next;
+ /**<
+ * Next operation in the chain
+ * - the last operation in the chain MUST set this to NULL
+ */
+ struct rte_accelerator_session *sess;
+ /**< Handle for the associated accelerator session */
+
+ struct rte_mempool *mempool;
+ /**< Mempool from which the operation is allocated */
+
+ struct rte_mbuf *m_src; /**< Source mbuf */
+ struct rte_mbuf *m_dst; /**< Destination mbuf */
+
+ enum rte_accelerator_op_status overall_status;
+ /**<
+ * Overall operation status
+ * - indicates if all the operations in the chain succeeded or if any
+ * one of them failed
+ */
+
+ uint8_t op_status;
+ /**<
+ * Individual operation status
+ * - indicates the status of the individual operation in the chain
+ */
+
+ enum rte_accelerator_op_type type;
+ /**< Operation type */
+
+ RTE_STD_C11
+ union {
+ struct rte_crypto_sym_op crypto_sym;
+ /**< Symmetric crypto operation */
+ struct rte_crypto_asym_op crypto_asym;
+ /**< Asymmetric crypto operation */
+ struct rte_comp_op comp;
+ /**< Compression operation */
+ struct rte_err_detect_op err_detect;
+ /**< Error detection operation */
+ };
+};
+
+/**
+ * Accelerator context for a device
+ *
+ * Accelerator instance for each device driver to register their accelerator
+ * operations. The application can get the accelerator context from the
+ * underlying device using the API functions provided by the device's main API
+ * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
+ * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
+ */
+struct rte_accelerator_ctx;
+
+/**
+ * Accelerator session data
+ */
+struct rte_accelerator_session;
+
+/**
+ * Create accelerator session as specified by the transform chain
+ *
+ * @param ctx Accelerator device instance
+ * @param xform Pointer to the first element of the session transform
+ * chain
+ * @param socket_id Socket to allocate the session on
+ *
+ * @return
+ * - Pointer to session, if successful
+ * - NULL, on failure
+ */
+__rte_experimental
+struct rte_accelerator_session *
+rte_accelerator_session_create(struct rte_accelerator_ctx *ctx,
+ struct rte_accelerator_xform *xform,
+ int socket_id);
+
+/**
+ * Free accelerator session header and the session private data and return it
+ * to its original mempool
+ *
+ * @param ctx Accelerator device instance
+ * @param sess Accelerator session to be freed
+ *
+ * @return
+ * - 0, if successful
+ * - -EINVAL, if session is NULL
+ * - -EBUSY, if not all device private data has been freed
+ */
+__rte_experimental
+int
+rte_accelerator_session_destroy(struct rte_accelerator_ctx *ctx,
+ struct rte_accelerator_session *sess);
+
+/**
+ * Creates an accelerator operation pool
+ *
+ * @param name Pool name
+ * @param op_types Bitmask of operations which this pool must support. This
+ * bitmask allows this function to determine the maximum size
+ * of operation which must be accommodated in the mempool
+ * elements. See ::rte_accelerator_op_type for possible
+ * bitmask values
+ * @param nb_elts Number of elements in the pool
+ * @param cache_size Number of elements to cache on lcore, see
+ * rte_mempool_create() for further details about cache
+ * size
+ * @param priv_size Size of private data to allocate with each operation
+ * @param socket_id Socket to allocate the mempool on
+ *
+ * @return
+ * - Pointer to mempool, if successful
+ * - NULL, on failure
+ */
+__rte_experimental
+struct rte_mempool *
+rte_accelerator_op_pool_create(const char *name,
+ uint32_t op_types,
+ unsigned int nb_elts,
+ unsigned int cache_size,
+ uint16_t priv_size,
+ int socket_id);
+
+/**
+ * Bulk allocate accelerator operations from a mempool
+ *
+ * @param mempool Accelerator operation mempool
+ * @param ops Array in which to place allocated accelerator operations
+ * @param nb_ops Number of accelerator operations to allocate
+ *
+ * @return
+ * - *nb_ops*, if the number of operations requested were allocated
+ * - 0, if the requested number of ops are not available. None are allocated in
+ * this case
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_op_bulk_alloc(struct rte_mempool *mempool,
+ struct rte_accelerator_op **ops,
+ uint16_t nb_ops);
+
+/**
+ * Free accelerator operation back to its mempool
+ *
+ * @param op Accelerator operation
+ */
+__rte_experimental
+void
+rte_accelerator_op_free(struct rte_accelerator_op *op);
+
+/**
+ * Enqueue a burst of operations for processing on the specified queue pair
+ * of a device
+ *
+ * @param ctx Accelerator device instance
+ * @param qp_id Index of the device's queue pair to which operations
+ * are to be enqueued for processing
+ * @param ops Array of *nb_ops* pointers to accelerator operations
+ * to be processed
+ * @param nb_ops Number of operations to process
+ *
+ * @return
+ * The number of operations actually enqueued on the device. A return value
+ * equal to *nb_ops* means that all operations have been enqueued. The return
+ * value can be less than *nb_ops* when the device's queue is full or if
+ * invalid parameters are specified in an rte_accelerator_op
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_ops_enqueue(struct rte_accelerator_ctx *ctx,
+ uint16_t qp_id,
+ struct rte_accelerator_op **ops,
+ uint16_t nb_ops);
+
+/**
+ * Dequeue a burst of processed operations from a queue on the specified device.
+ * The dequeued operations are stored in rte_accelerator_op structures whose
+ * pointers are supplied in the *ops* array
+ *
+ * @param ctx Accelerator device instance
+ * @param qp_id Index of the device's queue pair from which processed
+ * operation should be retrieved
+ * @param ops Array of pointers to rte_accelerator_op structures
+ * that must be large enough to store *nb_ops* pointers in
+ * it
+ * @param nb_ops Maximum number of operations to dequeue
+ *
+ * @return
+ * The number of operations actually dequeued, which is the number of pointers
+ * to rte_accelerator_op structures effectively supplied in the *ops* array
+ */
+__rte_experimental
+uint16_t
+rte_accelerator_ops_dequeue(struct rte_accelerator_ctx *ctx,
+ uint16_t qp_id,
+ struct rte_accelerator_op **ops,
+ uint16_t nb_ops);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ACCELERATOR_H_ */
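To make the intended usage model above concrete, the following is a minimal, untested sketch of how an application might drive the proposed API for the DOCSIS downstream Crypto-CRC chain (CRC generation followed by AES-DOCSISBPI encryption). The device id, queue pair id, key material, op pool (assumed created beforehand with rte_accelerator_op_pool_create()) and mbuf contents are placeholders, and the per-stage crypto/err_detect op fields are omitted for brevity.

#include <errno.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_cryptodev.h>
#include "rte_accelerator.h"

/* Placeholder identifiers - a real application would discover/configure these */
static uint8_t dev_id;          /* cryptodev exposing an accelerator context */
static uint16_t qp_id;
static uint8_t cipher_key[16];  /* 128-bit DOCSIS BPI key */

static int
docsis_ds_crypto_crc(struct rte_mempool *op_pool, struct rte_mbuf *m)
{
	struct rte_accelerator_xform crc_xform = { 0 };
	struct rte_accelerator_xform cipher_xform = { 0 };
	struct rte_accelerator_session *sess;
	struct rte_accelerator_ctx *ctx;
	struct rte_accelerator_op *ops[2];
	uint16_t nb;

	/* The accelerator context is retrieved from the underlying device */
	ctx = rte_cryptodev_get_accelerator_ctx(dev_id);
	if (ctx == NULL)
		return -ENOTSUP;

	/* DOCSIS downstream order: CRC generation first, then encryption */
	crc_xform.type = RTE_ACC_XFORM_TYPE_ERR_DETECT;
	crc_xform.err_detect.op = RTE_ERR_DETECT_OP_GENERATE;
	crc_xform.err_detect.algo = RTE_ERR_DETECT_CRC32_ETH;
	crc_xform.next = &cipher_xform;

	cipher_xform.type = RTE_ACC_XFORM_TYPE_CRYPTO_SYM;
	cipher_xform.crypto_sym.type = RTE_CRYPTO_SYM_XFORM_CIPHER;
	cipher_xform.crypto_sym.cipher.op = RTE_CRYPTO_CIPHER_OP_ENCRYPT;
	cipher_xform.crypto_sym.cipher.algo = RTE_CRYPTO_CIPHER_AES_DOCSISBPI;
	cipher_xform.crypto_sym.cipher.key.data = cipher_key;
	cipher_xform.crypto_sym.cipher.key.length = sizeof(cipher_key);
	cipher_xform.next = NULL;

	sess = rte_accelerator_session_create(ctx, &crc_xform, rte_socket_id());
	if (sess == NULL)
		return -EINVAL;

	if (rte_accelerator_op_bulk_alloc(op_pool, ops, 2) != 2)
		return -ENOMEM;

	/* Parent op (CRC stage): carries the session and the mbuf */
	ops[0]->sess = sess;
	ops[0]->m_src = m;
	ops[0]->mempool = op_pool;
	ops[0]->type = RTE_ACCELERATOR_OP_TYPE_ERR_DETECT;
	ops[0]->next = ops[1];

	/* Chained op (cipher stage): must match the session's xform chain */
	ops[1]->mempool = op_pool;
	ops[1]->type = RTE_ACCELERATOR_OP_TYPE_CRYPTO_SYM;
	ops[1]->next = NULL;

	/* Per-stage offsets/lengths in the crypto_sym and err_detect members
	 * of the op union are omitted in this sketch.
	 */

	if (rte_accelerator_ops_enqueue(ctx, qp_id, ops, 1) != 1)
		return -EAGAIN;

	do {
		nb = rte_accelerator_ops_dequeue(ctx, qp_id, ops, 1);
	} while (nb == 0);

	return ops[0]->overall_status == RTE_ACCELERATOR_OP_STATUS_SUCCESS ?
			0 : -EIO;
}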
new file mode 100644
@@ -0,0 +1,146 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ACCELERATOR_DRIVER_H_
+#define _RTE_ACCELERATOR_DRIVER_H_
+
+/**
+ * @file rte_accelerator_driver.h
+ *
+ * RTE Accelerator Driver Definitions
+ *
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "rte_accelerator.h"
+
+/**
+ * Accelerator context for a device
+ *
+ * Accelerator instance for each driver to register their accelerator
+ * operations. The application can get the accelerator context from the
+ * underlying device using the API functions provided by the device's main API
+ * library (e.g. rte_cryptodev_get_accelerator_ctx() for crypto devices,
+ * rte_compressdev_get_accelerator_ctx() for compress devices, etc.).
+ */
+struct rte_accelerator_ctx {
+ void *device;
+ /**< Pointer to the device */
+ const struct rte_accelerator_ops *ops;
+ /**< Pointer to accelerator ops for the device */
+ uint16_t sess_cnt;
+ /**< Number of sessions attached to this context */
+};
+
+/**
+ * Accelerator session data
+ */
+struct rte_accelerator_session {
+ void *sess_private_data;
+ /**< Device's private session data */
+};
+
+/**
+ * Configure private accelerator session data on a device
+ *
+ * @param device Device pointer
+ * @param xform Pointer to the first element of the session transform
+ * chain
+ * @param sess Accelerator session structure
+ * @param socket_id Socket to allocate the session on
+ *
+ * @return
+ * - 0, if the private session structure has been created successfully
+ * - -EINVAL, if the input parameters are invalid
+ * - -ENOTSUP, if the device does not support the session configuration
+ * - -ENOMEM, if memory for the session could not be allocated
+ */
+typedef int
+(*accelerator_session_create_t)(void *device,
+ struct rte_accelerator_xform *xform,
+ struct rte_accelerator_session *sess,
+ int socket_id);
+
+/**
+ * Free private accelerator session data
+ *
+ * @param device Device pointer
+ * @param sess Accelerator session structure
+ */
+typedef void
+(*accelerator_session_destroy_t)(void *device,
+ struct rte_accelerator_session *sess);
+
+/**
+ * Get the size of private accelerator session data
+ *
+ * @param device Device pointer
+ *
+ * @return
+ * - Size of the private session structure for device, if successful
+ * - 0, if failure
+ */
+typedef unsigned int
+(*accelerator_session_private_size_get_t)(void *device);
+
+/**
+ * Enqueue operations on queue pair of a device for processing
+ *
+ * @param device Device pointer
+ * @param qp_id Index of the device's queue pair to which operations
+ * are to be enqueued for processing
+ * @param ops Array of *nb_ops* pointers to operations to be enqueued
+ * @param nb_ops Number of operations to enqueue
+ *
+ * @return
+ * The number of operations actually enqueued on the device
+ */
+typedef uint16_t
+(*accelerator_ops_enqueue_t)(void *device,
+ uint16_t qp_id,
+ struct rte_accelerator_op **ops,
+ uint16_t nb_ops);
+
+/**
+ * Dequeue processed operations from a queue pair of a device
+ *
+ * @param device Device pointer
+ * @param qp_id Index of the device's queue pair from which processed
+ * operations should be dequeued
+ * @param ops Array of pointers to rte_accelerator_op structures
+ * that must be large enough to store *nb_ops* pointers in
+ * it
+ * @param nb_ops Maximum number of operations to dequeue
+ *
+ * @return
+ * The number of operations actually dequeued from the device
+ */
+typedef uint16_t
+(*accelerator_ops_dequeue_t)(void *device,
+ uint16_t qp_id,
+ struct rte_accelerator_op **ops,
+ uint16_t nb_ops);
+
+/** Accelerator device operations function pointer table */
+struct rte_accelerator_ops {
+ accelerator_session_create_t session_create;
+ /**< Configure an accelerator device's private session data */
+ accelerator_session_destroy_t session_destroy;
+ /**< Free an accelerator device's private session data */
+ accelerator_session_private_size_get_t session_private_size_get;
+ /**< Get the size of an accelerator device's private session data */
+ accelerator_ops_enqueue_t ops_enqueue;
+ /**< Enqueue a burst of operations to a device */
+ accelerator_ops_dequeue_t ops_dequeue;
+ /**< Dequeue a burst of operations from a device */
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ACCELERATOR_DRIVER_H_ */
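By way of illustration, a PMD supporting this API would fill in the ops table above and expose an rte_accelerator_ctx from its device structure. The sample_pmd_* names below are purely hypothetical stubs to show the wiring; a real driver would of course implement the callbacks.

#include <rte_common.h>
#include "rte_accelerator_driver.h"

/* Hypothetical PMD callbacks - bodies are stubs for illustration only */
static int
sample_pmd_session_create(void *device, struct rte_accelerator_xform *xform,
		struct rte_accelerator_session *sess, int socket_id)
{
	/* A real PMD would parse the xform chain, allocate private data on
	 * socket_id and attach it to sess->sess_private_data.
	 */
	RTE_SET_USED(device);
	RTE_SET_USED(xform);
	RTE_SET_USED(sess);
	RTE_SET_USED(socket_id);
	return 0;
}

static uint16_t
sample_pmd_ops_enqueue(void *device, uint16_t qp_id,
		struct rte_accelerator_op **ops, uint16_t nb_ops)
{
	/* A real PMD would submit the op chains to its device or SW library */
	RTE_SET_USED(device);
	RTE_SET_USED(qp_id);
	RTE_SET_USED(ops);
	return nb_ops;
}

/* session_destroy, session_private_size_get and ops_dequeue omitted */

static const struct rte_accelerator_ops sample_pmd_accel_ops = {
	.session_create = sample_pmd_session_create,
	.ops_enqueue = sample_pmd_ops_enqueue,
	/* remaining callbacks wired up in the same way */
};

static struct rte_accelerator_ctx sample_pmd_accel_ctx = {
	.device = NULL, /* set to the PMD's device structure at probe time */
	.ops = &sample_pmd_accel_ops,
	.sess_cnt = 0,
};

/* At probe time the PMD would then set, for example,
 * cryptodev->accelerator_ctx = &sample_pmd_accel_ctx;
 */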
new file mode 100644
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation.
+ */
+
+#ifndef _RTE_ERR_DETECT_H_
+#define _RTE_ERR_DETECT_H_
+
+/**
+ * @file rte_err_detect.h
+ *
+ * RTE Error Detection Definitions
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdint.h>
+
+/** Error Detection Algorithms */
+enum rte_err_detect_algorithm {
+ RTE_ERR_DETECT_CRC32_ETH,
+ /**< CRC32 Ethernet */
+ RTE_ERR_DETECT_BIP32
+ /**< BIP32 */
+};
+
+/** Error Detection Operation Types */
+enum rte_err_detect_operation {
+ RTE_ERR_DETECT_OP_VERIFY,
+ /**< Verify error detection result */
+ RTE_ERR_DETECT_OP_GENERATE
+ /**< Generate error detection result */
+};
+
+/** Error Detection Status */
+enum rte_err_detect_op_status {
+ RTE_ERR_DETECT_OP_STATUS_NOT_PROCESSED,
+ /**< Operation has not yet been processed by a device */
+ RTE_ERR_DETECT_OP_STATUS_SUCCESS,
+ /**< Operation completed successfully */
+ RTE_ERR_DETECT_OP_STATUS_VERIFY_FAILED,
+ /**< Verification failed */
+ RTE_ERR_DETECT_OP_STATUS_ERROR
+ /**< Error occurred while processing the operation */
+};
+
+/**
+ * Error Detection Transform Data
+ *
+ * This structure contains data relating to an error detection transform. The
+ * fields *op* and *algo* are common to all error detection transforms and
+ * MUST be set
+ */
+struct rte_err_detect_xform {
+ enum rte_err_detect_operation op;
+ /**< Error detection operation type */
+ enum rte_err_detect_algorithm algo;
+ /**< Error detection algorithm */
+};
+
+/** Error Detection Operation */
+struct rte_err_detect_op {
+ struct rte_mbuf *m_src; /**< Source mbuf */
+
+ enum rte_err_detect_op_status status; /**< Operation status */
+
+ struct {
+ uint16_t offset;
+ /**<
+ * Starting point for error detection processing, specified
+ * as the number of bytes from start of the packet in the
+ * source mbuf
+ */
+ uint16_t length;
+ /**<
+ * The length, in bytes, of the data in the source mbuf over
+ * which the error detection operation will be computed
+ */
+ } data; /**< Data offset and length for error detection */
+
+ struct {
+ uint8_t *data;
+ /**<
+ * This points to the location where the error detection
+ * result should be written (in the case of generation) or
+ * where the purported result exists (in the case of
+ * verification)
+ *
+ * The caller must ensure the required length of physically
+ * contiguous memory is available at this address
+ *
+ * For a CRC, this may point into the mbuf packet data. For
+ * an operation such as a BIP, this may point to a memory
+ * location after the op
+ *
+ * For generation, the result will overwrite any data at this
+ * location
+ */
+ rte_iova_t phys_addr;
+ /**< Physical address of output data */
+ } output; /**< Output location */
+};
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_ERR_DETECT_H_ */
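As an illustration of how these fields fit together, the snippet below fills an rte_err_detect_xform/rte_err_detect_op pair to generate an Ethernet CRC over a frame held in a single contiguous mbuf, writing the 4-byte result into the last 4 bytes of the frame. The helper name and the offsets are illustrative only.

#include <rte_mbuf.h>
#include "rte_err_detect.h"

/* Assumption: m holds one contiguous Ethernet frame whose last 4 bytes are
 * reserved for the FCS. Offsets below are illustrative.
 */
static void
fill_eth_crc_gen(struct rte_err_detect_xform *xform,
		struct rte_err_detect_op *op, struct rte_mbuf *m)
{
	uint16_t frame_len = rte_pktmbuf_data_len(m);

	xform->op = RTE_ERR_DETECT_OP_GENERATE;
	xform->algo = RTE_ERR_DETECT_CRC32_ETH;

	op->m_src = m;
	op->status = RTE_ERR_DETECT_OP_STATUS_NOT_PROCESSED;

	/* CRC is computed over everything except the 4-byte FCS itself */
	op->data.offset = 0;
	op->data.length = frame_len - 4;

	/* Generated CRC overwrites the last 4 bytes of the frame */
	op->output.data = rte_pktmbuf_mtod_offset(m, uint8_t *, frame_len - 4);
	op->output.phys_addr = rte_pktmbuf_iova_offset(m, frame_len - 4);
}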
@@ -568,6 +568,21 @@ __rte_experimental
int
rte_compressdev_private_xform_free(uint8_t dev_id, void *private_xform);
+/**
+ * Get accelerator context for a device.
+ *
+ * @param dev_id
+ * Compress device identifier
+ *
+ * @return
+ * - Pointer to the device's accelerator context, if the device supports
+ * the accelerator API
+ * - NULL, otherwise
+ */
+__rte_experimental
+void *
+rte_compressdev_get_accelerator_ctx(uint8_t dev_id);
+
#ifdef __cplusplus
}
#endif
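A short, hedged example of how an application might probe compress devices for this context; the helper name is hypothetical and the returned pointer is the opaque handle consumed by the rte_accelerator calls.

#include <rte_compressdev.h>
#include "rte_accelerator.h"

/* Find the first compress device exposing the accelerator API, if any */
static struct rte_accelerator_ctx *
first_compressdev_accel_ctx(void)
{
	uint8_t dev_id;

	for (dev_id = 0; dev_id < rte_compressdev_count(); dev_id++) {
		void *ctx = rte_compressdev_get_accelerator_ctx(dev_id);

		if (ctx != NULL)
			return ctx; /* opaque rte_accelerator_ctx handle */
	}

	return NULL; /* no compress device supports the accelerator API */
}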
@@ -79,6 +79,9 @@ struct rte_compressdev {
struct rte_device *device;
/**< Backing device */
+ void *accelerator_ctx;
+ /**< Context for accelerator ops */
+
__extension__
uint8_t attached : 1;
/**< Flag indicating the device is attached */
@@ -838,6 +838,9 @@ struct rte_cryptodev {
void *security_ctx;
/**< Context for security ops */
+ void *accelerator_ctx;
+ /**< Context for accelerator ops */
+
__extension__
uint8_t attached : 1;
/**< Flag indicating the device is attached */
@@ -847,6 +850,20 @@ void *
rte_cryptodev_get_sec_ctx(uint8_t dev_id);
/**
+ * Get accelerator context for a device
+ *
+ * @param dev_id Device id.
+ *
+ * @return
+ * - Pointer to the device's accelerator context, if the device supports
+ * the accelerator API
+ * - NULL, otherwise
+ */
+__rte_experimental
+void *
+rte_cryptodev_get_accelerator_ctx(uint8_t dev_id);
+
+/**
*
* The data part, with no function pointers, associated with each device.
*