[RFC,0/4] cpu-crypto API choices

Message ID 20191105184122.15172-1-konstantin.ananyev@intel.com (mailing list archive)

Message

Ananyev, Konstantin Nov. 5, 2019, 6:41 p.m. UTC
  Originally both SW and HW crypto PMDs use the rte_crypto_op based API to
process the crypto workload asynchronously. This provides uniformity
to both PMD types, but also introduces an unnecessary performance penalty
for SW PMDs that have to "simulate" HW async behavior
(crypto-ops enqueue/dequeue, HW addresses computations,
storing/dereferencing user provided data (mbuf) for each crypto-op,
etc).

The aim is to introduce a new optional API for SW crypto-devices
to perform crypto processing in a synchronous manner.
As summarized by Akhil, we need a synchronous API to perform crypto
operations on raw data using SW PMDs, that provides:
 - no crypto-ops.
 - avoid using mbufs inside this API, use raw data buffers instead.
 - no separate enqueue-dequeue, only single process() API for data path.
 - input data buffers should be grouped by session,
   i.e. each process() call takes one session and group of input buffers
   that belong to that session.
 - All parameters that are constant across a session should be stored
   inside the session itself and reused by all incoming data buffers.
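
To make the requirements above concrete, here is a minimal sketch of what
such a data-path call could look like. All names (toy_session, raw_buf,
toy_process) are hypothetical and are not taken from any of the patches in
this series; the XOR transform is only a placeholder and not real
cryptography. It just shows one session, a group of raw buffers, and a
single call whose results are available on return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw buffer descriptor: plain pointer + length, no mbuf. */
struct raw_buf {
	uint8_t *data;
	size_t len;
};

/* Hypothetical session: holds everything that is constant across buffers. */
struct toy_session {
	uint8_t key[16];
};

/*
 * Single synchronous data-path call: one session, a group of buffers.
 * No crypto-op, no enqueue/dequeue; results are ready when it returns.
 * status[i] is set to 0 on success and -1 on failure.
 * Returns the number of successfully processed buffers.
 * The XOR transform is a placeholder, NOT a real cipher.
 */
unsigned int
toy_process(const struct toy_session *s, struct raw_buf bufs[],
	int status[], unsigned int num)
{
	unsigned int i, ok = 0;
	size_t j;

	assert(s != NULL && bufs != NULL && status != NULL);

	for (i = 0; i < num; i++) {
		if (bufs[i].data == NULL) {
			status[i] = -1;
			continue;
		}
		/* Per-session parameters (the key) are reused for each buffer. */
		for (j = 0; j < bufs[i].len; j++)
			bufs[i].data[j] ^= s->key[j % sizeof(s->key)];
		status[i] = 0;
		ok++;
	}
	return ok;
}
```

A caller would fill an array of raw_buf for one session, call
toy_process() once, and then inspect status[] directly, with no separate
dequeue step.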

While there seems to be no controversy about the need for such functionality,
there seems to be no agreement on what would be the best API for that.
So I am requesting TB (Technical Board) input on that matter.

Series structure:
- patch #1 - introduce basic data structures to be used by sync API
  (no controversy here, I hope ..)
  [RFC 1/4] cpu-crypto: Introduce basic data structures
- patch #2 - Intel initial approach for new API (via rte_security)
  [RFC 2/4] security: introduce cpu-crypto API
- patch #3 - approach that reuses existing rte_cryptodev API as much as
  possible
  [RFC 3/4] cryptodev: introduce cpu-crypto API
- patch #4 - approach via introducing new session data structure and API
  [RFC 4/4] cryptodev: introduce rte_crypto_cpu_sym_session API

Patches 2,3,4 are mutually exclusive,
and we probably have to choose which one to go forward with.
I put some explanations in each of the patches; hopefully that will help
to understand the pros and cons of each one.

Akhil strongly supports #3, AFAIK mainly because it allows PMDs to
reuse existing API and minimize API level changes.  
My favorite is #4, #2 is less preferable but ok too. 
#3 seems problematic to me by the reasons I outlined in #4 patch
description.

Please provide your opinion.

Konstantin Ananyev (4):
  cpu-crypto: Introduce basic data structures
  security: introduce cpu-crypto API
  cryptodev: introduce cpu-crypto API
  cryptodev: introduce rte_crypto_cpu_sym_session API

 lib/librte_cryptodev/rte_crypto_sym.h     | 63 +++++++++++++++++++++--
 lib/librte_cryptodev/rte_cryptodev.c      | 14 +++++
 lib/librte_cryptodev/rte_cryptodev.h      | 24 +++++++++
 lib/librte_cryptodev/rte_cryptodev_pmd.h  | 22 ++++++++
 lib/librte_security/rte_security.c        | 11 ++++
 lib/librte_security/rte_security.h        | 28 +++++++++-
 lib/librte_security/rte_security_driver.h | 20 +++++++
 7 files changed, 177 insertions(+), 5 deletions(-)
  

Comments

Honnappa Nagarahalli Nov. 6, 2019, 4:54 a.m. UTC | #1
<snip>

> 
> Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> existing API and minimize API level changes.
IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.

  
Thomas Monjalon Nov. 6, 2019, 9:35 a.m. UTC | #2
06/11/2019 05:54, Honnappa Nagarahalli:
> <snip>
> 
> > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > existing API and minimize API level changes.
> 
> IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.

We already have asymmetric and symmetric APIs.
Here you are proposing a third method: symmetric without mbuf for CPU PMDs

> > My favorite is #4, #2 is less preferable but ok too.
> > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > 
> > Please provide your opinion.

It means the API is not PMD agnostic, right?
  
Thomas Monjalon Nov. 6, 2019, 9:48 a.m. UTC | #3
06/11/2019 10:35, Thomas Monjalon:
> 06/11/2019 05:54, Honnappa Nagarahalli:
> > <snip>
> > 
> > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > > existing API and minimize API level changes.
> > 
> > IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.
> 
> We already have asymmetric and symmetric APIs.
> Here you are proposing a third method: symmetric without mbuf for CPU PMDs

Sorry, for this garbage, I am mixing synchronous/asynchronous and symmetric/asymmetric.

> > > My favorite is #4, #2 is less preferable but ok too.
> > > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > > 
> > > Please provide your opinion.
> 
> It means the API is not PMD agnostic, right?

So the question is to know if a synchronous API will be implemented only for CPU virtual PMDs?
  
Ananyev, Konstantin Nov. 6, 2019, 10:14 a.m. UTC | #4
> > > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > > > existing API and minimize API level changes.
> > >
> > > IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.
> >
> > We already have asymmetric and symmetric APIs.
> > Here you are proposing a third method: symmetric without mbuf for CPU PMDs
> 
> Sorry, for this garbage, I am mixing synchronous/asynchronous and symmetric/asymmetric.
> 
> > > > My favorite is #4, #2 is less preferable but ok too.
> > > > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > > >
> > > > Please provide your opinion.
> >
> > It means the API is not PMD agnostic, right?

Probably not...
Because inside DPDK we don't have any other abstraction for SW crypto-libs
except vdev, we do need dev_id to get session initialization point.
After that I believe all operations can be session based.
 
> So the question is to know if a synchronous API will be implemented only for CPU virtual PMDs?

I don't expect lookaside devices to benefit from sync mode.
I think performance penalty would be too high.
Konstantin
  
Ananyev, Konstantin Nov. 6, 2019, 11:33 a.m. UTC | #5
> > > > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > > > > existing API and minimize API level changes.
> > > >
> > > > IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.
> > >
> > > We already have asymmetric and symmetric APIs.
> > > Here you are proposing a third method: symmetric without mbuf for CPU PMDs
> >
> > Sorry, for this garbage, I am mixing synchronous/asynchronous and symmetric/asymmetric.
> >
> > > > > My favorite is #4, #2 is less preferable but ok too.
> > > > > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > > > >
> > > > > Please provide your opinion.
> > >
> > > It means the API is not PMD agnostic, right?
> 
> Probably not...
> Because inside DPDK we don't have any other abstraction for SW crypto-libs
> except vdev, we do need dev_id to get session initialization point.
> After that I believe all operations can be session based.
> 
> > So the question is to know if a synchronous API will be implemented only for CPU virtual PMDs?
> 
> I don't expect lookaside devices to benefit from sync mode.
> I think performance penalty would be too high.

After another thought, if some lookaside PMD would like to support such API -
I think it is still possible: dev_id (or just pointer to internal dev/queue structure)
can be stored inside the session itself.
Though I really doubt any lookaside PMD would be interested in such mode.
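
As a rough illustration of the point above (dev_id is needed only as the
session initialization point, and the device can then be reached through
the session), here is a small sketch. The names toy_dev, toy_sess,
toy_sess_init and toy_sess_process are made up for this example and do not
correspond to the structures in the patches; the XOR "driver" is a
placeholder, not real cryptography.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEVS 2

struct toy_sess;

/* Hypothetical per-device state: the driver's synchronous process hook. */
struct toy_dev {
	int (*process)(const struct toy_sess *s, uint8_t *data, size_t len);
};

/* The session caches the device pointer, so the data path needs no dev_id. */
struct toy_sess {
	const struct toy_dev *dev;
	uint8_t key;
};

/* Placeholder transform standing in for a driver; NOT a real cipher. */
int
toy_xor_process(const struct toy_sess *s, uint8_t *data, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		data[i] ^= s->key;
	return 0;
}

/* Device table indexed by dev_id; slot 1 has no sync support. */
static const struct toy_dev toy_devs[TOY_MAX_DEVS] = {
	{ .process = toy_xor_process },
	{ .process = NULL },
};

/* dev_id is used only here, as the session initialization point. */
int
toy_sess_init(struct toy_sess *s, unsigned int dev_id, uint8_t key)
{
	if (s == NULL || dev_id >= TOY_MAX_DEVS ||
			toy_devs[dev_id].process == NULL)
		return -1;
	s->dev = &toy_devs[dev_id];
	s->key = key;
	return 0;
}

/* Data path: purely session based, dispatches via the cached pointer. */
int
toy_sess_process(const struct toy_sess *s, uint8_t *data, size_t len)
{
	return s->dev->process(s, data, len);
}
```

With this layout the application passes dev_id once at setup time; every
later call takes only the session.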
Konstantin
  
Thomas Monjalon Nov. 6, 2019, 12:18 p.m. UTC | #6
06/11/2019 12:33, Ananyev, Konstantin:
> 
> > > > > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > > > > > existing API and minimize API level changes.
> > > > >
> > > > > IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.
> > > >
> > > > We already have asymmetric and symmetric APIs.
> > > > Here you are proposing a third method: symmetric without mbuf for CPU PMDs
> > >
> > > Sorry, for this garbage, I am mixing synchronous/asynchronous and symmetric/asymmetric.
> > >
> > > > > > My favorite is #4, #2 is less preferable but ok too.
> > > > > > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > > > > >
> > > > > > Please provide your opinion.
> > > >
> > > > It means the API is not PMD agnostic, right?
> > 
> > Probably not...
> > Because inside DPDK we don't have any other abstraction for SW crypto-libs
> > except vdev, we do need dev_id to get session initialization point.
> > After that I believe all operations can be session based.
> > 
> > > So the question is to know if a synchronous API will be implemented only for CPU virtual PMDs?
> > 
> > I don't expect lookaside devices to benefit from sync mode.
> > I think performance penalty would be too high.
> 
> After another thought, if some lookaside PMD would like to support such API -
> I think it is still possible: dev_id (or just pointer to internal dev/queue structure)
> can be stored inside the session itself.
> Though I really doubt any lookaside PMD would be interested in such mode.

So what should be the logic in the application?
How is the PMD/API combo chosen?
How does it work with the crypto scheduler?
  
Hemant Agrawal Nov. 6, 2019, 12:22 p.m. UTC | #7
> 06/11/2019 12:33, Ananyev, Konstantin:
> >
> > > > > > > Akhil strongly supports #3, AFAIK mainly because it allows
> > > > > > > PMDs to reuse existing API and minimize API level changes.
> > > > > >
> > > > > > IMO, from application perspective, it should not matter who
> > > > > > (CPU or an accelerator) does the crypto functionality. It just
> > > > > > needs to know if the result will be returned synchronously or
> > > > > > asynchronously.
> > > > >
> > > > > We already have asymmetric and symmetric APIs.
> > > > > Here you are proposing a third method: symmetric without mbuf
> > > > > for CPU PMDs
> > > >
> > > > Sorry, for this garbage, I am mixing synchronous/asynchronous and
> > > > symmetric/asymmetric.
> > > >
> > > > > > > My favorite is #4, #2 is less preferable but ok too.
> > > > > > > #3 seems problematic to me by the reasons I outlined in #4 patch
> > > > > > > description.
> > > > > > >
> > > > > > > Please provide your opinion.
> > > > >
> > > > > It means the API is not PMD agnostic, right?
> > >
> > > Probably not...
> > > Because inside DPDK we don't have any other abstraction for SW
> > > crypto-libs except vdev, we do need dev_id to get session initialization
> > > point.
> > > After that I believe all operations can be session based.
> > >
> > > > So the question is to know if a synchronous API will be implemented
> > > > only for CPU virtual PMDs?
> > >
> > > I don't expect lookaside devices to benefit from sync mode.
> > > I think performance penalty would be too high.
> >
> > After another thought, if some lookaside PMD would like to support
> > such API - I think it is still possible: dev_id (or just pointer to
> > internal dev/queue structure) can be stored inside the session itself.
> > Though I really doubt any lookaside PMD would be interested in such
> > mode.

[Hemant] Lookaside PMDs may be interested, not in the synchronous nature, but in raw buffer processing.

E.g. I see a use case for supporting crypto without being forced to use crypto_ops or mbufs, i.e. using plain buffers.

So I would like to take advantage of similar APIs, just extended with an option to indicate that processing is asynchronous.
We could then either overload an existing API or add a new one to obtain such a raw crypto process API.
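
One possible reading of this suggestion, sketched below under stated
assumptions: the same raw-buffer call is used in both cases, and a
per-session mode option tells the caller whether the result is ready on
return or has to be collected later. All names here (toy_raw_sess,
toy_raw_process, toy_raw_poll) are hypothetical, the fixed-size pending
array merely stands in for a HW queue, and the XOR transform is a
placeholder, not real cryptography.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SZ 8

/* Hypothetical option: does processing complete on return or later? */
enum toy_mode { TOY_SYNC, TOY_ASYNC };

/* Plain buffer: no crypto_op, no mbuf. */
struct toy_buf {
	uint8_t *data;
	size_t len;
};

struct toy_raw_sess {
	enum toy_mode mode;
	uint8_t key;
	struct toy_buf pending[TOY_RING_SZ]; /* stands in for a HW queue */
	unsigned int npending;
};

/* Placeholder transform; NOT a real cipher. */
static void
toy_xform(uint8_t key, struct toy_buf *b)
{
	size_t i;

	for (i = 0; i < b->len; i++)
		b->data[i] ^= key;
}

/*
 * Same raw-buffer entry point for both modes.
 * Returns 1 if the buffer is completed on return (sync),
 * 0 if it was accepted and is still pending (async),
 * -1 on invalid input or when the pending queue is full.
 */
int
toy_raw_process(struct toy_raw_sess *s, struct toy_buf *b)
{
	if (s == NULL || b == NULL || b->data == NULL)
		return -1;
	if (s->mode == TOY_SYNC) {
		toy_xform(s->key, b);
		return 1;
	}
	if (s->npending == TOY_RING_SZ)
		return -1;
	s->pending[s->npending++] = *b;
	return 0;
}

/* Async only: complete everything pending; returns how many finished. */
unsigned int
toy_raw_poll(struct toy_raw_sess *s)
{
	unsigned int i, n = s->npending;

	for (i = 0; i < n; i++)
		toy_xform(s->key, &s->pending[i]);
	s->npending = 0;
	return n;
}
```

The application still deals only with plain buffers; the mode option is
what tells it whether a later poll is required.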
  
Ananyev, Konstantin Nov. 6, 2019, 3:19 p.m. UTC | #8
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, November 6, 2019 12:19 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: techboard@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org; Zhang, Roy Fan
> <roy.fan.zhang@intel.com>; Doherty, Declan <declan.doherty@intel.com>; Akhil.goyal@nxp.com; nd <nd@arm.com>
> Subject: Re: [dpdk-techboard] [RFC 0/4] cpu-crypto API choices
> 
> 06/11/2019 12:33, Ananyev, Konstantin:
> >
> > > > > > > Originally both SW and HW crypto PMDs use rte_crypot_op based API to
> > > > > > > process the crypto workload asynchronously. This way provides uniformity to
> > > > > > > both PMD types, but also introduce unnecessary performance penalty to SW
> > > > > > > PMDs that have to "simulate" HW async behavior (crypto-ops
> > > > > > > enqueue/dequeue, HW addresses computations, storing/dereferencing user
> > > > > > > provided data (mbuf) for each crypto-op, etc).
> > > > > > >
> > > > > > > The aim is to introduce a new optional API for SW crypto-devices to perform
> > > > > > > crypto processing in a synchronous manner.
> > > > > > > As summarized by Akhil, we need a synchronous API to perform crypto
> > > > > > > operations on raw data using SW PMDs, that provides:
> > > > > > >  - no crypto-ops.
> > > > > > >  - avoid using mbufs inside this API, use raw data buffers instead.
> > > > > > >  - no separate enqueue-dequeue, only single process() API for data path.
> > > > > > >  - input data buffers should be grouped by session,
> > > > > > >    i.e. each process() call takes one session and group of input buffers
> > > > > > >    that  belong to that session.
> > > > > > >  - All parameters that are constant accross session, should be stored
> > > > > > >    inside the session itself and reused by all incoming data buffers.
> > > > > > >
> > > > > > > While there seems no controversy about need of such functionality, there
> > > > > > > seems to be no agreement on what would be the best API for that.
> > > > > > > So I am requesting for TB input on that matter.
> > > > > > >
> > > > > > > Series structure:
> > > > > > > - patch #1 - introduce basic data structures to be used by sync API
> > > > > > >   (no controversy here, I hope ..)
> > > > > > >   [RFC 1/4] cpu-crypto: Introduce basic data structures
> > > > > > > - patch #2 - Intel initial approach for new API (via rte_security)
> > > > > > >   [RFC 2/4] security: introduce cpu-crypto API
> > > > > > > - patch #3 - approach that reuses existing rte_cryptodev API as much as
> > > > > > >   possible
> > > > > > >   [RFC 3/4] cryptodev: introduce cpu-crypto API
> > > > > > > - patch #4 - approach via introducing new session data structure and API
> > > > > > >   [RFC 4/4] cryptodev: introduce rte_crypto_cpu_sym_session API
> > > > > > >
> > > > > > > Patches 2,3,4 are mutually exclusive,
> > > > > > > and we probably have to choose which one to go forward with.
> > > > > > > I put some explanations in each of the patches, hopefully that will help to
> > > > > > > understand pros and cons of each one.
> > > > > > >
> > > > > > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to reuse
> > > > > > > existing API and minimize API level changes.
> > > > > >
> > > > > > IMO, from application perspective, it should not matter who (CPU or an accelerator) does the crypto functionality. It just needs to know if the result will be returned synchronously or asynchronously.
> > > > >
> > > > > We already have asymmetric and symmetric APIs.
> > > > > Here you are proposing a third method: symmetric without mbuf for CPU PMDs
> > > >
> > > > Sorry, for this garbage, I am mixing synchronous/asynchronous and symmetric/asymmetric.
> > > >
> > > > > > > My favorite is #4, #2 is less preferable but ok too.
> > > > > > > #3 seems problematic to me by the reasons I outlined in #4 patch description.
> > > > > > >
> > > > > > > Please provide your opinion.
> > > > >
> > > > > It means the API is not PMD agnostic, right?
> > >
> > > Probably not...
> > > Because inside DPDK we don't have any other abstraction for SW crypto-libs
> > > except vdev, we do need dev_id to get session initialization point.
> > > After that I believe all operations can be session based.
> > >
> > > > So the question is to know if a synchronous API will be implemented only for CPU virtual PMDs?
> > >
> > > I don't expect lookaside devices to benefit from sync mode.
> > > I think performance penalty would be too high.
> >
> > After another thought, if some lookaside PMD would like to support such API -
> > I think it is still possible: dev_id (or just pointer to internal dev/queue structure)
> > can be stored inside the session itself.
> > Though I really doubt any lookaside PMD would be interested in such mode.
> 
> So what should be the logic in the application?
> How the combo PMD/API is chosen?

Up to the user.
At session creation time the user has to choose which kind of session to use.
Then on the data path the user can call either the async API (enqueue/dequeue)
or the sync API (process).
I expect users who care about extra performance will choose cpu-crypto mode
when it is available.
Existing apps, and apps that would like to have just one code path,
would stay with async mode and will be unaffected.
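
A minimal sketch of that choice, under stated assumptions: every name here (demo_session, demo_process, demo_enqueue, demo_dequeue) is hypothetical and is not a DPDK API, and a one-byte XOR stands in for the real crypto work. It only illustrates that the mode is fixed when the session is created and that the data path then uses either process() or the enqueue/dequeue pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of them are DPDK APIs. */
enum demo_sess_mode { DEMO_SESS_ASYNC, DEMO_SESS_SYNC };

#define DEMO_QUEUE_SIZE 32u

struct demo_buf {
	uint8_t *data;
	size_t len;
};

struct demo_session {
	enum demo_sess_mode mode;                /* chosen once, at session creation */
	uint8_t key;                             /* toy cipher state: one-byte XOR key */
	struct demo_buf *queue[DEMO_QUEUE_SIZE]; /* used by async mode only */
	unsigned int head, tail;
};

static void
demo_session_init(struct demo_session *s, enum demo_sess_mode mode, uint8_t key)
{
	s->mode = mode;
	s->key = key;
	s->head = 0;
	s->tail = 0;
}

/* Toy transform standing in for the real crypto work. */
static void
demo_xor(const struct demo_session *s, struct demo_buf *b)
{
	for (size_t i = 0; i < b->len; i++)
		b->data[i] ^= s->key;
}

/* Sync path: all buffers are finished when the call returns. */
static unsigned int
demo_process(struct demo_session *s, struct demo_buf *bufs, unsigned int n)
{
	assert(s->mode == DEMO_SESS_SYNC);
	for (unsigned int i = 0; i < n; i++)
		demo_xor(s, &bufs[i]);
	return n;
}

/* Async path, step 1: only record the buffer; returns 0 if the queue is full. */
static unsigned int
demo_enqueue(struct demo_session *s, struct demo_buf *b)
{
	assert(s->mode == DEMO_SESS_ASYNC);
	if (s->tail - s->head == DEMO_QUEUE_SIZE)
		return 0;
	s->queue[s->tail++ % DEMO_QUEUE_SIZE] = b;
	return 1;
}

/* Async path, step 2: finish the oldest pending buffer, or return NULL. */
static struct demo_buf *
demo_dequeue(struct demo_session *s)
{
	struct demo_buf *b;

	assert(s->mode == DEMO_SESS_ASYNC);
	if (s->head == s->tail)
		return NULL;
	b = s->queue[s->head++ % DEMO_QUEUE_SIZE];
	demo_xor(s, b);
	return b;
}
```

In this toy model an application that keeps a single async code path never touches demo_process(), which mirrors the point that existing apps would be unaffected.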

> How does it work with the crypto scheduler?

If we want to add cpu-crypto support to the crypto-scheduler PMD,
then changes would be needed anyway, no matter whether we choose #3 or #4.
  
Jerin Jacob Nov. 14, 2019, 5:46 a.m. UTC | #9
On Wed, Nov 6, 2019 at 12:11 AM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> Originally both SW and HW crypto PMDs use rte_crypto_op based API to
> process the crypto workload asynchronously. This way provides uniformity
> to both PMD types, but also introduce unnecessary performance penalty to
> SW PMDs that have to "simulate" HW async behavior
> (crypto-ops enqueue/dequeue, HW addresses computations,
> storing/dereferencing user provided data (mbuf) for each crypto-op,
> etc).
>
> The aim is to introduce a new optional API for SW crypto-devices
> to perform crypto processing in a synchronous manner.
> As summarized by Akhil, we need a synchronous API to perform crypto
> operations on raw data using SW PMDs, that provides:
>  - no crypto-ops.
>  - avoid using mbufs inside this API, use raw data buffers instead.
>  - no separate enqueue-dequeue, only single process() API for data path.
>  - input data buffers should be grouped by session,
>    i.e. each process() call takes one session and group of input buffers
>    that  belong to that session.
>  - All parameters that are constant across session, should be stored
>    inside the session itself and reused by all incoming data buffers.
>
> While there seems no controversy about need of such functionality,
> there seems to be no agreement on what would be the best API for that.
> So I am requesting for TB input on that matter.
>
> Series structure:
> - patch #1 - introduce basic data structures to be used by sync API
>   (no controversy here, I hope ..)
>   [RFC 1/4] cpu-crypto: Introduce basic data structures
> - patch #2 - Intel initial approach for new API (via rte_security)
>   [RFC 2/4] security: introduce cpu-crypto API
> - patch #3 - approach that reuses existing rte_cryptodev API as much as
>   possible
>   [RFC 3/4] cryptodev: introduce cpu-crypto API
> - patch #4 - approach via introducing new session data structure and API
>   [RFC 4/4] cryptodev: introduce rte_crypto_cpu_sym_session API
>
> Patches 2,3,4 are mutually exclusive,
> and we probably have to choose which one to go forward with.
> I put some explanations in each of the patches, hopefully that will help
> to  understand pros and cons of each one.
>
> Akhil strongly supports #3, AFAIK mainly because it allows PMDs to
> reuse existing API and minimize API level changes.
> My favorite is #4, #2 is less preferable but ok too.
> #3 seems problematic to me by the reasons I outlined in #4 patch
> description.
>
> Please provide your opinion.

I spent some time on the proposal and I agree that a sync API is needed,
and that it makes sense to remove the queue emulation and the
allocating/freeing of crypto_ops in the case of the sync API.
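
As an illustration only, the shape being agreed on here (one session, a group of raw buffers, a single process() call, no crypto-ops and no queue) might look like the sketch below. The types demo_raw_buf, demo_raw_vec and demo_raw_session are hypothetical and are not the RFC's rte_crypto_vec / rte_crypto_sym_vec definitions; a one-byte XOR again stands in for the cipher, and the per-session offset is an assumed example of a parameter that is constant across the session.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; the RFC's rte_crypto_vec / rte_crypto_sym_vec may differ. */
struct demo_raw_buf {
	uint8_t *base;
	uint32_t len;
};

struct demo_raw_vec {
	struct demo_raw_buf *bufs; /* raw data buffers, no mbufs involved */
	int32_t *status;           /* per-buffer result: 0 on success, -1 on error */
	uint32_t num;              /* number of entries in bufs[] and status[] */
};

struct demo_raw_session {
	uint8_t key;     /* constant for the whole session */
	uint32_t offset; /* bytes left untouched at the start of every buffer */
};

/*
 * Process every buffer of the vector with the parameters stored in the
 * session. Returns the number of successfully processed buffers; failures
 * are reported through v->status[].
 */
static uint32_t
demo_raw_process(const struct demo_raw_session *s, struct demo_raw_vec *v)
{
	uint32_t ok = 0;

	for (uint32_t i = 0; i < v->num; i++) {
		struct demo_raw_buf *b = &v->bufs[i];

		if (b->base == NULL || b->len < s->offset) {
			v->status[i] = -1;
			continue;
		}
		for (uint32_t j = s->offset; j < b->len; j++)
			b->base[j] ^= s->key;
		v->status[i] = 0;
		ok++;
	}
	return ok;
}
```

Nothing is allocated or freed per call in this sketch: the caller owns the buffer and status arrays and can reuse them across bursts.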

# I would prefer not to duplicate the session. If the newly added
fields are for optimization, then they can be applicable to HW too.
For example, if we consider the offset to be constant for one session,
a HW PMD will be able to leverage this. ref:
rte_crypto_aead_xform::cpu_crypto:offset

# I would prefer not to duplicate the ops parameters; instead, the
existing rte_crypto_ops can be updated.
I see that most members introduced in rte_crypto_sym_vec and
rte_crypto_vec already exist in rte_crypto_op.

Also, since we are agreeing that the ops for SYNC API can be from
stack/one time allocated, the size shouldn't matter.

I understand that this would cause ABI breakage, but for this release,
we can work together and add some reserved fields
that we can implement later. I believe that's the reason why you want
to introduce new structures. I think that will bloat
the existing crypto lib.

If I understand it correctly, this will be used in conjunction with
IXGBE to handle fragmented IPsec traffic. If that's the fundamental
reasoning, then there is an alternate path possible. Currently, the
issue is, rte_security doesn't define the treatment for fragmented
packets. Maybe let's define it and then a similar CPU crypto
processing can be done inside the PMD. By creating an internal
function in S/W PMDs and calling it from the inline crypto enabled eth
PMDs, fragmentation support for inline crypto devices can
be achieved. This way the application would look clean. All the
fragmentation related configuration (no of fragmentation contexts
needed,
reassembly timeout etc) need to be added in rte_security library and
the result for that operation will come as dynamic fields in the mbuf.

Just my 2c.

>
> Konstantin Ananyev (4):
>   cpu-crypto: Introduce basic data structures
>   security: introduce cpu-crypto API
>   cryptodev: introduce cpu-crypto API
>   cryptodev: introduce rte_crypto_cpu_sym_session API
>
>  lib/librte_cryptodev/rte_crypto_sym.h     | 63 +++++++++++++++++++++--
>  lib/librte_cryptodev/rte_cryptodev.c      | 14 +++++
>  lib/librte_cryptodev/rte_cryptodev.h      | 24 +++++++++
>  lib/librte_cryptodev/rte_cryptodev_pmd.h  | 22 ++++++++
>  lib/librte_security/rte_security.c        | 11 ++++
>  lib/librte_security/rte_security.h        | 28 +++++++++-
>  lib/librte_security/rte_security_driver.h | 20 +++++++
>  7 files changed, 177 insertions(+), 5 deletions(-)
>
> --
> 2.17.1
>
  
Ananyev, Konstantin Nov. 18, 2019, 11:57 a.m. UTC | #10
Hi Jerin,

Thanks for the input, my answers are inline.
Others, please provide your input too.
Thanks
Konstantin

> > Originally both SW and HW crypto PMDs use rte_crypto_op based API to
> > process the crypto workload asynchronously. This way provides uniformity
> > to both PMD types, but also introduce unnecessary performance penalty to
> > SW PMDs that have to "simulate" HW async behavior
> > (crypto-ops enqueue/dequeue, HW addresses computations,
> > storing/dereferencing user provided data (mbuf) for each crypto-op,
> > etc).
> >
> > The aim is to introduce a new optional API for SW crypto-devices
> > to perform crypto processing in a synchronous manner.
> > As summarized by Akhil, we need a synchronous API to perform crypto
> > operations on raw data using SW PMDs, that provides:
> >  - no crypto-ops.
> >  - avoid using mbufs inside this API, use raw data buffers instead.
> >  - no separate enqueue-dequeue, only single process() API for data path.
> >  - input data buffers should be grouped by session,
> >    i.e. each process() call takes one session and group of input buffers
> >    that  belong to that session.
> >  - All parameters that are constant across session, should be stored
> >    inside the session itself and reused by all incoming data buffers.
> >
> > While there seems no controversy about need of such functionality,
> > there seems to be no agreement on what would be the best API for that.
> > So I am requesting for TB input on that matter.
> >
> > Series structure:
> > - patch #1 - introduce basic data structures to be used by sync API
> >   (no controversy here, I hope ..)
> >   [RFC 1/4] cpu-crypto: Introduce basic data structures
> > - patch #2 - Intel initial approach for new API (via rte_security)
> >   [RFC 2/4] security: introduce cpu-crypto API
> > - patch #3 - approach that reuses existing rte_cryptodev API as much as
> >   possible
> >   [RFC 3/4] cryptodev: introduce cpu-crypto API
> > - patch #4 - approach via introducing new session data structure and API
> >   [RFC 4/4] cryptodev: introduce rte_crypto_cpu_sym_session API
> >
> > Patches 2,3,4 are mutually exclusive,
> > and we probably have to choose which one to go forward with.
> > I put some explanations in each of the patches, hopefully that will help
> > to  understand pros and cons of each one.
> >
> > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to
> > reuse existing API and minimize API level changes.
> > My favorite is #4, #2 is less preferable but ok too.
> > #3 seems problematic to me by the reasons I outlined in #4 patch
> > description.
> >
> > Please provide your opinion.
> 
> I spend some time on the proposal and I agree that sync API is needed
> and it makes sense to remove queue emulation and allocating/freeing
> the crypto_ops
> in case of sync API.
> 
> # I would prefer to not duplicate the session. If the newly added
> fields are for optimization
> then those can be applicable for HW too. For example, if we consider,
> offset to be
> constant for one session HW PMD will be able to leverage this. ref:
> rte_crypto_aead_xform::cpu_crypto:offset

It might, but right now for the async API we pass this info in crypto_op instead.
So if I get you right, your preference is sort of the #3 approach
that reuses the existing rte_cryptodev API as much as possible:
reuse the existing rte_cryptodev_sym structure with a new sync process() API?
 
> # I would prefer to not duplicate ops parameters, instead of the
> existing rte_crypto_ops  can be updated.
> I see that most members introduced in rte_crypto_sym_vec &
> rte_crypto_vec are already existing in rte_crypto_op.

rte_crypto_ops is way too generic/excessive.
Filling/reading it seems to be one of the main slowdowns that we are trying
to avoid in the new API.

> 
> Also, since we are agreeing that the ops for SYNC API can be from
> stack/one time allocated, the size shouldn't matter.

It can be on the stack, but it means the user will still have to fill them
and the PMD will have to read/process/overwrite them.
 
> I understand that this would cause ABI breakage, but for this release,
> we can work together and add some reserved fields
> that we can implement later. I believe that's the reason why you want
> to introduce new structures. I think that will bloat
> the existing crypto lib.

It will increase the lib code, but I don't think the increase will be significant.
Honestly, I think messing with crypto_op and other existing structures
might have a much more negative effect.
 
> If I understand it correctly, this will be used in conjunction with
> IXGBE to handle fragmented IPsec traffic. If that's the fundamental
> reasoning, then there is an alternate path possible.

No, it's just one of the use-cases.
Pretty important, but not the only one.
The main reason: the current cryptodev API (crypto_op based) is suboptimal for SW based PMDs.
We are wasting too many cycles pretending that there is a lookaside device underneath.
I think it makes more sense to admit that it is SW based and exploit its nature,
instead of trying to hide it.

> Currently, the  issue is, rte_security doesn't define the treatment for fragmented
> packets. Maybe let's define it and then a similar CPU crypto
> processing can be done inside the PMD. By creating an internal
> function in S/W PMDs and calling it from the inline crypto enabled eth
> PMDs, fragmentation support for inline crypto devices can
> be achieved. This way the application would look clean. All the
> fragmentation related configuration (no of fragmentation contexts
> needed,
> reassembly timeout etc) need to be added in rte_security library and
> the result for that operation will come as dynamic fields in the mbuf.
> 
> Just my 2c.
>
  
Jerin Jacob Nov. 20, 2019, 2:27 p.m. UTC | #11
On Mon, Nov 18, 2019 at 5:27 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
>
> Hi Jerin,

Hi Konstantin,

>
> Thanks for input, my answers inline.
> Other guys - please provide your input.
> Thanks
> Konstantin
>
> > > Originally both SW and HW crypto PMDs use rte_crypto_op based API to
> > > process the crypto workload asynchronously. This way provides uniformity
> > > to both PMD types, but also introduce unnecessary performance penalty to
> > > SW PMDs that have to "simulate" HW async behavior
> > > (crypto-ops enqueue/dequeue, HW addresses computations,
> > > storing/dereferencing user provided data (mbuf) for each crypto-op,
> > > etc).
> > >
> > > The aim is to introduce a new optional API for SW crypto-devices
> > > to perform crypto processing in a synchronous manner.
> > > As summarized by Akhil, we need a synchronous API to perform crypto
> > > operations on raw data using SW PMDs, that provides:
> > >  - no crypto-ops.
> > >  - avoid using mbufs inside this API, use raw data buffers instead.
> > >  - no separate enqueue-dequeue, only single process() API for data path.
> > >  - input data buffers should be grouped by session,
> > >    i.e. each process() call takes one session and group of input buffers
> > >    that  belong to that session.
> > >  - All parameters that are constant across session, should be stored
> > >    inside the session itself and reused by all incoming data buffers.
> > >
> > > While there seems no controversy about need of such functionality,
> > > there seems to be no agreement on what would be the best API for that.
> > > So I am requesting for TB input on that matter.
> > >
> > > Series structure:
> > > - patch #1 - introduce basic data structures to be used by sync API
> > >   (no controversy here, I hope ..)
> > >   [RFC 1/4] cpu-crypto: Introduce basic data structures
> > > - patch #2 - Intel initial approach for new API (via rte_security)
> > >   [RFC 2/4] security: introduce cpu-crypto API
> > > - patch #3 - approach that reuses existing rte_cryptodev API as much as
> > >   possible
> > >   [RFC 3/4] cryptodev: introduce cpu-crypto API
> > > - patch #4 - approach via introducing new session data structure and API
> > >   [RFC 4/4] cryptodev: introduce rte_crypto_cpu_sym_session API
> > >
> > > Patches 2,3,4 are mutually exclusive,
> > > and we probably have to choose which one to go forward with.
> > > I put some explanations in each of the patches, hopefully that will help
> > > to  understand pros and cons of each one.
> > >
> > > Akhil strongly supports #3, AFAIK mainly because it allows PMDs to
> > > reuse existing API and minimize API level changes.
> > > My favorite is #4, #2 is less preferable but ok too.
> > > #3 seems problematic to me by the reasons I outlined in #4 patch
> > > description.
> > >
> > > Please provide your opinion.
> >
> > I spend some time on the proposal and I agree that sync API is needed
> > and it makes sense to remove queue emulation and allocating/freeing
> > the crypto_ops
> > in case of sync API.
> >
> > # I would prefer to not duplicate the session. If the newly added
> > fields are for optimization
> > then those can be applicable for HW too. For example, if we consider,
> > offset to be
> > constant for one session HW PMD will be able to leverage this. ref:
> > rte_crypto_aead_xform::cpu_crypto:offset
>
> It might, but right for async API we pass this info in crypto_op instead.
> So if I get you right your preference is sort of #3 approach
> that reuses existing rte_cryptodev API as much as possible:
> reuse existing rte_cryptodev_sym structure with new sync process() API?

Yes.

> > # I would prefer to not duplicate ops parameters, instead of the
> > existing rte_crypto_ops  can be updated.
> > I see that most members introduced in rte_crypto_sym_vec &
> > rte_crypto_vec are already existing in rte_crypto_op.
>
> rte_crypto_ops is way too generic/excessive.
> Filling/reading it seems one of the main slowdowns that  we trying to
> avoid in new API.

It does not look like it goes over 1 CL (cache line). Regarding the
filling case, I think we need to form the rte_crypto_ops in the slow
path and update only the mutable fields per packet.
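
A rough sketch of that idea, with loud assumptions: struct demo_op and its fields are hypothetical and are not the layout of the real rte_crypto_op, and the 64-byte cache line is assumed. The op is filled once in the slow path with everything that is constant for the session, and the fast path rewrites only the per-packet fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical op layout; this is NOT the real rte_crypto_op. */
struct demo_op {
	/* Constant per session: written once, in the slow path. */
	const void *session;
	uint32_t cipher_offset;
	uint32_t auth_offset;
	uint32_t digest_len;
	/* Mutable per packet: the only fields touched in the fast path. */
	uint8_t *data;
	uint32_t data_len;
	int32_t status;
};

/* Holds for this toy struct only (64-byte line assumed); it proves nothing about rte_crypto_op. */
_Static_assert(sizeof(struct demo_op) <= 64, "demo_op exceeds one cache line");

/* Slow path: fill the session-constant part and clear the per-packet part. */
static void
demo_op_prepare(struct demo_op *op, const void *session,
		uint32_t cipher_offset, uint32_t auth_offset, uint32_t digest_len)
{
	op->session = session;
	op->cipher_offset = cipher_offset;
	op->auth_offset = auth_offset;
	op->digest_len = digest_len;
	op->data = NULL;
	op->data_len = 0;
	op->status = 0;
}

/* Fast path: point the prepared op at the next packet. */
static void
demo_op_set_packet(struct demo_op *op, uint8_t *data, uint32_t data_len)
{
	op->data = data;
	op->data_len = data_len;
	op->status = 0;
}
```

Even in this form the caller still writes three fields per packet and the PMD still has to read them back, which is the cost the other side of the discussion wants to avoid.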

> >
> > Also, since we are agreeing that the ops for SYNC API can be from
> > stack/one time allocated, the size shouldn't matter.
>
> It can be on stack, but it means user will still have to fill them
> and PMD will have to read/process/overwrite them.
>
> > I understand that this would cause ABI breakage, but for this release,
> > we can work together and add some reserved fields
> > that we can implement later. I believe that's the reason why you want
> > to introduce new structures. I think that will bloat
> > the existing crypto lib.
>
> It will increase the lib code, but I don't think it will be significant.
> Honestly, I think messing with crypto_op and other existing structures
> might have much more negative effect.

Yes. We need to change it carefully.

>
> > If I understand it correctly, this will be used in conjunction with
> > IXGBE to handle fragmented IPsec traffic. If that's the fundamental
> > reasoning, then there is an alternate path possible.
>
> No, it's just one of the use-case.
> Pretty important, but not the only one.
> The main reason - current cryptodev API (crypto_op based) is suboptimal for SW based PMDs.
> We wasting too many cycles to pretend that it is a lookaside device underneath.

That I agree with. I think it should be fixed by the process() API.

> I think makes more sense to admit that it is SW based and exploit it nature,
> instead of trying to hide it.

Yes. I thought the separate process() device op would solve the major problems.

This is just my _personal_ opinion. I leave it to the crypto code
contributors to define the specifics of the API.