[v2,0/4] Introduce IF proxy library

Message ID 20200310111037.31451-1-aostruszka@marvell.com (mailing list archive)

Message

Andrzej Ostruszka [C] March 10, 2020, 11:10 a.m. UTC
What is this useful for
=======================

Usually, when an Ethernet port is assigned to DPDK it vanishes from the
system and the user loses the ability to control it via normal configuration
utilities (e.g. those from the iproute2 package).  Moreover, by default a
DPDK application is not aware of the network configuration of the system.

To address both of these issues the application needs to:
- add some command line interface (or other mechanism) allowing for
  control of the port and its configuration
- query the status of the network configuration and monitor its changes

The purpose of this library is to help with both of these tasks (as long
as they remain in the domain of configuration available to the system).  In
other words, if a DPDK application has some special needs that cannot be
addressed by the normal system configuration utilities, then they need
to be solved by the application itself.

The connection between DPDK and the system is based on the existence of
ports that are visible to both DPDK and the system (like Tap, KNI and
possibly some other drivers).  These ports serve as interface
proxies.

Let's visualize the action of the library with the following example:

              Linux             |            DPDK
==============================================================
                                |
                                |   +-------+       +-------+
                                |   | Port1 |       | Port2 |
"ip link set dev tap1 mtu 1600" |   +-------+       +-------+
                          |     |       ^              ^ ^
                          |  +------+   | mtu_change   | |
                          `->| Tap1 |---' callback     | |
                             +------+                  | |
"ip addr add 198.51.100.14 \    |                      | |
                  dev tap2"     |                      | |
                          |  +------+                  | |
                          +->| Tap2 |------------------' |
                          |  +------+  addr_add callback |
"ip route add 198.0.2.0/24 \    |  |                     |
                  dev tap2"     |  | route_add callback  |
                                |  `---------------------'

So we have two ports, Port1 and Port2, that are not visible to the system.
We create two proxy interfaces (here based on the Tap driver) and bind the
ports to their proxies.  When the user issues a command changing the MTU of
the Tap1 interface the library notes this and calls the "mtu_change" callback
for Port1.  Similarly, when the user adds an IPv4 address to the Tap2
interface the "addr_add" callback is called for Port2, and the same
happens for configuration of a routing rule pointing to Tap2.  Apart from
callbacks this library can notify about changes by adding events to
notification queues.  See below for more information about that and
a complete list of available callbacks.

Please note that nothing has been said about forwarding of packets
between the system and DPDK.  Since the proxies are normal DPDK ports you
can receive from/send to them via the usual RX/TX burst API.  However,
since the library is not aware of the packet processing structure used by
the application, it cannot automatically forward packets - it is the
responsibility of the application to include the proxy ports in its
packet processing engine.

As mentioned above, the intention of the library is to:
- provide information about the network configuration that allows the
  application to decide what to do with packets received on DPDK
  ports,
- allow control of the ports via standard configuration utilities

Although the library only helps you to identify the proxy for a given port
(and vice versa) and calls the appropriate callbacks, it does open some
interesting possibilities.  For example, you can use the proxy ports to
forward packets for protocols that you do not wish to handle in the DPDK
application to the system protocol stack, and just listen to the
configuration changes - that way you can "offload" handling of those
protocols to the system.

How to use it
=============

Usage of this library is rather simple.  You have to (a minimal sketch of
these steps follows the list):
1. Create a proxy (if you don't have a port suitable for being a proxy, or
   you have one but do not wish to use it as a proxy).
2. Bind a port to the proxy.
3. Register callbacks and/or event queues.
4. Start listening to the network configuration.
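
A minimal sketch of these steps could look as follows (error handling and
the exact return conventions are only indicative here; 'cbs' and 'ev_ring'
are prepared by the application as described in the sections below):

#include <rte_ethdev.h>
#include <rte_ring.h>
#include <rte_if_proxy.h>

static int setup_if_proxy(uint16_t port_id, struct rte_ifpx_callbacks *cbs,
                          struct rte_ring *ev_ring)
{
    /* 1. Create a proxy port (Tap based by default). */
    uint16_t proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);

    if (proxy_id == RTE_MAX_ETHPORTS)
        return -1;

    /* 2. Bind the DPDK port to its proxy. */
    rte_ifpx_port_bind(port_id, proxy_id);

    /* 3. Register callbacks and/or event queues. */
    rte_ifpx_callbacks_register(cbs);
    rte_ifpx_queue_add(ev_ring);

    /* 4. Get the current configuration and start listening to changes. */
    rte_ifpx_listen();
    return 0;
}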

The only mandatory requirement for a DPDK port to be able to act as
a proxy is that it is visible in the system - this is checked during
port-to-proxy binding by calling rte_eth_dev_info_get() on the proxy port
and inspecting its 'if_index' field (it has to be non-zero).
One can create such a port in the application by calling:

  proxy_id = rte_ifpx_create(RTE_IFPX_DEFAULT);

Upon success this returns the id of the created DPDK proxy port (or
RTE_MAX_ETHPORTS on failure).  The argument selects the type of proxy port
to create (currently Tap/KNI only).  This function is actually just
a wrapper around:

  uint16_t rte_ifpx_create_by_devarg(const char *devarg);

creating a valid 'devarg' string for the chosen type of proxy.  If you have
another driver capable of acting as a proxy you can call
rte_ifpx_create_by_devarg() directly, passing an appropriate argument.
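
For example, assuming a Tap-based proxy and the usual tap PMD vdev syntax
for the devarg string (the exact format expected here is only an example,
not something specified above), that could look like:

  proxy_id = rte_ifpx_create_by_devarg("net_tap0,iface=dpdk_proxy0");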

Once you have the ids of both the port and the proxy you can bind the two via:

  rte_ifpx_port_bind(port_id, proxy_id);

This creates a logical binding - as mentioned above there is no automatic
packet forwarding.  With this binding, whenever the user changes the state
of the proxy interface in the system (link up/down, MAC/MTU change,
adding/removing an IPv4/IPv6 address) you get an appropriate notification
for the bound port.

So far we've mentioned several times that the library calls callbacks.
They are grouped in 'struct rte_ifpx_callbacks' and the user provides them
to the library via:

  rte_ifpx_callbacks_register(&cbs);

It is worth mentioning that the context (lcore/thread) in which these
callbacks are called is implementation defined.  It might differ between
different platforms, so the application needs to assume that some kind
of inter lcore/thread synchronization/communication is required.
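
To illustrate, a registration could look like the sketch below.  The exact
members of the event structures are not listed in this letter, so the
fields used in the callback (port_id, mtu) are only indicative:

#include <rte_ethdev.h>
#include <rte_if_proxy.h>

static int on_mtu_change(const struct rte_ifpx_mtu_change *ev)
{
    /* Field names are assumed here.  Remember that this runs in the
     * library's context, not on an lcore. */
    rte_eth_dev_set_mtu(ev->port_id, ev->mtu);
    return 1; /* non-zero: event considered handled, it will not be queued */
}

static void register_ifpx_callbacks(void)
{
    static struct rte_ifpx_callbacks cbs = {
        .mtu_change = on_mtu_change,
        /* callbacks left NULL: those events go only to the queues (below) */
    };

    rte_ifpx_callbacks_register(&cbs);
}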

Apart from notification via callbacks this library also supports
notification about changes by adding events to the configured
notification queues.  The queues are registered via:

  int rte_ifpx_queue_add(struct rte_ring *r);

and the actual logic used is: if there is a callback registered then it is
called; if it returns non-zero then the event is considered completed;
otherwise the event is added to each configured notification queue.
That way the application can update data structures that are safe to be
modified by a single writer from within the callback, or do the common
preprocessing steps (if any are needed) in the callback, while data that
is replicated can be updated during handling of the queued events.
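
For illustration, the queue path could look like the sketch below (the
layout of 'struct rte_ifpx_event' - a 'type' field plus per-type union
members, allocated by the library with malloc() - is only sketched here,
and handle_neigh_add() stands for application code):

#include <stdlib.h>
#include <rte_lcore.h>
#include <rte_ring.h>
#include <rte_if_proxy.h>

static struct rte_ring *ev_ring;

/* Application code updating e.g. a per-lcore neighbour table. */
static void handle_neigh_add(const struct rte_ifpx_neigh_change *nc);

static void ifpx_queue_setup(void)
{
    ev_ring = rte_ring_create("ifpx_events", 64, rte_socket_id(), 0);
    if (ev_ring != NULL)
        rte_ifpx_queue_add(ev_ring);
}

/* Called periodically from the owning lcore's main loop. */
static void ifpx_drain_events(void)
{
    struct rte_ifpx_event *ev;

    while (rte_ring_dequeue(ev_ring, (void **)&ev) == 0) {
        switch (ev->type) {
        case RTE_IFPX_NEIGH_ADD:
            handle_neigh_add(&ev->neigh_change);
            break;
        /* ... other event types ... */
        default:
            break;
        }
        free(ev); /* events are allocated by the library with malloc() */
    }
}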

Once we have the bindings in place and notifications configured, the only
essential part that remains is to get the current network configuration
and start listening to its changes.  This is accomplished via a call to:

  rte_ifpx_listen();

And this is basically all one needs to understand in order to use this
library.  Other, less essential, parts include:
- the ability to query what events are available on a given platform
- getting the mapping between proxy and port
- unbinding ports from a proxy
- destroying a proxy port
- closing the listening service
- getting basic information about a proxy


Currently available features and implementation
===============================================

The library's API is system independent but it obviously needs some
system dependent parts.  We provide an example Linux implementation (based
on netlink sockets).  A very similar implementation is possible for
FreeBSD (using PF_ROUTE sockets).  A Windows implementation would need to
differ significantly (the IP Helper library would probably be of some help).

Here is the list of currently implemented callbacks:

struct rte_ifpx_callbacks {
  int (*mac_change)(const struct rte_ifpx_mac_change *event);
  int (*mtu_change)(const struct rte_ifpx_mtu_change *event);
  int (*link_change)(const struct rte_ifpx_link_change *event);
  int (*addr_add)(const struct rte_ifpx_addr_change *event);
  int (*addr_del)(const struct rte_ifpx_addr_change *event);
  int (*addr6_add)(const struct rte_ifpx_addr6_change *event);
  int (*addr6_del)(const struct rte_ifpx_addr6_change *event);
  int (*route_add)(const struct rte_ifpx_route_change *event);
  int (*route_del)(const struct rte_ifpx_route_change *event);
  int (*route6_add)(const struct rte_ifpx_route6_change *event);
  int (*route6_del)(const struct rte_ifpx_route6_change *event);
  int (*neigh_add)(const struct rte_ifpx_neigh_change *event);
  int (*neigh_del)(const struct rte_ifpx_neigh_change *event);
  int (*neigh6_add)(const struct rte_ifpx_neigh6_change *event);
  int (*neigh6_del)(const struct rte_ifpx_neigh6_change *event);
  int (*cfg_done)(void);
};

They are all rather self-descriptive with the exception of the last one.
When the user calls rte_ifpx_listen() the library first queries the
system for its current configuration.  That might require several
request/reply exchanges between DPDK and the system, and once it is finished
this callback is called to let the application know that all the info has
been gathered.

It is also worth mentioning that while the typical case would be a 1-to-1
mapping between port and proxy, a 1-to-many mapping is also supported.
In that case the related callbacks will be called for each port bound to
a given proxy interface - it is the application's responsibility to define
the semantics of such a mapping (e.g. all changes apply to all ports, or
link changes apply to all but others are accepted in a "round robin"
fashion, or some other logic).

As mentioned above, the Linux implementation is based on a netlink socket.
This socket is registered as a file descriptor in EAL interrupts
(similarly to how EAL alarms are implemented).

What has changed since the RFC
==============================

- Platform dependent parts have been separated into an ifpx_platform
  structure with callbacks for initialization, getting information about
  the interface, listening to changes and closing the library.
  That should allow easier reimplementation.

- The notification scheme has been changed - instead of having just
  callbacks, event queueing is now also available (or a mix of the
  two).

- Events are now filtered so that only those related to the proxy ports
  are reported - previously all network configuration changes were
  reported.  A DPDK application doesn't need to know the whole
  configuration - only the portion related to the proxy ports.  If a
  packet arrives that does not match the rules then it can be forwarded
  via a proxy to the system to decide what to do with it.  If that is not
  desired and such packets should be dropped, then a null port can be
  created with a proxy and e.g. a default route installed on it.

- Removed the previous example which was just printing notifications.
  Instead added a simplified (stripped of vectorization and other
  performance improvements) version of l3fwd that should serve as an
  example of using this library in real applications.

Changes in V2
=============
- Cleaned up checkpatch warnings
- Removed dead/unused code and added gateway clearing in l3fwd-ifpx


With regards
Andrzej Ostruszka

Note: Patch 4 in this series has a dependency on:
  https://patchwork.dpdk.org/patch/66492/
so I add the newly proposed tag here:
Depends-on: series-8862

Andrzej Ostruszka (4):
  lib: introduce IF Proxy library
  if_proxy: add library documentation
  if_proxy: add simple functionality test
  if_proxy: add example application

 MAINTAINERS                                   |    6 +
 app/test/Makefile                             |    5 +
 app/test/meson.build                          |    4 +
 app/test/test_if_proxy.c                      |  707 +++++++++++
 config/common_base                            |    5 +
 config/common_linux                           |    1 +
 doc/guides/prog_guide/if_proxy_lib.rst        |  142 +++
 doc/guides/prog_guide/index.rst               |    1 +
 examples/Makefile                             |    1 +
 examples/l3fwd-ifpx/Makefile                  |   60 +
 examples/l3fwd-ifpx/l3fwd.c                   | 1131 +++++++++++++++++
 examples/l3fwd-ifpx/l3fwd.h                   |   98 ++
 examples/l3fwd-ifpx/main.c                    |  740 +++++++++++
 examples/l3fwd-ifpx/meson.build               |   11 +
 examples/meson.build                          |    2 +-
 lib/Makefile                                  |    2 +
 .../common/include/rte_eal_interrupts.h       |    2 +
 lib/librte_eal/linux/eal/eal_interrupts.c     |   14 +-
 lib/librte_if_proxy/Makefile                  |   29 +
 lib/librte_if_proxy/if_proxy_common.c         |  494 +++++++
 lib/librte_if_proxy/if_proxy_priv.h           |   97 ++
 lib/librte_if_proxy/linux/Makefile            |    4 +
 lib/librte_if_proxy/linux/if_proxy.c          |  550 ++++++++
 lib/librte_if_proxy/meson.build               |   19 +
 lib/librte_if_proxy/rte_if_proxy.h            |  561 ++++++++
 lib/librte_if_proxy/rte_if_proxy_version.map  |   19 +
 lib/meson.build                               |    2 +-
 27 files changed, 4701 insertions(+), 6 deletions(-)
 create mode 100644 app/test/test_if_proxy.c
 create mode 100644 doc/guides/prog_guide/if_proxy_lib.rst
 create mode 100644 examples/l3fwd-ifpx/Makefile
 create mode 100644 examples/l3fwd-ifpx/l3fwd.c
 create mode 100644 examples/l3fwd-ifpx/l3fwd.h
 create mode 100644 examples/l3fwd-ifpx/main.c
 create mode 100644 examples/l3fwd-ifpx/meson.build
 create mode 100644 lib/librte_if_proxy/Makefile
 create mode 100644 lib/librte_if_proxy/if_proxy_common.c
 create mode 100644 lib/librte_if_proxy/if_proxy_priv.h
 create mode 100644 lib/librte_if_proxy/linux/Makefile
 create mode 100644 lib/librte_if_proxy/linux/if_proxy.c
 create mode 100644 lib/librte_if_proxy/meson.build
 create mode 100644 lib/librte_if_proxy/rte_if_proxy.h
 create mode 100644 lib/librte_if_proxy/rte_if_proxy_version.map
  

Comments

David Marchand March 25, 2020, 8:08 a.m. UTC | #1
Hello Andrzej,

On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka
<aostruszka@marvell.com> wrote:
[...]

I can see we end up exposing structures for registering callbacks.
Did you consider some ways to avoid exposure of those? (thinking of
ABI maintenance for when this library will elect to non-experimental).
I can see some canary at the end of an enum, can we do without it?

Is there a pb with merging ifpx support into the existing l3fwd
application rather than introduce a new example?
  
Morten Brørup March 25, 2020, 11:11 a.m. UTC | #2
Andrzej,

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of David Marchand
> Sent: Wednesday, March 25, 2020 9:08 AM
> 
> Hello Andrzej,
> 
> On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka
> <aostruszka@marvell.com> wrote:
> >

<snip>

> >
> > What has changed since the RFC
> > ==============================
> >
> > - Platform dependent parts has been separated into a ifpx_platform
> >   structure with callbacks for initialization, getting information
> about
> >   the interface, listening to the changes and closing of the library.
> >   That should allow easier reimplementation.
> >
> > - Notification scheme has been changed - instead of having just
> >   callbacks now event queueing is also available (or a mix of those
> >   two).

Thank you for adding event queueing!

David mentions ABI forward compatibility below. Consider using a dynamically sized generic TLV (type, length, value) message format instead of a big union structure for the events. This would make it easier to extend the list of event types without breaking the ABI.

And I am still strongly opposed to the callback method:
The callbacks are handled as DPDK interrupts, which are running in a non-DPDK thread, i.e. a running callback may be preempted by some other Linux process. This makes it difficult to implement callbacks correctly.
The risk of someone calling a non-thread safe function from a callback is high, e.g. DPDK hash table manipulation (except lookup) is not thread safe.

Your documentation is far too vague about this:
Please note however that the context in which these callbacks are 
called is most probably different from the one in which packets are 
handled and it is application writer responsibility to use proper 
synchronization mechanisms - if they are needed.

You need a big fat WARNING about how difficult the DPDK interrupt thread is to work with. As I described above, it is not "most probably" it is "certainly" a very different kind of context.

Did you check that the functions you use in your example callbacks are all thread safe and non-blocking, so they can safely be called from a non-DPDK thread that may be preempted by another Linux process?

<snip>

> 
> I can see we end up exposing structures for registering callbacks.
> Did you consider some ways to avoid exposure of those? (thinking of
> ABI maintenance for when this library will elect to non-experimental).
> I can see some canary at the end of an enum, can we do without it?
> 
> Is there a pb with merging ifpx support into the existing l3fwd
> application rather than introduce a new example?
> 
> 
> --
> David Marchand
>
  
Andrzej Ostruszka March 26, 2020, 12:41 p.m. UTC | #3
Thank you David for taking time to look at this.

On 3/25/20 9:08 AM, David Marchand wrote:
> Hello Andrzej,
> 
> On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka
[...]
> I can see we end up exposing structures for registering callbacks.

Right.  I was thinking more in terms of user convenience so it seemed
like a good choice to gather them in one struct and call 'register'
once.  The fact that the same structure is used to keep them is an
implementation choice and this can be decoupled.

> Did you consider some ways to avoid exposure of those? (thinking of
> ABI maintenance for when this library will elect to non-experimental).

I will.  So far I used the union for the input since I like when things
are well typed :) and there is no need for casting.  However I will
spend some time on this and will get back to you soon (if you already have
something in your head please share).  Right now I'm thinking
about taking an array of callbacks with each entry being an ("event type",
callback) pair, however I need to figure out how to have a minimum amount
of type casting.

> I can see some canary at the end of an enum, can we do without it?

I followed the discussion on the list about that and have thought about it
but deemed it not to be a problem.  This enum value is never returned
from the library and the event type enum is never taken as an input
(only used for event notification).  So this is really an implementation
detail and you are right that it would be better to hide it.  This might be
resolved by itself when I come up with something for the above ABI
stability issue.

> Is there a pb with merging ifpx support into the existing l3fwd
> application rather than introduce a new example?

I don't see a problem with merging per se.  That might be my
misunderstanding of what the examples are.  I thought that each library
can have its own example to show how it is supposed to be used.  So I
decided to have a simplified version of l3fwd - and initially I thought
about updating l3fwd, but it has some non-trivial optimizations and two
modes of operation (hash/lpm) so I wanted something simple to just show
how to use the library.  I don't know the reason for this
bi-modality of l3fwd:
- if this is just a need to show LPM/Hash in use then I can replace that
  with a single mode of l3fwd-ifpx where LPM is used for routing and Hash
  is used to keep neighbouring info
- if this is to show that both LPM and Hash can be used for routing then
  it would complicate things as these two have different update
  properties.

I assume (but don't have a solid proof for that) that LPM can be updated
by a single writer while being used by multiple readers and use this
assumption to show how such structures can be updated (Morten please
cover your eyes ;-)) from a callback while others can be updated via
event queueing.

So if the community decides that it would be OK to morph l3fwd to:
- strip the bi-modality
- use LPM and Hash for different things (not both for routing)
then I'm OK with that and will happily do that.  Otherwise adding IFPX
to l3fwd will end up with two modes with different routing
implementations and different update strategies - a bit like two
different apps bundled into one and chosen by the command arg.

There is also a question of not having FreeBSD and Windows support yet -
so things might get complicated.

With regards
Andrzej Ostruszka
  
Andrzej Ostruszka March 26, 2020, 5:42 p.m. UTC | #4
On 3/25/20 12:11 PM, Morten Brørup wrote:
[...]
>>> - Notification scheme has been changed - instead of having just
>>>   callbacks now event queueing is also available (or a mix of those
>>>   two).
> 
> Thank you for adding event queueing!

That was actually a good input from you - thank you.

> David mentions ABI forward compatibility below.
> Consider using a dynamically sized generic TLV (type, length, value)
> message format instead of a big union structure for the events. This
> would make it easier to extend the list of event types without breaking
> the ABI.

My understanding is that David was talking about registering of
callbacks and you want to extend this to event definition.

So let's focus on one example:
...
	RTE_IFPX_NEIGH_ADD,
	RTE_IFPX_NEIGH_DEL,
...
struct rte_ifpx_neigh_change {
	uint16_t port_id;
	struct rte_ether_addr mac;
	uint32_t ip;
};

Right now the event is defined as:

struct rte_ifpx_event {
	enum rte_ifpx_event_type type;
	union {
	...
		struct rte_ifpx_neigh_change neigh_change;
	...
	};
};

So what the user does is a switch on event->type:

	switch (ev->type) {
		case RTE_IFPX_NEIGH_ADD:
			handle_neigh_add(lconf, &ev->neigh_change);
			break;
		case RTE_IFPX_NEIGH_DEL:
			handle_neigh_del(lconf, &ev->neigh_change);
			break;

How would adding more event types to this union break the ABI?  The user
gets an event from the queue (allocated by the lib), checks the type and
casts the pointer past the 'type' to the proper event definition.  And when
done with the event simply free()s it (BTW right now it is malloc() not
rte_malloc() - should I change that?).  If the app links against a newer
version of the lib then it might get a type which it does not
understand/handle, so it should skip it (possibly with a warning).  I'm not
sure how changing rte_ifpx_event to:

struct rte_ifpx_event {
	enum rte_ifpx_event_type type;
	int length;
	uint8_t data[];
};

would help here.  The user would need to cast data based on the event type
whereas now it takes the address of the proper union member - and the union
is there only to avoid casting.  In both cases what is important is that
RTE_IFPX_NEIGH_ADD/DEL and "struct rte_ifpx_neigh_change" don't change
between versions (new values can be added - or new versions of the
previously existing events when trying to make a change).

And for the callbacks it is more or less the same - the library will prepare
the data and call the callback with a pointer to this data.  Handling of new
event types should be automatic once I implement what David wanted -
the lib callback for a new event will simply be NULL, nothing will be
called, and the application will keep working without problems.

> And I am still strongly opposed to the callback method:

Noted - however for now I would like to keep them.  I don't have much
experience with this library so if they prove to be inadequate then we
will remove them.  Right now they seem to add some flexibility that I like:
- if something should be changed globally and once (and it is safe to do
  so!) then it can be done from the callback
- if something can be prepared once and consumed later by lcores then it
  can be done in callback and the callback returns 0 so that event is
  still queued and lcores (under assumption that queues are per lcore)
  pick up what has been prepared.

> The callbacks are handled as DPDK interrupts, which are running in a non-DPDK
> thread, i.e. a running callback may be preempted by some other Linux process.
> This makes it difficult to implement callbacks correctly.
> The risk of someone calling a non-thread safe function from a callback is high,
> e.g. DPDK hash table manipulation (except lookup) is not thread safe.
> 
> Your documentation is far too vague about this:
> Please note however that the context in which these callbacks are 
> called is most probably different from the one in which packets are 
> handled and it is application writer responsibility to use proper 
> synchronization mechanisms - if they are needed.
> 
> You need a big fat WARNING about how difficult the DPDK interrupt thread is to
> work with. As I described above, it is not "most probably" it is "certainly" a
> very different kind of context.

OK.  Will update in next version.

> Did you check that the functions you use in your example callbacks are all
> thread safe and non-blocking, so they can safely be called from a non-DPDK thread
> that may be preempted by a another Linux process?

I believe so.  However there is a big question whether my assumption
about LPM is correct.  I've looked at the code and it looks like it is, but
I'm not in a position to authoritatively declare it.  So again, to me LPM
looks safe to be changed by a single writer while being used by
multiple readers (with an obvious transient period when a rule is being
expanded and some IPs might go to an old and some to a new destination).

With regards
Andrzej Ostruszka
  
Andrzej Ostruszka March 30, 2020, 7:23 p.m. UTC | #5
On 3/26/20 1:41 PM, Andrzej Ostruszka wrote:
> Thank you David for taking time to look at this.
> 
> On 3/25/20 9:08 AM, David Marchand wrote:
>> Hello Andrzej,
>>
>> On Tue, Mar 10, 2020 at 12:11 PM Andrzej Ostruszka
> [...]
>> I can see we end up exposing structures for registering callbacks.
> 
> Right.  I was thinking more in terms of user convenience so it seemed
> like a good choice to gather them in one struct and call 'register'
> once.  The fact that the same structure is used to keep them is an
> implementation choice and this can be decoupled.
> 
>> Did you consider some ways to avoid exposure of those? (thinking of
>> ABI maintenance for when this library will elect to non-experimental).
> 
> I will.  So far I used the union for the input since I like when things
> are well typed :) and there is no need for casting.  However I will
> spend some time on this and will get back to you soon (if you have
> already something in your head please share).  Right now I'm thinking
> about taking array of callbacks with each entry being ("event type",
> callback) pair, however need to figure out how to have minimum amount of
> type casting.

David, I thought about this a bit and here is my proposal.

Define "typeful" callback pointer (public):

union rte_ifpx_cb_ptr {
    int (*mac_change)(const struct mac_change *ev);
    int (*mtu_change)(const struct mtu_change *ev);
    ...
    int (*cfg_done)(void);
};

In implementation make sure its size is as expected:

_Static_assert(sizeof(union rte_ifpx_cb_ptr) == sizeof (int(*)(void*)),
               "Size of callback pointer has to be"
               "equal to size of function pointer");

Accept as input tagged callbacks (also public type):

struct rte_ifpx_callback {
    enum rte_ifpx_event_type type;
    union rte_ifpx_cb_ptr callback;
};

The user would define an array of callbacks:

struct rte_ifpx_callback callbacks[] = {
    {RTE_IFPX_MAC_CHANGE, {.mac_change = mac_change}},
    {RTE_IFPX_MTU_CHANGE, {.mtu_change = mtu_change}},
    ...
    {RTE_IFPX_CFG_DONE,   {.cfg_done   = finished}},
};

and passing it to registration together with its length like:

int rte_ifpx_callbacks_register(int len,
                                const struct rte_ifpx_callback *cbs)
{
    for (int i = 0; i < len; ++i) {
        switch (cbs[i].type) {
            case RTE_IFPX_MAC_CHANGE:
                priv_cbs.mac_change = cbs[i].callback.mac_change;
                break;
    ...
}

This way we should be protected from ABI breakage when adding new event
types and how the callbacks are stored would not be visible to the user.

Let me know what you think about it.

With regards
Andrzej Ostruszka
  
Andrzej Ostruszka [C] April 2, 2020, 1:48 p.m. UTC | #6
On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> On 3/25/20 12:11 PM, Morten Brørup wrote:
[...]
>> And I am still strongly opposed to the callback method:
> 
> Noted - however for now I would like to keep them.  I don't have much
> experience with this library so if they prove to be inadequate then we
> will remove them.  Right now they seem to add some flexibility that I like:
> - if something should be changed globally and once (and it is safe to do
>   so!) then it can be done from the callback
> - if something can be prepared once and consumed later by lcores then it
>   can be done in callback and the callback returns 0 so that event is
>   still queued and lcores (under assumption that queues are per lcore)
>   pick up what has been prepared.

Morten

I've been thinking about this a bit and would like to know your (and
others') opinion about the following proposed enhancement.

Right now, how queues are used is left to the application's decision (per
lcore, per port, ...) - and I intend to keep it that way - but they are
"match all".  What I mean by that is that (unlike callbacks, where you have
a separate one per event type) a queue has no chance to be selective.

So if someone would like to go with queues only, then he would have to
coordinate between queues (or their "owners") which one does the
handling of an event that supposedly should be handled only once.

Let's take this forwarding example - the queues are per lcore and each
lcore keeps its own copy of the ARP table (hash), so when a change is
noticed the event is queued to all registered queues, each lcore updates
its own copy and everything is OK.  However the routing is global (and
right now is updated from a callback) and if no callback is used for that
then the event would be queued to all lcores and the application would need
to select the one which does the update.

Would it be easier/better to register a queue together with a bitmask of
event types that the given queue accepts?  Then during the setup phase the
application would select just one queue to handle "global" events and
the logic of event handling for lcores would be simpler.

Let me know what you think.

With regards
Andrzej Ostruszka
  
Thomas Monjalon April 3, 2020, 5:19 p.m. UTC | #7
02/04/2020 15:48, Andrzej Ostruszka [C]:
> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > On 3/25/20 12:11 PM, Morten Brørup wrote:
> [...]
> >> And I am still strongly opposed to the callback method:
> > 
> > Noted - however for now I would like to keep them.  I don't have much
> > experience with this library so if they prove to be inadequate then we
> > will remove them.  Right now they seem to add some flexibility that I like:
> > - if something should be changed globally and once (and it is safe to do
> >   so!) then it can be done from the callback
> > - if something can be prepared once and consumed later by lcores then it
> >   can be done in callback and the callback returns 0 so that event is
> >   still queued and lcores (under assumption that queues are per lcore)
> >   pick up what has been prepared.
> 
> Morten
> 
> I've been thinking about this a bit and would like to know your (and
> others) opinion about following proposed enhancement.
> 
> Right now, how queues are used is left to the application decision (per
> lcore, per port, ...) - and I intend to keep it that way - but they are
> "match all".  What I mean by that is that (unlike callbacks where you
> have separate per event type) queue has no chance to be selective.
> 
> So if someone would like to go with queues only they he would have to
> coordinate between queues (or their "owners") which one does the
> handling of an event that supposedly should be handled only once.
> 
> Let's take this forwarding example - the queues are per lcore and each
> lcore keeps its own copy of ARP table (hash) so when the change is
> noticed the event is queued to all registered queue, each lcore updates
> its own copy and everything is OK.  However the routing is global (and
> right now is updated from callback) and if no callback is used for that
> then the event would be queued to all lcores and application would need
> to select the one which does the update.
> 
> Would that be easier/better to register queue together with a bitmask of
> event types that given queue is accepting?  Than during setup phase
> application would select just one queue to handle "global" events and
> the logic of event handling for lcores should be simplier.
> 
> Let me know what you think.

I think we want to avoid a complicated design.
So let's choose between callback and message queue.
I vote for message queue because it can handle any situation,
and it allows to control the context of the event processing.
The other reason is that I believe we need message queueing for
other purposes in DPDK (ex: multi-process, telemetry).

You start thinking about complex message management.
And I start thinking about other usages of message queueing.
So I think it is the right time to introduce a generic messaging in DPDK.
Note: the IPC rte_mp should be built on top of such generic messaging.

If you agree, we can start a new email thread to better discuss
the generic messaging sub-system.
I describe here the 3 properties I have in mind:

1/ Message policy
One very important rule in DPDK is to let the control to the application.
So the messaging policy must be managed by the application via DPDK API.

2/ Message queue
It seems we should rely on ZeroMQ. Here is why:
http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ

3/ Message format
I am not sure whether we can manage with "simple strings", TLV,
or should we use something more complex like protobuf?
  
Jerin Jacob April 3, 2020, 7:09 p.m. UTC | #8
On Fri, Apr 3, 2020 at 10:49 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > > On 3/25/20 12:11 PM, Morten Brørup wrote:
> > [...]
> > >> And I am still strongly opposed to the callback method:
> > >
> > > Noted - however for now I would like to keep them.  I don't have much
> > > experience with this library so if they prove to be inadequate then we
> > > will remove them.  Right now they seem to add some flexibility that I like:
> > > - if something should be changed globally and once (and it is safe to do
> > >   so!) then it can be done from the callback
> > > - if something can be prepared once and consumed later by lcores then it
> > >   can be done in callback and the callback returns 0 so that event is
> > >   still queued and lcores (under assumption that queues are per lcore)
> > >   pick up what has been prepared.
> >
> > Morten
> >
> > I've been thinking about this a bit and would like to know your (and
> > others) opinion about following proposed enhancement.
> >
> > Right now, how queues are used is left to the application decision (per
> > lcore, per port, ...) - and I intend to keep it that way - but they are
> > "match all".  What I mean by that is that (unlike callbacks where you
> > have separate per event type) queue has no chance to be selective.
> >
> > So if someone would like to go with queues only they he would have to
> > coordinate between queues (or their "owners") which one does the
> > handling of an event that supposedly should be handled only once.
> >
> > Let's take this forwarding example - the queues are per lcore and each
> > lcore keeps its own copy of ARP table (hash) so when the change is
> > noticed the event is queued to all registered queue, each lcore updates
> > its own copy and everything is OK.  However the routing is global (and
> > right now is updated from callback) and if no callback is used for that
> > then the event would be queued to all lcores and application would need
> > to select the one which does the update.
> >
> > Would that be easier/better to register queue together with a bitmask of
> > event types that given queue is accepting?  Than during setup phase
> > application would select just one queue to handle "global" events and
> > the logic of event handling for lcores should be simplier.
> >
> > Let me know what you think.
>
> I think we want to avoid complicate design.
> So let's choose between callback and message queue.
> I vote for message queue because it can handle any situation,
> and it allows to control the context of the event processing.

IMO, it should be left to the application's decision.  The application can
use either a callback or
a message queue based on its design, and I don't think DPDK needs to
enforce a certain model.
On the upside, by giving two options, the application can choose the right model.
For a simple use case like updating the global routing table, the
callback scheme would be more than enough.
The downside of pushing the architecture to a message queue would
be that the application either needs to create an additional control thread
to poll or call select() to
get the event or, in the worst case, to check the message queue emptiness in
the fastpath.
So why enforce it?

Thoughts?


> The other reason is that I believe we need message queueing for
> other purposes in DPDK (ex: multi-process, telemetry).

As far as I know, telemetry is using a Linux socket for IPC, and I am not sure
why we need to standardize message queue infra, because each use
case is different.

>
> You start thinking about complex message management.
> And I start thinking about other usages of message queueing.
> So I think it is the right time to introduce a generic messaging in DPDK.
> Note: the IPC rte_mp should be built on top of such generic messaging.
>
> If you agree, we can start a new email thread to better discuss
> the generic messaging sub-system.
> I describe here the 3 properties I have in mind:
>
> 1/ Message policy
> One very important rule in DPDK is to let the control to the application.
> So the messaging policy must be managed by the application via DPDK API.

Do you mean send() and recv() should be wrapped in DPDK calls?

>
> 2/ Message queue
> It seems we should rely on ZeroMQ. Here is why:
> http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ

IMO, ZeroMQ is used for IPC over the network etc. In this case, the purpose is
to pass the Netlink message IN THE SAME SYSTEM to the application.
Do you need an external library dependency? On the same system or in a
multiprocess application, our rte_ring would be more than enough. Right?
If not, please enumerate the use case.

>
> 3/ Message format
> I am not sure whether we can manage with "simple strings", TLV,
> or should we use something more complex like protobuf?

In this use case, we are relaying the Netlink message to the application, at
least in the Linux case. I think the message should be similar to a Netlink
message and give provision for other OSes to use a similar scheme.

Why reinvent the wheel?


>
>
  
Morten Brørup April 3, 2020, 9:18 p.m. UTC | #9
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Friday, April 3, 2020 9:09 PM
> 
> On Fri, Apr 3, 2020 at 10:49 PM Thomas Monjalon <thomas@monjalon.net>
> wrote:
> >
> > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > > > On 3/25/20 12:11 PM, Morten Brørup wrote:
> > > [...]
> > > >> And I am still strongly opposed to the callback method:
> > > >
> > > > Noted - however for now I would like to keep them.  I don't have
> much
> > > > experience with this library so if they prove to be inadequate
> then we
> > > > will remove them.  Right now they seem to add some flexibility
> that I like:
> > > > - if something should be changed globally and once (and it is
> safe to do
> > > >   so!) then it can be done from the callback
> > > > - if something can be prepared once and consumed later by lcores
> then it
> > > >   can be done in callback and the callback returns 0 so that
> event is
> > > >   still queued and lcores (under assumption that queues are per
> lcore)
> > > >   pick up what has been prepared.
> > >
> > > Morten
> > >
> > > I've been thinking about this a bit and would like to know your
> (and
> > > others) opinion about following proposed enhancement.
> > >
> > > Right now, how queues are used is left to the application decision
> (per
> > > lcore, per port, ...) - and I intend to keep it that way - but they
> are
> > > "match all".  What I mean by that is that (unlike callbacks where
> you
> > > have separate per event type) queue has no chance to be selective.
> > >
> > > So if someone would like to go with queues only they he would have
> to
> > > coordinate between queues (or their "owners") which one does the
> > > handling of an event that supposedly should be handled only once.
> > >
> > > Let's take this forwarding example - the queues are per lcore and
> each
> > > lcore keeps its own copy of ARP table (hash) so when the change is
> > > noticed the event is queued to all registered queue, each lcore
> updates
> > > its own copy and everything is OK.  However the routing is global
> (and
> > > right now is updated from callback) and if no callback is used for
> that
> > > then the event would be queued to all lcores and application would
> need
> > > to select the one which does the update.
> > >
> > > Would that be easier/better to register queue together with a
> bitmask of
> > > event types that given queue is accepting?  Than during setup phase
> > > application would select just one queue to handle "global" events
> and
> > > the logic of event handling for lcores should be simplier.
> > >
> > > Let me know what you think.
> >
> > I think we want to avoid complicate design.
> > So let's choose between callback and message queue.
> > I vote for message queue because it can handle any situation,
> > and it allows to control the context of the event processing.
> 
> IMO, it should be left to application decision, Application can use
> either callback or
> message queue based on their design and I don't think, DPDK needs to
> enforce certain model.
> On the upside, Giving two options, the application can choose the right
> model.
> The simple use case like updating the global routing table, The
> callback scheme would be more than enough.
> The downside of pushing the architecture to message queue would
> be that application either need to create additional control thread to
> poll or call select()
> get the event or in worst case check the message queue emptiness in
> fastpath.
> So why to enforce?
> 
> Thoughts?

A message queue would not require an additional control thread. It would use the existing control thread that the application already has.

I think you are missing an important point:

The application needs to handle all control plane interactions, not just control plane interactions related to the interface proxy library.

So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events.

So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events.

> > The other reason is that I believe we need message queueing for
> > other purposes in DPDK (ex: multi-process, telemetry).
> 
> As far as I know, telemetry is using Linux socket fro IPC, I am not
> sure
> why do we need to standardize message queue infra? Becasue, each use
> case is different.

I think Thomas is suggesting that we consider the generic case of interaction with the control plane, as I described above. Not just interaction with the interface proxy events.

> >
> > You start thinking about complex message management.
> > And I start thinking about other usages of message queueing.
> > So I think it is the right time to introduce a generic messaging in
> DPDK.
> > Note: the IPC rte_mp should be built on top of such generic
> messaging.
> >
> > If you agree, we can start a new email thread to better discuss
> > the generic messaging sub-system.

I agree that it should be separated from the interface proxy library.

And yes, DPDK is missing a generic framework - or at least a "best practices" description - for interaction between the control plane and the data plane. So far, every DPDK application developer has to come up with his own.

> > I describe here the 3 properties I have in mind:
> >
> > 1/ Message policy
> > One very important rule in DPDK is to let the control to the
> application.
> > So the messaging policy must be managed by the application via DPDK
> API.
> 
> Do you mean send() and recv() should be wrapped around DPDK call?
> 
> >
> > 2/ Message queue
> > It seems we should rely on ZeroMQ. Here is why:
> > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
> 
> IMO, ZeroMQ used for IPC over network etc. In this case, the purpose is
> to pass the Netlink message IN THE SAME SYSTEM to application.
> Do you need external library dependency? On the same system or
> multiprocess application, our rte_ring would be more than enough.
> Right?
> If not, please enumerate the use case.
> 
> >
> > 3/ Message format
> > I am not sure whether we can manage with "simple strings", TLV,
> > or should we use something more complex like protobuf?

Lean and mean is the way to go. A binary format, please. No more JSON or similar bloated encoding!

> 
> In this use case, we are relying the Netlink message to application at
> least
> in Linux case. I think the message should be similar to Netlink message
> and give
> provision for other OS'es such as scheme.
> 
> Why reinvent the wheel?
  
Thomas Monjalon April 3, 2020, 9:42 p.m. UTC | #10
Hi Andrzej,

Thanks for the very good explanations in the cover letter.
I have several comments and questions about the design.
I think IF proxy is a good idea which should be part of a bigger plan.


10/03/2020 12:10, Andrzej Ostruszka:
> What is this useful for
> =======================
> 
> Usually, when an ethernet port is assigned to DPDK it vanishes from the
> system and user looses ability to control it via normal configuration
> utilities (e.g. those from iproute2 package).  Moreover by default DPDK
> application is not aware of the network configuration of the system.
> 
> To address both of these issues application needs to:
> - add some command line interface (or other mechanism) allowing for
>   control of the port and its configuration
> - query the status of network configuration and monitor its changes
> 
> The purpose of this library is to help with both of these tasks (as long
> as they remain in domain of configuration available to the system).  In
> other words, if DPDK application has some special needs, that cannot be
> addressed by the normal system configuration utilities, then they need
> to be solved by the application itself.

In any case, the application must be in the loop.
The application should always remain in control.
When querying some information, nothing needs to be controlled I guess.
But when adjusting some configuration, the application must be able
to be notified and decide which change is allowed.
Of course, the application might allow being bypassed.

Currently this rule is not respected in the rte_mp IPC system.
I think rte_mp and IF proxy should follow the same path,
keeping the primary application process in control.

I would like not only the secondary process and IF proxy to be able to use
this control path. It should be generic enough to allow any application
(local or remote) to be part of the control path, communicating with
the DPDK application's primary process.

As a summary, I propose to target the following goal:
implement a user configuration path as a DPDK standard
that the application can enable.

Do we agree that the exception packet path is out of scope?


[...]
> We create two proxy interfaces (here based on Tap driver) and bind the
> ports to their proxies.  When user issues a command changing MTU for
> Tap1 interface the library notes this and calls "mtu_change" callback
> for the Port1.  Similarly when user adds an IPv4 address to the Tap2
> interface "addr_add" callback is called for the Port2 and the same
> happens for configuration of routing rule pointing to Tap2.

Will it work as well with TC flow configuration converted to rte_flow?


> Apart from
> callbacks this library can notify about changes via adding events to
> notification queues.  See below for more inforamtion about that and
> a complete list of available callbacks.

There is a choice between a callback in a random context
and a read from a message queue in a controlled context.
The second option looks better.


> Please note that nothing has been mentioned about forwarding of the
> packets between system and DPDK.  Since the proxies are normal DPDK
> ports you can receive/send to them via usual RX/TX burst API.  However
> since the library is not aware of the structure of packet processing
> used by the application it cannot automatically forward the packets - it
> is responsibility of the application to include proxy ports into its
> packet processing engine.

So IF proxy does nothing special with packets, right?


> As mentioned above the intention of the library is to:
> - provide information about network configuration that would allow
>   application to decide what to do with the packets received on DPDK
>   ports,
> - allow for control of the ports via standard configuration utilities
> 
> Although the library only helps you to identify proxy for given port
> (and vice versa) and calls appropriate callbacks it does open some
> interesting possibilities.  For example you can use the proxy ports to
> forward packets for protocols that you do not wish to handle in DPDK
> application to the system protocol stack and just listen to the
> configuration changes - so that way you can "offload" handling of those
> protocols to the system.

Note that when using a bifurcated driver (af_xdp or mlx),
the exception path in the kernel is not going through DPDK.
Moreover, no proxy is needed for device configuration in such a case.


[...]
> The only mandatory requirement for DPDK port to be able to act as
> a proxy is that it is visible in the system - this is checked during
> port to proxy binding by calling rte_eth_dev_info_get() on proxy port
> and inspecting 'if_index' field (it has to be non-zero).

Simple, good :)
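
For reference, that check boils down to something like the minimal sketch
below (rte_eth_dev_info_get() and the 'if_index' field are existing ethdev
API; the helper name is only illustrative):

  #include <rte_ethdev.h>

  /* A port can act as a proxy only if it is visible in the system,
   * i.e. rte_eth_dev_info_get() reports a non-zero if_index. */
  static int
  port_can_be_proxy(uint16_t port_id)
  {
      struct rte_eth_dev_info info;

      if (rte_eth_dev_info_get(port_id, &info) != 0)
          return 0;
      return info.if_index != 0;
  }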


> This creates logical binding - as mentioned above there is no automatic
> packet forwarding.  With this binding whenever user changes the state of
> proxy interface in the system (link up/down, change mac/mtu, add/remove
> IPv4/IPv6) you get appropriate notification for the bound port.

When configuring a port via DPDK API, is it mirrored automatically
to the kernel device?


> So far we've mentioned several times that the library calls callbacks.
> They are grouped in 'struct rte_ifpx_callbacks' and user provides them
> to the library via:
> 
>   rte_ifpx_callbacks_register(&cbs);
> 
> It is worth mentioning that the context (lcore/thread) in which these
> callbacks are called is implementation defined.  It might differ between
> different platforms, so the application needs to assume that some kind
> of inter lcore/thread synchronization/communication is required.
> 
> Apart from notification via callbacks this library also supports
> notifying about the changes via adding events to the configured
> notification queues.  The queues are registered via:
> 
>   int rte_ifpx_queue_add(struct rte_ring *r);
> 
> and the actual logic used is: if there is callback registered then it is
> called, if it returns non-zero then event is considered completed,
> otherwise event is added to each configured notification queue.
> That way application can update data structures that are safe to be
> modified by single writer from within callback or do the common
> preprocessing steps (if any needed) in callback and data that is
> replicated can be updated during handling of queued events.

As explained above, the application must control every change.

One issue is thread safety.
The simplest model is to manage the control path from a single thread
in the primary process.

If we create an API to allow the application to manage the control path
from external requests, I think it should be a building block
independent of IF proxy. Then IF proxy can plug into this subsystem.
It would allow other control path mechanisms to co-exist.
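
For clarity, the dispatch rule quoted above amounts to roughly the
following sketch (the event structure, callback table and queue array
names are illustrative, not the actual rte_ifpx internals):

  #include <rte_ring.h>

  /* Call the registered callback first; only if it returns 0 (or no
   * callback is registered) fan the event out to every notification queue. */
  static void
  dispatch_event(struct ifpx_event *ev)
  {
      unsigned int i;

      if (callbacks[ev->type] != NULL && callbacks[ev->type](ev) != 0)
          return;                 /* callback completed the event */

      for (i = 0; i < nb_queues; i++)
          rte_ring_enqueue(queues[i], ev);
  }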


[...]
> It is worth to mention also that while typical case would be a 1-to-1
> mapping between port and proxy, the 1-to-many mapping is also supported.
> In that case related callbacks will be called for each port bound to
> given proxy interface - it is application responsibility to define
> semantic of such mapping (e.g. all changes apply to all ports, or link
> changes apply to all but other are accepted in "round robin" fashion, or
> some other logic).

I don't get the interest of one-to-many mapping.


[...]

Thanks for the work.
It seems there are some overlaps with telemetry and rte_mp channels.
The same channel could also be used for dynamic tracing commands
or for remote control.
Would you be OK to extend it to a global control subsystem,
having IF proxy plugged in?
  
Thomas Monjalon April 3, 2020, 9:57 p.m. UTC | #11
03/04/2020 23:18, Morten Brørup:
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> > Thomas Monjalon <thomas@monjalon.net> wrote:
> > > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > > > > On 3/25/20 12:11 PM, Morten Brørup wrote:
> > > > [...]
> > > > >> And I am still strongly opposed to the callback method:
> > > > >
> > > > > Noted - however for now I would like to keep them.  I don't have
> > much
> > > > > experience with this library so if they prove to be inadequate
> > then we
> > > > > will remove them.  Right now they seem to add some flexibility
> > that I like:
> > > > > - if something should be changed globally and once (and it is
> > safe to do
> > > > >   so!) then it can be done from the callback
> > > > > - if something can be prepared once and consumed later by lcores
> > then it
> > > > >   can be done in callback and the callback returns 0 so that
> > event is
> > > > >   still queued and lcores (under assumption that queues are per
> > lcore)
> > > > >   pick up what has been prepared.
> > > >
> > > > Morten
> > > >
> > > > I've been thinking about this a bit and would like to know your
> > (and
> > > > others) opinion about following proposed enhancement.
> > > >
> > > > Right now, how queues are used is left to the application decision
> > (per
> > > > lcore, per port, ...) - and I intend to keep it that way - but they
> > are
> > > > "match all".  What I mean by that is that (unlike callbacks where
> > you
> > > > have separate per event type) queue has no chance to be selective.
> > > >
> > > > So if someone would like to go with queues only they he would have
> > to
> > > > coordinate between queues (or their "owners") which one does the
> > > > handling of an event that supposedly should be handled only once.
> > > >
> > > > Let's take this forwarding example - the queues are per lcore and
> > each
> > > > lcore keeps its own copy of ARP table (hash) so when the change is
> > > > noticed the event is queued to all registered queue, each lcore
> > updates
> > > > its own copy and everything is OK.  However the routing is global
> > (and
> > > > right now is updated from callback) and if no callback is used for
> > that
> > > > then the event would be queued to all lcores and application would
> > need
> > > > to select the one which does the update.
> > > >
> > > > Would that be easier/better to register queue together with a
> > bitmask of
> > > > event types that given queue is accepting?  Than during setup phase
> > > > application would select just one queue to handle "global" events
> > and
> > > > the logic of event handling for lcores should be simplier.
> > > >
> > > > Let me know what you think.
> > >
> > > I think we want to avoid complicate design.
> > > So let's choose between callback and message queue.
> > > I vote for message queue because it can handle any situation,
> > > and it allows to control the context of the event processing.
> > 
> > IMO, it should be left to application decision, Application can use
> > either callback or
> > message queue based on their design and I don't think, DPDK needs to
> > enforce certain model.
> > On the upside, Giving two options, the application can choose the right
> > model.
> > The simple use case like updating the global routing table, The
> > callback scheme would be more than enough.
> > The downside of pushing the architecture to message queue would
> > be that application either need to create additional control thread to
> > poll or call select()
> > get the event or in worst case check the message queue emptiness in
> > fastpath.
> > So why to enforce?
> > 
> > Thoughts?
> 
> A message queue would not require an additional control thread. It would use the existing control thread that the application already has.
> 
> I think you are missing an important point:
> 
> The application needs to handle all control plane interactions,
> not just control plane interactions related to the interface proxy library.

Yes this is the point.

> So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events.
> 
> So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events.
> 
> > > The other reason is that I believe we need message queueing for
> > > other purposes in DPDK (ex: multi-process, telemetry).
> > 
> > As far as I know, telemetry is using Linux socket fro IPC, I am not
> > sure
> > why do we need to standardize message queue infra? Becasue, each use
> > case is different.
> 
> I think Thomas is suggesting that we consider the generic case of
> interaction with the control plane, as I described above.
> Not just interaction with the interface proxy events.
> 
> > >
> > > You start thinking about complex message management.
> > > And I start thinking about other usages of message queueing.
> > > So I think it is the right time to introduce a generic messaging in
> > DPDK.
> > > Note: the IPC rte_mp should be built on top of such generic
> > messaging.
> > >
> > > If you agree, we can start a new email thread to better discuss
> > > the generic messaging sub-system.
> 
> I agree that it should be separated from the interface proxy library.
> 
> And yes, DPDK is missing a generic framework - or at least a "best practices" description - for interaction between the control plane and the data plane. So far, every DPDK application developer has to come up with his own.
> 
> > > I describe here the 3 properties I have in mind:
> > >
> > > 1/ Message policy
> > > One very important rule in DPDK is to let the control to the
> > application.
> > > So the messaging policy must be managed by the application via DPDK
> > API.
> > 
> > Do you mean send() and recv() should be wrapped around DPDK call?

I am thinking about something a bit more complex, with handler
registration and default handlers in each DPDK library.


> > > 2/ Message queue
> > > It seems we should rely on ZeroMQ. Here is why:
> > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
> > 
> > IMO, ZeroMQ used for IPC over network etc. In this case, the purpose is
> > to pass the Netlink message IN THE SAME SYSTEM to application.
> > Do you need external library dependency? On the same system or
> > multiprocess application, our rte_ring would be more than enough.
> > Right?
> > If not, please enumerate the use case.

Network communication will allow standardizing a DPDK remote control.
With ZeroMQ, it comes for free.


> > > 3/ Message format
> > > I am not sure whether we can manage with "simple strings", TLV,
> > > or should we use something more complex like protobuf?
> 
> Lean and mean is the way to go. A binary format, please.
> No more JSON or similar bloated encoding!

JSON, like other text encodings, has one advantage: it is readable
when debugging.
But I tend to agree that TLV is probably a good fit.

> > In this use case, we are relying the Netlink message to application at
> > least
> > in Linux case. I think the message should be similar to Netlink message
> > and give
> > provision for other OS'es such as scheme.
> > 
> > Why reinvent the wheel?

I agree, we should not re-encode Netlink.
With a TLV format, we can just encapsulate Netlink for the generic
channel, and give it a message type to dispatch the message to the
right handler.
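
As a rough illustration of that idea, the generic channel message could be
framed as a TLV with the type selecting the handler and the raw Netlink
message carried as the value (the struct and field names below are purely
hypothetical, not an agreed format):

  #include <stdint.h>

  struct ctrl_msg {
      uint16_t type;      /* e.g. a hypothetical CTRL_MSG_NETLINK type */
      uint16_t flags;
      uint32_t length;    /* number of bytes in value[] */
      uint8_t  value[];   /* raw Netlink message, passed through unchanged */
  };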
  
Jerin Jacob April 4, 2020, 10:18 a.m. UTC | #12
On Sat, Apr 4, 2020 at 3:27 AM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 03/04/2020 23:18, Morten Brørup:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> > > Thomas Monjalon <thomas@monjalon.net> wrote:
> > > > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > > > > On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > > > > > On 3/25/20 12:11 PM, Morten Brørup wrote:
> > > > > [...]
> > > > > >> And I am still strongly opposed to the callback method:
> > > > > >
> > > > > > Noted - however for now I would like to keep them.  I don't have
> > > much
> > > > > > experience with this library so if they prove to be inadequate
> > > then we
> > > > > > will remove them.  Right now they seem to add some flexibility
> > > that I like:
> > > > > > - if something should be changed globally and once (and it is
> > > safe to do
> > > > > >   so!) then it can be done from the callback
> > > > > > - if something can be prepared once and consumed later by lcores
> > > then it
> > > > > >   can be done in callback and the callback returns 0 so that
> > > event is
> > > > > >   still queued and lcores (under assumption that queues are per
> > > lcore)
> > > > > >   pick up what has been prepared.
> > > > >
> > > > > Morten
> > > > >
> > > > > I've been thinking about this a bit and would like to know your
> > > (and
> > > > > others) opinion about following proposed enhancement.
> > > > >
> > > > > Right now, how queues are used is left to the application decision
> > > (per
> > > > > lcore, per port, ...) - and I intend to keep it that way - but they
> > > are
> > > > > "match all".  What I mean by that is that (unlike callbacks where
> > > you
> > > > > have separate per event type) queue has no chance to be selective.
> > > > >
> > > > > So if someone would like to go with queues only they he would have
> > > to
> > > > > coordinate between queues (or their "owners") which one does the
> > > > > handling of an event that supposedly should be handled only once.
> > > > >
> > > > > Let's take this forwarding example - the queues are per lcore and
> > > each
> > > > > lcore keeps its own copy of ARP table (hash) so when the change is
> > > > > noticed the event is queued to all registered queue, each lcore
> > > updates
> > > > > its own copy and everything is OK.  However the routing is global
> > > (and
> > > > > right now is updated from callback) and if no callback is used for
> > > that
> > > > > then the event would be queued to all lcores and application would
> > > need
> > > > > to select the one which does the update.
> > > > >
> > > > > Would that be easier/better to register queue together with a
> > > bitmask of
> > > > > event types that given queue is accepting?  Than during setup phase
> > > > > application would select just one queue to handle "global" events
> > > and
> > > > > the logic of event handling for lcores should be simplier.
> > > > >
> > > > > Let me know what you think.
> > > >
> > > > I think we want to avoid complicate design.
> > > > So let's choose between callback and message queue.
> > > > I vote for message queue because it can handle any situation,
> > > > and it allows to control the context of the event processing.
> > >
> > > IMO, it should be left to application decision, Application can use
> > > either callback or
> > > message queue based on their design and I don't think, DPDK needs to
> > > enforce certain model.
> > > On the upside, Giving two options, the application can choose the right
> > > model.
> > > The simple use case like updating the global routing table, The
> > > callback scheme would be more than enough.
> > > The downside of pushing the architecture to message queue would
> > > be that application either need to create additional control thread to
> > > poll or call select()
> > > get the event or in worst case check the message queue emptiness in
> > > fastpath.
> > > So why to enforce?
> > >
> > > Thoughts?
> >
> > A message queue would not require an additional control thread. It would use the existing control thread that the application already has.

Assuming every application has a control thread.


> >
> > I think you are missing an important point:
> >
> > The application needs to handle all control plane interactions,
> > not just control plane interactions related to the interface proxy library.
>
> Yes this is the point.

OK. I think the following messages need to have a unified access scheme.

1) RTE_ETH_EVENT_ events registered using rte_eth_dev_callback_register()
2) rte_mp messages
3) telemetry control message for remote control

Future:
4) IF proxy library messages
5) adding the trace control message for remote control.

Since it is the control plane (slow path traffic without any
performance requirement), generalizing the message comes at zero cost.
+1 for standardizing the message if every subsystem is planning to do the same.
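
For context, item 1) above is registered per port today with the existing
ethdev API, e.g. (minimal sketch, error handling omitted; port_id is
assumed to have been obtained earlier):

  #include <stdio.h>
  #include <rte_common.h>
  #include <rte_ethdev.h>

  /* Existing per-port event callback, here for link status changes. */
  static int
  lsc_event_cb(uint16_t port_id, enum rte_eth_event_type event,
               void *cb_arg, void *ret_param)
  {
      RTE_SET_USED(event);
      RTE_SET_USED(cb_arg);
      RTE_SET_USED(ret_param);
      printf("link status changed on port %u\n", port_id);
      return 0;
  }

  /* during init */
  rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_INTR_LSC,
                                lsc_event_cb, NULL);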

> > So the application already has (or needs to add) mechanisms in place for this. E.g. if a control plane event (from the interface proxy library or some other trigger) needs to be distributed across a single or multiple data plane lcores, the application already has (or needs to add) a mechanism for doing it. Adding a specific mechanism only in this library does not help all the other control plane interactions the application needs to handle. Actually it does the opposite: it requires that the application handles events from the interface proxy library in a specific way that is different from the way the application already handles other control plane events.
> >
> > So I'm also voting for simplicity: A single event queue, leaving it up to the application how to handle these events.

+1

> >
> > > > The other reason is that I believe we need message queueing for
> > > > other purposes in DPDK (ex: multi-process, telemetry).
> > >
> > > As far as I know, telemetry is using Linux socket fro IPC, I am not
> > > sure
> > > why do we need to standardize message queue infra? Becasue, each use
> > > case is different.
> >
  
Andrzej Ostruszka [C] April 4, 2020, 6:07 p.m. UTC | #13
First of all Thomas, thank you for taking the time to look at this.

Please scroll below for my comments.  It looks like we are going to
detour a bit towards a general config mechanism on top of which we can
"rebase" IF Proxy and other libs/apps.

Note - since I often like to think in concrete terms I might be
switching between general comments and specific examples.

On 4/3/20 11:42 PM, Thomas Monjalon wrote:
> Hi Andrzej,
[...]
> 10/03/2020 12:10, Andrzej Ostruszka:
[...]
>> The purpose of this library is to help with both of these tasks (as long
>> as they remain in domain of configuration available to the system).  In
>> other words, if DPDK application has some special needs, that cannot be
>> addressed by the normal system configuration utilities, then they need
>> to be solved by the application itself.
> 
> In any case, the application must be in the loop.
> The application should always remain in control.

OK - so let me try to understand what you mean here using the example of
this IF Proxy.  I wanted (and that is an explicit goal) to use iproute2
tools to configure DPDK ports.  In this context allowing the application to
have control might mean two things:

- the application can accept/ignore/deny a change requested by the user from
the shell in a dynamic way (based on some state/event/...)

- the application writer has a choice to bind or not, but once the proxy is
bound you simply accept the requests - just like you accept user
requests in e.g. the testpmd shell.

So ...

> When querying some information, nothing need to be controlled I guess.
> But when adjusting some configuration, the application must be able
> to be notified and decide which change is allowed.
> Of course, the application might allow being bypassed.

... it looks like you are talking about the first option ("bypass" is, I
guess, the second option).  In the concrete example of IF Proxy that
might be a bit problematic.  The user requests changes on the proxy interface,
the kernel accepts them, and the DPDK application is just notified about that
- it has no chance to deny the request (say "busy" or "not permitted" to the user).

Of course the app can ignore it (do nothing in the callback or drop the
event) and have a mismatch between the port and its proxy.  I'm not so sure
if this is what you had in mind.

> Currently this rule is not respected in the rte_mp IPC system.
> I think rte_mp and IF proxy should follow the same path,
> keeping the primary application process in control.
> 
> I would like not only secondary process and IF proxy be able to use
> this control path. It should be generic enough to allow any application
> (local or remote) be part of the control path, communicating with
> the DPDK application primary process.

That goal indeed sounds like ZMQ.  Has a consensus about that already been
reached?  On a general level that sounds good to me - the devil might be
in the details, e.g. trying to be simple and generic enough.  I'd also like
to solicit input here from other members of the community.

> As a summary, I propose to target the following goal:
> implement a user configuration path as a DPDK standard
> that the application can enable.
> 
> Do we agree that the exception packet path is out of scope?

Could you rephrase this question?  I'm not sure I understand it.  If you
wanted to say that we should:

- implement general config/notification mechanism
- rebase IF Proxy upon it (so instead of deciding whether this should be
callback/queue it simply uses this new scheme to deliver the change to
application)

then I'm fine with this.  If you meant something else then please explain.

> [...]
>> We create two proxy interfaces (here based on Tap driver) and bind the
>> ports to their proxies.  When user issues a command changing MTU for
>> Tap1 interface the library notes this and calls "mtu_change" callback
>> for the Port1.  Similarly when user adds an IPv4 address to the Tap2
>> interface "addr_add" callback is called for the Port2 and the same
>> happens for configuration of routing rule pointing to Tap2.
> 
> Will it work as well with TC flow configuration converted to rte_flow?

Not at the moment.  But it should be doable - as long as there is a good
mapping between them (I haven't checked).

>> Apart from
>> callbacks this library can notify about changes via adding events to
>> notification queues.  See below for more inforamtion about that and
>> a complete list of available callbacks.
> 
> There is choice between callback in a random context,
> or a read from a message queue in a controlled context.
> Second option looks better.

Note that a callback can be a simple enqueue to some ring.  From the IF
Proxy implementation point of view this is not much of a difference:
I notice the change and at that place in the code I can either call a callback
or queue an event.  Since I expect queues to be a popular choice their
support is added, but without it the user could register a callback that
does the enqueuing (one more indirection in the slow path).
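
As a minimal sketch of such an enqueuing callback (the callback signature
and the event record below are only assumed for illustration; the actual
rte_ifpx prototypes are the ones defined in this patch set):

  #include <stdint.h>
  #include <stdlib.h>
  #include <rte_ring.h>

  struct my_mtu_event {              /* application-defined event record */
      uint16_t port_id;
      uint16_t mtu;
  };

  static struct rte_ring *ev_ring;   /* created elsewhere with rte_ring_create() */

  /* Assumed mtu_change callback shape: package the change and enqueue it. */
  static int
  my_mtu_change_cb(uint16_t port_id, uint16_t mtu)
  {
      struct my_mtu_event *ev = malloc(sizeof(*ev));

      if (ev == NULL)
          return 0;                  /* let the library queue the event instead */
      ev->port_id = port_id;
      ev->mtu = mtu;
      if (rte_ring_enqueue(ev_ring, ev) != 0) {
          free(ev);
          return 0;
      }
      return 1;                      /* non-zero: event fully handled here */
  }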

Having said that - the only reasons callbacks are kept are the following (as
mentioned in the cover letter):

- I can easily implement a global action (true - in a random context); since
queues are "match all", each event will be added to all queues and cores
would have to decide which one of them performs the global action
- I can do a single/global preparation before queueing the event

But I guess this is rather a moot point since we are going in the
direction of a general config mechanism which IF Proxy would be using.

>> Please note that nothing has been mentioned about forwarding of the
>> packets between system and DPDK.  Since the proxies are normal DPDK
>> ports you can receive/send to them via usual RX/TX burst API.  However
>> since the library is not aware of the structure of packet processing
>> used by the application it cannot automatically forward the packets - it
>> is responsibility of the application to include proxy ports into its
>> packet processing engine.
> 
> So IF proxy does nothing special with packets, right?

Correct.

>> Although the library only helps you to identify proxy for given port
>> (and vice versa) and calls appropriate callbacks it does open some
>> interesting possibilities.  For example you can use the proxy ports to
>> forward packets for protocols that you do not wish to handle in DPDK
>> application to the system protocol stack and just listen to the
>> configuration changes - so that way you can "offload" handling of those
>> protocols to the system.
> 
> Note that when using a bifurcated driver (af_xdp or mlx),
> the exception path in the kernel is not going through DPDK.
> Moreover, no proxy is needed for device configuration in such case.

True for the link level info.  But if the application would also like to
have address/routing/neighbouring info then I guess a proxy would be
needed.  As for the bifurcated drivers - in one version of the library I
had an option to bind a port to itself.  The binding is there only to tell
the library which if_index is interesting and how to report the event (if_index
-> port_id).

[...]
>> This creates logical binding - as mentioned above there is no automatic
>> packet forwarding.  With this binding whenever user changes the state of
>> proxy interface in the system (link up/down, change mac/mtu, add/remove
>> IPv4/IPv6) you get appropriate notification for the bound port.
> 
> When configuring a port via DPDK API, is it mirrored automatically
> to the kernel device?

No it isn't.  It's one-way at the moment.  If we wanted it to be bidirectional
then I would have to plug in somewhere in eth_dev to monitor changes to
ports and request similar changes to the proxy.

[...]
>> and the actual logic used is: if there is callback registered then it is
>> called, if it returns non-zero then event is considered completed,
>> otherwise event is added to each configured notification queue.
>> That way application can update data structures that are safe to be
>> modified by single writer from within callback or do the common
>> preprocessing steps (if any needed) in callback and data that is
>> replicated can be updated during handling of queued events.
> 
> As explained above, the application must control every changes.
> 
> One issue is thread safety.
> The simplest model is to manage control path from a single thread
> in the primary process.
> 
> If we create an API to allow the application managing the control path
> from external requests, I think it should be a building block
> independent of IF proxy. Then IF proxy can plug into this subsystem.
> It would allow other control path mechanisms to co-exist.

I'm fine with this.

> [...]
>> It is worth to mention also that while typical case would be a 1-to-1
>> mapping between port and proxy, the 1-to-many mapping is also supported.
>> In that case related callbacks will be called for each port bound to
>> given proxy interface - it is application responsibility to define
>> semantic of such mapping (e.g. all changes apply to all ports, or link
>> changes apply to all but other are accepted in "round robin" fashion, or
>> some other logic).
> 
> I don't get the interest of one-to-many mapping.

That was a request during an early version of the library - with bridging in
mind.  However this is an "experimental" part of an "experimental"
library - I would not focus on this, as it might get removed if we don't
find a real use case for it.

> [...]
> 
> Thanks for the work.
> It seems there are some overlaps with telemetry and rte_mp channels.
> The same channel could be used also for dynamic tracing command
> or for remote control.
> Would you be OK to extend it to a global control subsystem,
> having IF proxy plugged in?

Yes.  I don't have a problem with that - however at the moment the
requirements/design are a bit vague, so let's discuss this more.

With regards
Andrzej Ostruszka
  
Andrzej Ostruszka [C] April 4, 2020, 6:30 p.m. UTC | #14
Thomas,

I have replied to the other mail; here I just want to confirm that I'm
fine with the proposed "general messaging" which other libraries (IF
Proxy included) could utilize.

See also below.

On 4/3/20 7:19 PM, Thomas Monjalon wrote:
> 02/04/2020 15:48, Andrzej Ostruszka [C]:
>> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
[...]
>> Would that be easier/better to register queue together with a bitmask of
>> event types that given queue is accepting?  Than during setup phase
>> application would select just one queue to handle "global" events and
>> the logic of event handling for lcores should be simplier.
>>
>> Let me know what you think.
> 
> I think we want to avoid complicate design.
> So let's choose between callback and message queue.
> I vote for message queue because it can handle any situation,
> and it allows to control the context of the event processing.
> The other reason is that I believe we need message queueing for
> other purposes in DPDK (ex: multi-process, telemetry).
> 
> You start thinking about complex message management.
> And I start thinking about other usages of message queueing.
> So I think it is the right time to introduce a generic messaging in DPDK.
> Note: the IPC rte_mp should be built on top of such generic messaging.

Do you also have inter-lcore communication in mind here?  Or just the
"external" world to "some DPDK controller/dispatcher", with how that is
passed to other cores being an application writer's problem?

> If you agree, we can start a new email thread to better discuss
> the generic messaging sub-system.

Yes, let's talk about that.

> I describe here the 3 properties I have in mind:
> 
> 1/ Message policy
> One very important rule in DPDK is to let the control to the application.
> So the messaging policy must be managed by the application via DPDK API.
> 
> 2/ Message queue
> It seems we should rely on ZeroMQ. Here is why:
> http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
> 
> 3/ Message format
> I am not sure whether we can manage with "simple strings", TLV,
> or should we use something more complex like protobuf?
  
Thomas Monjalon April 4, 2020, 7:51 p.m. UTC | #15
04/04/2020 20:07, Andrzej Ostruszka [C]:
> On 4/3/20 11:42 PM, Thomas Monjalon wrote:
> > 10/03/2020 12:10, Andrzej Ostruszka:
> [...]
> >> The purpose of this library is to help with both of these tasks (as long
> >> as they remain in domain of configuration available to the system).  In
> >> other words, if DPDK application has some special needs, that cannot be
> >> addressed by the normal system configuration utilities, then they need
> >> to be solved by the application itself.
> > 
> > In any case, the application must be in the loop.
> > The application should always remain in control.
> 
> OK - so let me try to understand what you mean here on the example of
> this IF Proxy.  I wanted (and that is an explicit goal) to use iproute2
> tools to configure DPDK ports.  In this context allowing application to
> have control might mean two things:
> 
> - application can accept/ignore/deny change requested by the user from
> the shell in a dynamic way (based on some state/event/...)

Yes

> - application writer have a choice to bind or not, but once the proxy is
> bound you simply accept the requests - just like you accept user
> requests in e.g. testpmd shell.

No, the application may check each user request before accepting it.
And on that path, the application may need to adapt based on the user request.

> So ...
> 
> > When querying some information, nothing need to be controlled I guess.
> > But when adjusting some configuration, the application must be able
> > to be notified and decide which change is allowed.
> > Of course, the application might allow being bypassed.
> 
> ... it looks like you are talking about the first option ("bypass" is I
> guess the second option).  In the concrete example of IF Proxy that
> might be a bit problematic.  User requests changes on proxy interface
> kernel accepts them and the DPDK application is just notified about that
> - has no chance to deny request (say "busy" or "not permitted" to the user).

The application must return a decision.

> Of course app can ignore it (do nothing in the callback or drop the
> event) and have a mismatch between port and its proxy.  I'm not so sure
> if this is what you had in mind.

If the change is denied, the proxy may roll back the change in the kernel.

> > Currently this rule is not respected in the rte_mp IPC system.
> > I think rte_mp and IF proxy should follow the same path,
> > keeping the primary application process in control.
> > 
> > I would like not only secondary process and IF proxy be able to use
> > this control path. It should be generic enough to allow any application
> > (local or remote) be part of the control path, communicating with
> > the DPDK application primary process.
> 
> That goal sounds indeed like ZMQ.  Is the consensus about that already
> reached?  On a general level that sounds good to me - the devil might be
> in details, like e.g. trying to be simple and generic enough.  I'd like
> also to solicit here input from other members of the community.

No, there is no consensus. Integrating ZMQ is a very fresh idea.
As said, we should open this major topic in a separate email thread.

> > As a summary, I propose to target the following goal:
> > implement a user configuration path as a DPDK standard
> > that the application can enable.
> > 
> > Do we agree that the exception packet path is out of scope?
> 
> Could you rephrase this question?  I'm not sure I understand it.  If you
> wanted to say that we should:
> 
> - implement general config/notification mechanism
> - rebase IF Proxy upon it (so instead of deciding whether this should be
> callback/queue it simply uses this new scheme to deliver the change to
> application)

Yes this is what I mean in general.

> then I'm fine with this.  If you meant something else then please explain.

I am asking for confirmation that IF proxy is not managing an exception datapath.
You said the netdev used as a proxy can also be used to send/receive
packets to/from the kernel stack. But that is out of the scope of IF proxy features,
right?

> > [...]
> >> We create two proxy interfaces (here based on Tap driver) and bind the
> >> ports to their proxies.  When user issues a command changing MTU for
> >> Tap1 interface the library notes this and calls "mtu_change" callback
> >> for the Port1.  Similarly when user adds an IPv4 address to the Tap2
> >> interface "addr_add" callback is called for the Port2 and the same
> >> happens for configuration of routing rule pointing to Tap2.
> > 
> > Will it work as well with TC flow configuration converted to rte_flow?
> 
> Not at the moment.  But should be doable - as long as there is good
> mapping between them (I haven't checked).

That's an interesting challenge.

> >> Apart from
> >> callbacks this library can notify about changes via adding events to
> >> notification queues.  See below for more inforamtion about that and
> >> a complete list of available callbacks.
> > 
> > There is choice between callback in a random context,
> > or a read from a message queue in a controlled context.
> > Second option looks better.
> 
> Note that callback can be a simple enqueue to some ring.  From the IF
> Proxy implementation point of view - this is not much of a difference.
> I notice the change and in that place in code I can either call callback
> or queue an event.  Since I expect queues to be a popular choice its
> support is added but without this user could register callback that
> would be enqueuing (one more indirection in slow path).
> 
> Having said that - the only reason callbacks are kept is that (as
> mentioned in cover):
> 
> - I can easily implement global action (true - in random context), since
> queues are "match all" each event will be added to all queues and cores
> would have to decide which one of them performs the global action
> - I can do single/global preparation before queueing event
> 
> But I guess this is rather a mute point since we are going in the
> direction of general config which IF Proxy would be using.
> 
> >> Please note that nothing has been mentioned about forwarding of the
> >> packets between system and DPDK.  Since the proxies are normal DPDK
> >> ports you can receive/send to them via usual RX/TX burst API.  However
> >> since the library is not aware of the structure of packet processing
> >> used by the application it cannot automatically forward the packets - it
> >> is responsibility of the application to include proxy ports into its
> >> packet processing engine.
> > 
> > So IF proxy does nothing special with packets, right?
> 
> Correct.
> 
> >> Although the library only helps you to identify proxy for given port
> >> (and vice versa) and calls appropriate callbacks it does open some
> >> interesting possibilities.  For example you can use the proxy ports to
> >> forward packets for protocols that you do not wish to handle in DPDK
> >> application to the system protocol stack and just listen to the
> >> configuration changes - so that way you can "offload" handling of those
> >> protocols to the system.
> > 
> > Note that when using a bifurcated driver (af_xdp or mlx),
> > the exception path in the kernel is not going through DPDK.
> > Moreover, no proxy is needed for device configuration in such case.
> 
> True for the link level info.  But if application would like to have
> also address/routing/neighbouring info then I guess proxy would be
> needed.  As for the bifurcated drivers - in one version of the library I
> had an option to bind port to itself.  The binding is there only to tell
> library which if_index is interesting and how to report event (if_index
> -> port_id)

Yes, you're right, and that's interesting.
I wonder whether we should have a flag in the proxy port to mark
the bifurcated model.

> [...]
> >> This creates logical binding - as mentioned above there is no automatic
> >> packet forwarding.  With this binding whenever user changes the state of
> >> proxy interface in the system (link up/down, change mac/mtu, add/remove
> >> IPv4/IPv6) you get appropriate notification for the bound port.
> > 
> > When configuring a port via DPDK API, is it mirrored automatically
> > to the kernel device?
> 
> No it isn't.  It's one way at the moment.  If we wanted bidirectional
> then I would have to plug in somewhere in eth_dev to monitor changes to
> ports and request similar changes to the proxy.

OK, I think it is a gap we need to fill.
A bidirectional way looks mandatory to me.
Given that the application needs to be aware of the proxy binding,
can we have an explicit mechanism to notify the proxy of any change?
Or do we want to use something like the failsafe PMD to pair the port and
its proxy as sub-devices of a main one, dispatching all changes?

> [...]
> >> and the actual logic used is: if there is callback registered then it is
> >> called, if it returns non-zero then event is considered completed,
> >> otherwise event is added to each configured notification queue.
> >> That way application can update data structures that are safe to be
> >> modified by single writer from within callback or do the common
> >> preprocessing steps (if any needed) in callback and data that is
> >> replicated can be updated during handling of queued events.
> > 
> > As explained above, the application must control every changes.
> > 
> > One issue is thread safety.
> > The simplest model is to manage control path from a single thread
> > in the primary process.
> > 
> > If we create an API to allow the application managing the control path
> > from external requests, I think it should be a building block
> > independent of IF proxy. Then IF proxy can plug into this subsystem.
> > It would allow other control path mechanisms to co-exist.
> 
> I'm fine with this.
> 
> > [...]
> >> It is worth to mention also that while typical case would be a 1-to-1
> >> mapping between port and proxy, the 1-to-many mapping is also supported.
> >> In that case related callbacks will be called for each port bound to
> >> given proxy interface - it is application responsibility to define
> >> semantic of such mapping (e.g. all changes apply to all ports, or link
> >> changes apply to all but other are accepted in "round robin" fashion, or
> >> some other logic).
> > 
> > I don't get the interest of one-to-many mapping.
> 
> That was a request during early version of library - with bridging in
> mind.  However this is an "experimental" part of an "experimental"
> library - I would not focus on this, as it might get removed if we don't
> find a real use case for that.

If we don't have a real use case, I suggest not implementing it,
but keeping some room for it in the design.

> > [...]
> > 
> > Thanks for the work.
> > It seems there are some overlaps with telemetry and rte_mp channels.
> > The same channel could be used also for dynamic tracing command
> > or for remote control.
> > Would you be OK to extend it to a global control subsystem,
> > having IF proxy plugged in?
> 
> Yes.  I don't have a problem with that - however at the moment the
> requirements/design are a bit vague, so let's discuss this more.

Thanks a lot
  
Thomas Monjalon April 4, 2020, 7:58 p.m. UTC | #16
04/04/2020 20:30, Andrzej Ostruszka [C]:
> Thomas,
> 
> I have replied to the other mail, here I just want to confirm, that I'm
> fine with the proposed "general messaging" which other libraries (IF
> Proxy including) could utilize.
> 
> See also below.
> 
> On 4/3/20 7:19 PM, Thomas Monjalon wrote:
> > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> [...]
> >> Would that be easier/better to register queue together with a bitmask of
> >> event types that given queue is accepting?  Than during setup phase
> >> application would select just one queue to handle "global" events and
> >> the logic of event handling for lcores should be simplier.
> >>
> >> Let me know what you think.
> > 
> > I think we want to avoid complicate design.
> > So let's choose between callback and message queue.
> > I vote for message queue because it can handle any situation,
> > and it allows to control the context of the event processing.
> > The other reason is that I believe we need message queueing for
> > other purposes in DPDK (ex: multi-process, telemetry).
> > 
> > You start thinking about complex message management.
> > And I start thinking about other usages of message queueing.
> > So I think it is the right time to introduce a generic messaging in DPDK.
> > Note: the IPC rte_mp should be built on top of such generic messaging.
> 
> Do you have also inter-lcore communication in mind here?  Or just
> "external" world to "some DPDK controller/dispatcher" and how that is
> passed to other cores is an application writer problem.

I was thinking of communication with:
	- DPDK event from random context
	- secondary process
	- external application
	- remote application

In all cases, I thought the message receiver would be the master core.
But you are probably right that targeting a specific core may be
interesting.

> > If you agree, we can start a new email thread to better discuss
> > the generic messaging sub-system.
> 
> Yes, lets talk about that.

Let's start with the requirements (as above).
I'll write such an email soon.


> > I describe here the 3 properties I have in mind:
> > 
> > 1/ Message policy
> > One very important rule in DPDK is to let the control to the application.
> > So the messaging policy must be managed by the application via DPDK API.
> > 
> > 2/ Message queue
> > It seems we should rely on ZeroMQ. Here is why:
> > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
> > 
> > 3/ Message format
> > I am not sure whether we can manage with "simple strings", TLV,
> > or should we use something more complex like protobuf?
  
Morten Brørup April 10, 2020, 10:03 a.m. UTC | #17
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Saturday, April 4, 2020 9:58 PM
> 
> 04/04/2020 20:30, Andrzej Ostruszka [C]:
> > Thomas,
> >
> > I have replied to the other mail, here I just want to confirm, that
> I'm
> > fine with the proposed "general messaging" which other libraries (IF
> > Proxy including) could utilize.
> >
> > See also below.
> >
> > On 4/3/20 7:19 PM, Thomas Monjalon wrote:
> > > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > [...]
> > >> Would that be easier/better to register queue together with a
> bitmask of
> > >> event types that given queue is accepting?  Than during setup
> phase
> > >> application would select just one queue to handle "global" events
> and
> > >> the logic of event handling for lcores should be simplier.
> > >>
> > >> Let me know what you think.
> > >
> > > I think we want to avoid complicate design.
> > > So let's choose between callback and message queue.
> > > I vote for message queue because it can handle any situation,
> > > and it allows to control the context of the event processing.
> > > The other reason is that I believe we need message queueing for
> > > other purposes in DPDK (ex: multi-process, telemetry).
> > >
> > > You start thinking about complex message management.
> > > And I start thinking about other usages of message queueing.
> > > So I think it is the right time to introduce a generic messaging in
> DPDK.
> > > Note: the IPC rte_mp should be built on top of such generic
> messaging.
> >
> > Do you have also inter-lcore communication in mind here?  Or just
> > "external" world to "some DPDK controller/dispatcher" and how that is
> > passed to other cores is an application writer problem.
> 
> I was thinking at communication with:
> 	- DPDK event from random context
> 	- secondary process
> 	- external application
> 	- remote application
> 
> In all cases, I thought the message receiver would be the master core.

That would also be my assumption.

> But you are probably right that targeting a specific core may be
> interesting.
> 
> > > If you agree, we can start a new email thread to better discuss
> > > the generic messaging sub-system.
> >
> > Yes, lets talk about that.
> 
> Let's start with the requirements (as above).
> I'll write such email soon.
> 
> 
> > > I describe here the 3 properties I have in mind:
> > >
> > > 1/ Message policy
> > > One very important rule in DPDK is to let the control to the
> application.
> > > So the messaging policy must be managed by the application via DPDK
> API.
> > >
> > > 2/ Message queue
> > > It seems we should rely on ZeroMQ. Here is why:
> > > http://zguide.zeromq.org/page:all#Why-We-Needed-ZeroMQ
> > >
> > > 3/ Message format
> > > I am not sure whether we can manage with "simple strings", TLV,
> > > or should we use something more complex like protobuf?
> 
> 
>
  
Morten Brørup April 10, 2020, 10:41 a.m. UTC | #18
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Jerin Jacob
> Sent: Saturday, April 4, 2020 12:19 PM
> 

[...]

> Since it is the control plane, slow path traffic without any
> performance requirement,

This is incorrect.

Production equipment certainly has control plane performance requirements!

If there were no control plane requirements, adding a single route could take 1 second. Then loading the internet route table of 850k IPv4 routes would take 10 full days. Good luck selling a router with this control plane performance.
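
(For the record: 850k routes at 1 second each is 850,000 seconds, i.e.
roughly 9.8 days, so the "10 full days" ballpark above holds.)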

> Generalize the message comes for zero cost.
> +1 for standardizing the message if every subsystem planning to do the
> same.

+1 for generalizing and standardizing, if it can be done right.
  
Jerin Jacob April 10, 2020, 12:28 p.m. UTC | #19
On Fri, Apr 10, 2020 at 3:33 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Saturday, April 4, 2020 9:58 PM
> >
> > 04/04/2020 20:30, Andrzej Ostruszka [C]:
> > > Thomas,
> > >
> > > I have replied to the other mail, here I just want to confirm, that
> > I'm
> > > fine with the proposed "general messaging" which other libraries (IF
> > > Proxy including) could utilize.
> > >
> > > See also below.
> > >
> > > On 4/3/20 7:19 PM, Thomas Monjalon wrote:
> > > > 02/04/2020 15:48, Andrzej Ostruszka [C]:
> > > >> On 3/26/20 6:42 PM, Andrzej Ostruszka wrote:
> > > [...]
> > > >> Would that be easier/better to register queue together with a
> > bitmask of
> > > >> event types that given queue is accepting?  Than during setup
> > phase
> > > >> application would select just one queue to handle "global" events
> > and
> > > >> the logic of event handling for lcores should be simplier.
> > > >>
> > > >> Let me know what you think.
> > > >
> > > > I think we want to avoid complicate design.
> > > > So let's choose between callback and message queue.
> > > > I vote for message queue because it can handle any situation,
> > > > and it allows to control the context of the event processing.
> > > > The other reason is that I believe we need message queueing for
> > > > other purposes in DPDK (ex: multi-process, telemetry).
> > > >
> > > > You start thinking about complex message management.
> > > > And I start thinking about other usages of message queueing.
> > > > So I think it is the right time to introduce a generic messaging in
> > DPDK.
> > > > Note: the IPC rte_mp should be built on top of such generic
> > messaging.
> > >
> > > Do you have also inter-lcore communication in mind here?  Or just
> > > "external" world to "some DPDK controller/dispatcher" and how that is
> > > passed to other cores is an application writer problem.
> >
> > I was thinking at communication with:
> >       - DPDK event from random context
> >       - secondary process
> >       - external application
> >       - remote application
> >
> > In all cases, I thought the message receiver would be the master core.
>
> That would also be my assumption.

IMO, DPDK should not dictate that. The library can provide an API for housekeeping.
It is up to the application to call it on DPDK isolated cores or from the
control thread.
I think we only need to dictate that it should be used by a single consumer.