
[RFC,0/4] SocketPair Broker support for vhost and virtio-user.

Message ID 20210317202530.4145673-1-i.maximets@ovn.org (mailing list archive)

Message

Ilya Maximets March 17, 2021, 8:25 p.m. UTC
TL;DR:
  Managing socket files is too much fun. :)  And here is how this
  could be improved:
    https://github.com/igsilya/one-socket
    https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst
  In particular for vhost-user case.

In modern virtualization setups there are tens or hundreds of
different socket files for different purposes: sockets to manage
various daemons, vhost-user sockets for various virtual devices,
memif sockets for memif network interfaces, and so on.

In order to make things work in containerized environments, software
systems have to share these sockets with containers.  In most cases
this sharing is implemented as a shared directory mounted inside the
container, because socket files could be re-created at runtime or
might not even exist at container startup, e.g. if they are created
by the application inside the container.

Even more configuration tricks are required in order to share some
sockets between different containers, and not only with the host,
e.g. to create service chains.
And some housekeeping is usually required for applications in case
the socket server terminated abnormally and left socket files on
the file system:
 "failed to bind to vhu: Address already in use; remove it and try again"

Additionally, all applications (system and user's!) have to follow
naming conventions and place socket files in a particular location
on the file system to make things work.

In particular, this applies to vhost-user sockets.

This patch-set aims to eliminate most of the inconveniences by
leveraging an infrastructure service provided by a SocketPair Broker.

*SocketPair Broker* is a daemon that mediates establishment of direct
socket-based connections between clients.

*One Socket* is a reference implementation of a SocketPair Broker
Daemon, SocketPair Broker Protocol and a helper library for client
applications (libspbroker):

  https://github.com/igsilya/one-socket

It's fully functional, but not yet ready for production use.  See
the 'todo' section in README.rst in the one-socket repository.

Basically, it's a daemon that listens on a single unix socket (the
broker socket) and accepts clients.  A client connects and provides
a 'key'.  If two clients provide the same 'key', the One Socket
daemon creates a pair of connected sockets with socketpair() and
sends one end of this pair to each of the two clients.  At this
point the two clients have a direct communication channel between
them.  They disconnect from the broker and continue to operate and
communicate normally.
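The kernel mechanism this relies on is socketpair(2) plus SCM_RIGHTS
ancillary data.  A minimal sketch of the fd-passing step (illustrative
helpers, not code from One Socket or libspbroker):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one file descriptor over a unix socket with SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof ctrl,
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd(). */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof ctrl,
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In broker terms: when two clients present the same key, the broker
would socketpair() a fresh pair, send_fd() one end to each client, and
close both ends locally.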

A workflow overview with pictures is available here:

  https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker.rst

Communication with the broker is based on the SocketPair Broker Protocol:

  https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker-proto-spec.rst


This patch-set extends the vhost library, vhost pmd and virtio-user
pmd to support SocketPair Broker as one of the connection methods.
Usage example:

  # Starting a One Socket daemon with socket './one.socket':
  $ ONE_SOCKET_PATH=./one.socket ./one-socket

  # Starting testpmd #1 with virtio-user device in server mode:
  $ dpdk-testpmd --no-pci --in-memory --single-file-segments \
      --vdev="net_virtio_user,path=./one.socket,broker-key=MY-KEY,server=1"

  # Starting testpmd #2 with vhost pmd in client mode:
  $ dpdk-testpmd --no-pci --in-memory --single-file-segments \
      --vdev="eth_vhost0,iface=./one.socket,broker-key=MY-KEY,client=1"

Details on how to build and install One Socket are in README.rst in
the one-socket repository.

The DPDK side is the first step of the implementation.  Once
available in DPDK, support could easily be added to Open vSwitch,
VPP or any other DPDK-based application.  The same support could be
added to QEMU (a volunteer for this part has been found).

Since the SocketPair Broker is completely independent of the purpose
the connection will be used for, it has the potential to unify and
replace all one-to-one unix socket connections on a host.  This one
persistent broker socket could be passed to any container and used
by any application, greatly simplifying system management.

Any feedback or suggestions on any component of this solution,
including this patch-set, the One Socket daemon, the SocketPair
Broker Protocol and the libspbroker library, are very welcome.

*Note* about the patch set:

The first patch in the series is a *bug* fix, so it should be
considered even outside of this series.  It fixes unregistering of
the listening socket, which never happens in the current code.

The virtio-user part of the series heavily depends on this bug fix,
since the broker connection, unlike a listening socket, will not
persist and will generate lots of interrupts if not unregistered.

Ilya Maximets (4):
  net/virtio: fix interrupt unregistering for listening socket
  vhost: add support for SocketPair Broker
  net/vhost: add support for SocketPair Broker
  net/virtio: add support for SocketPair Broker

 doc/guides/nics/vhost.rst                     |   5 +
 doc/guides/nics/virtio.rst                    |   5 +
 doc/guides/prog_guide/vhost_lib.rst           |  10 +
 drivers/net/vhost/rte_eth_vhost.c             |  42 ++-
 drivers/net/virtio/meson.build                |   6 +
 drivers/net/virtio/virtio_user/vhost_user.c   | 122 ++++++++-
 .../net/virtio/virtio_user/virtio_user_dev.c  | 142 +++++++---
 .../net/virtio/virtio_user/virtio_user_dev.h  |   6 +-
 drivers/net/virtio/virtio_user_ethdev.c       |  30 ++-
 lib/librte_vhost/meson.build                  |   7 +
 lib/librte_vhost/rte_vhost.h                  |   1 +
 lib/librte_vhost/socket.c                     | 245 ++++++++++++++++--
 12 files changed, 550 insertions(+), 71 deletions(-)

Comments

Stefan Hajnoczi March 18, 2021, 5:52 p.m. UTC | #1
On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
Hi,
Some questions to understand the problems that SocketPair Broker solves:

> Even more configuration tricks required in order to share some sockets
> between different containers and not only with the host, e.g. to
> create service chains.

How does SocketPair Broker solve this? I guess the idea is that
SocketPair Broker must be started before other containers. That way
applications don't need to sleep and reconnect when a socket isn't
available yet.

On the other hand, the SocketPair Broker might be unavailable (OOM
killer, crash, etc), so applications still need to sleep and reconnect
to the broker itself. I'm not sure the problem has actually been solved
unless there is a reason why the broker is always guaranteed to be
available?

> And some housekeeping usually required for applications in case the
> socket server terminated abnormally and socket files left on a file
> system:
>  "failed to bind to vhu: Address already in use; remove it and try again"

QEMU avoids this by unlinking before binding. The drawback is that users
might accidentally hijack an existing listen socket, but that can be
solved with a pidfile.
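One way to implement the pidfile-style guard mentioned here is an
flock(2)-protected lock file next to the socket: if the lock can be
taken, no other live instance owns the path, so any leftover socket
file is stale and can be unlinked safely.  A sketch of that idea
(illustrative, not QEMU's actual implementation):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Take an exclusive lock on <path>.lock.  The lock is released
 * automatically when the process dies, so if we can take it, any
 * existing socket file at <path> is stale and safe to unlink. */
static int listen_unix_guarded(const char *path)
{
    char lock_path[sizeof(((struct sockaddr_un *)0)->sun_path) + 8];
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int lock_fd, sock;

    snprintf(lock_path, sizeof lock_path, "%s.lock", path);
    lock_fd = open(lock_path, O_RDWR | O_CREAT, 0600);
    if (lock_fd < 0 || flock(lock_fd, LOCK_EX | LOCK_NB) < 0)
        return -1;                  /* another instance is alive */

    unlink(path);                   /* stale file, safe to remove */
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(sock, 1) < 0)
        return -1;
    return sock;
}
```

A second caller fails at the flock() step instead of hijacking the
listen socket, which is the property a plain unlink-before-bind lacks.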

> Additionally, all applications (system and user's!) should follow
> naming conventions and place socket files in particular location on a
> file system to make things work.

Does SocketPair Broker solve this? Applications now need to use a naming
convention for keys, so it seems like this issue has not been
eliminated.

> This patch-set aims to eliminate most of the inconveniences by
> leveraging an infrastructure service provided by a SocketPair Broker.

I don't understand yet why this is useful for vhost-user, where the
creation of the vhost-user device backend and its use by a VMM are
closely managed by one piece of software:

1. Unlink the socket path.
2. Create, bind, and listen on the socket path.
3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
   RPC, spawn a process, etc) and pass in the listen fd.
4. In the meantime the VMM can open the socket path and call connect(2).
   As soon as the vhost-user device backend calls accept(2) the
   connection will proceed (there is no need for sleeping).

This approach works across containers without a broker.
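For reference, steps 1-4 can be sketched as follows.  The key point is
that connect(2) completes against the listen backlog, so the VMM never
needs to sleep before the backend gets around to accept(2)
(illustrative code, not from any particular VMM or backend):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Steps 1-2: the management software creates the listen socket. */
static int make_listen_sock(const char *path)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int lfd = socket(AF_UNIX, SOCK_STREAM, 0);

    unlink(path);                                    /* step 1 */
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 16) < 0)                         /* step 2 */
        return -1;
    /* Step 3: this fd is what gets handed to the device backend. */
    return lfd;
}
```

Step 4 is then an ordinary connect() from the VMM, queued by the
kernel until the backend accepts.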

BTW what is the security model of the broker? Unlike pathname UNIX
domain sockets there is no ownership permission check.

Stefan
Ilya Maximets March 18, 2021, 7:47 p.m. UTC | #2
On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> Hi,
> Some questions to understand the problems that SocketPair Broker solves:
> 
>> Even more configuration tricks required in order to share some sockets
>> between different containers and not only with the host, e.g. to
>> create service chains.
> 
> How does SocketPair Broker solve this? I guess the idea is that
> SocketPair Broker must be started before other containers. That way
> applications don't need to sleep and reconnect when a socket isn't
> available yet.
> 
> On the other hand, the SocketPair Broker might be unavailable (OOM
> killer, crash, etc), so applications still need to sleep and reconnect
> to the broker itself. I'm not sure the problem has actually been solved
> unless there is a reason why the broker is always guaranteed to be
> available?

Hi, Stefan.  Thanks for your feedback!

The idea is to have the SocketPair Broker running right from the
boot of the host.  If it uses systemd socket-based service
activation, the socket should persist while systemd is alive, IIUC.
OOM kills, crashes and restarts of the broker should not affect the
existence of the socket, and systemd will re-spawn the service if
it's not running for any reason, without losing incoming connections.
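For illustration, such a socket-activated setup might look like the
following pair of units (unit names and paths are hypothetical, not
part of this series; with activation the daemon would receive the
listening fd from systemd instead of creating it from
ONE_SOCKET_PATH):

```ini
# one-socket.socket -- hypothetical unit; systemd owns the socket,
# so it survives crashes and restarts of the daemon itself.
[Socket]
ListenStream=/run/one.socket

[Install]
WantedBy=sockets.target

# one-socket.service -- started on the first incoming connection.
[Service]
ExecStart=/usr/bin/one-socket
Restart=on-failure
```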

> 
>> And some housekeeping usually required for applications in case the
>> socket server terminated abnormally and socket files left on a file
>> system:
>>  "failed to bind to vhu: Address already in use; remove it and try again"
> 
> QEMU avoids this by unlinking before binding. The drawback is that users
> might accidentally hijack an existing listen socket, but that can be
> solved with a pidfile.

How exactly could this be solved with a pidfile?  And what if it is
a different application that tries to create a socket on the same
path?  E.g. QEMU creates a socket (started in server mode) and the
user accidentally creates a dpdkvhostuser port in Open vSwitch
instead of dpdkvhostuserclient.  The rte_vhost library will then try
to bind to an existing socket file and will fail, and subsequently
port creation in OVS will fail.  We can't allow OVS to unlink files,
because that would give OVS users the ability to unlink random
sockets that OVS has access to, and we also have no idea whether it
was QEMU that created the file, a virtio-user application, or
someone else.
There are, probably, ways to detect whether any live process has
this socket open, but that sounds like too much for this purpose,
and I'm also not sure it's possible if the actual user is in a
different container.
So I don't see a good reliable way to detect these conditions.  This
falls on the shoulders of higher-level management software or the
user to clean these socket files up before adding ports.

> 
>> Additionally, all applications (system and user's!) should follow
>> naming conventions and place socket files in particular location on a
>> file system to make things work.
> 
> Does SocketPair Broker solve this? Applications now need to use a naming
> convention for keys, so it seems like this issue has not been
> eliminated.

The key is an arbitrary sequence of bytes, so it's hard to call it a
naming convention.  But they need to know the keys, you're right.
And to be careful I said "eliminates most of the inconveniences". :)

> 
>> This patch-set aims to eliminate most of the inconveniences by
>> leveraging an infrastructure service provided by a SocketPair Broker.
> 
> I don't understand yet why this is useful for vhost-user, where the
> creation of the vhost-user device backend and its use by a VMM are
> closely managed by one piece of software:
> 
> 1. Unlink the socket path.
> 2. Create, bind, and listen on the socket path.
> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>    RPC, spawn a process, etc) and pass in the listen fd.
> 4. In the meantime the VMM can open the socket path and call connect(2).
>    As soon as the vhost-user device backend calls accept(2) the
>    connection will proceed (there is no need for sleeping).
> 
> This approach works across containers without a broker.

Not sure if I fully understood the question here, but anyway.

This approach works fine if you know what application to run.
In the case of a k8s cluster, it might be a random DPDK application
with virtio-user ports running inside a container that wants to
have a network connection.  Also, this application needs to run
virtio-user in server mode, otherwise a restart of OVS will
require a restart of the application.  So, you basically need to
rely on a third-party application to create a socket with the right
name and in the correct location that is shared with the host, so
OVS can find it and connect.

In the VM world everything is much simpler, since you have
libvirt and QEMU that take care of all of this and are also
under the full control of the management software and the
system administrator.
In the case of a container with a "random" DPDK application inside,
there is no such entity that can help.  Of course, some solution
might be implemented in the docker/podman daemon to create and
manage outside-looking sockets for an application inside the
container, but that is not available today AFAIK, and I'm not
sure it ever will be.

> 
> BTW what is the security model of the broker? Unlike pathname UNIX
> domain sockets there is no ownership permission check.

I thought about this.  Yes, we have to allow connections to this
socket from a wide group of applications.  That might be a problem.
However, two applications need to know the (at most) 1024-byte key
in order to connect to each other.  This might be considered a
sufficient security model as long as these keys are not predictable.
Suggestions on how to make this more secure are welcome.

If it's really necessary to completely isolate some connections
from the others, one more broker could be started.  But I'm not
sure what the use case for that would be.

The broker itself closes the socketpair on its side, so the
connection between the two applications is direct and should be
secure, as long as the kernel doesn't allow other system processes
to intercept data on arbitrary unix sockets.

Best regards, Ilya Maximets.
Ilya Maximets March 18, 2021, 8:14 p.m. UTC | #3
On 3/18/21 8:47 PM, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>> Hi,
>> Some questions to understand the problems that SocketPair Broker solves:
>>
>>> Even more configuration tricks required in order to share some sockets
>>> between different containers and not only with the host, e.g. to
>>> create service chains.
>>
>> How does SocketPair Broker solve this? I guess the idea is that
>> SocketPair Broker must be started before other containers. That way
>> applications don't need to sleep and reconnect when a socket isn't
>> available yet.
>>
>> On the other hand, the SocketPair Broker might be unavailable (OOM
>> killer, crash, etc), so applications still need to sleep and reconnect
>> to the broker itself. I'm not sure the problem has actually been solved
>> unless there is a reason why the broker is always guaranteed to be
>> available?
> 
> Hi, Stefan.  Thanks for your feedback!
> 
> The idea is to have the SocketPair Broker running right from the
> boot of the host.  If it will use a systemd socket-based service
> activation, the socket should persist while systemd is alive, IIUC.
> OOM, crash and restart of the broker should not affect existence
> of the socket and systemd will spawn a service if it's not running
> for any reason without loosing incoming connections.
> 
>>
>>> And some housekeeping usually required for applications in case the
>>> socket server terminated abnormally and socket files left on a file
>>> system:
>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>
>> QEMU avoids this by unlinking before binding. The drawback is that users
>> might accidentally hijack an existing listen socket, but that can be
>> solved with a pidfile.
> 
> How exactly this could be solved with a pidfile?  And what if this is
> a different application that tries to create a socket on a same path?
> e.g. QEMU creates a socket (started in a server mode) and user
> accidentally created dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> to an existing socket file and will fail.  Subsequently port creation
> in OVS will fail.   We can't allow OVS to unlink files because this
> way OVS users will have ability to unlink random sockets that OVS has
> access to and we also has no idea if it's a QEMU that created a file
> or it was a virtio-user application or someone else.
> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions.  This
> falls on shoulders of a higher level management software or a user to
> clean these socket files up before adding ports.
> 
>>
>>> Additionally, all applications (system and user's!) should follow
>>> naming conventions and place socket files in particular location on a
>>> file system to make things work.
>>
>> Does SocketPair Broker solve this? Applications now need to use a naming
>> convention for keys, so it seems like this issue has not been
>> eliminated.
> 
> Key is an arbitrary sequence of bytes, so it's hard to call it a naming
> convention.  But they need to know keys, you're right.  And to be
> careful I said "eliminates most of the inconveniences". :)
> 
>>
>>> This patch-set aims to eliminate most of the inconveniences by
>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>
>> I don't understand yet why this is useful for vhost-user, where the
>> creation of the vhost-user device backend and its use by a VMM are
>> closely managed by one piece of software:
>>
>> 1. Unlink the socket path.
>> 2. Create, bind, and listen on the socket path.
>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>    RPC, spawn a process, etc) and pass in the listen fd.
>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>    As soon as the vhost-user device backend calls accept(2) the
>>    connection will proceed (there is no need for sleeping).
>>
>> This approach works across containers without a broker.
> 
> Not sure if I fully understood a question here, but anyway.
> 
> This approach works fine if you know what application to run.
> In case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container and want to
> have a network connection.  Also, this application needs to run
> virtio-user in server mode, otherwise restart of the OVS will
> require restart of the application.  So, you basically need to
> rely on a third-party application to create a socket with a right
> name and in a correct location that is shared with a host, so
> OVS can find it and connect.
> 
> In a VM world everything is much more simple, since you have
> a libvirt and QEMU that will take care of all of these stuff
> and which are also under full control of management software
> and a system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help.  Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.
> 
>>
>> BTW what is the security model of the broker? Unlike pathname UNIX
>> domain sockets there is no ownership permission check.
> 
> I thought about this.  Yes, we should allow connection to this socket
> for a wide group of applications.  That might be a problem.
> However, 2 applications need to know the 1024 (at most) byte key in
> order to connect to each other.  This might be considered as a
> sufficient security model in case these keys are not predictable.
> Suggestions on how to make this more secure are welcome.

Digging more into unix sockets, I think that the broker might use
SO_PEERCRED to identify at least the uid and gid of a client.
This way we can implement policies, e.g. one client might request
to be paired only with clients from the same group or from the
same user.

This is actually a great extension for the SocketPair Broker Protocol.

We might even use SO_PEERSEC to enforce even stricter policies
based on the SELinux context.
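For reference, the SO_PEERCRED lookup is a small getsockopt(2) call on
Linux.  A minimal sketch of how a broker could fetch the peer's
identity (illustrative helper, not part of the series):

```c
#define _GNU_SOURCE             /* for struct ucred on glibc */
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel for the identity of the peer on a unix socket. */
static int peer_uid_gid(int sock, uid_t *uid, gid_t *gid)
{
    struct ucred cred;
    socklen_t len = sizeof cred;

    if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    *uid = cred.uid;
    *gid = cred.gid;
    return 0;
}
```

The credentials are captured by the kernel at connect()/socketpair()
time, so a client cannot forge them.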

> 
> If it's really necessary to completely isolate some connections
> from other ones, one more broker could be started.  But I'm not
> sure what the case it should be.
> 
> Broker itself closes the socketpair on its side, so the connection
> between 2 applications is direct and should be secure as far as
> kernel doesn't allow other system processes to intercept data on
> arbitrary unix sockets.
> 
> Best regards, Ilya Maximets.
>
Marc-André Lureau March 19, 2021, 8:51 a.m. UTC | #4
Hi

On Thu, Mar 18, 2021 at 11:47 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > Hi,
> > Some questions to understand the problems that SocketPair Broker solves:
> >
> >> Even more configuration tricks required in order to share some sockets
> >> between different containers and not only with the host, e.g. to
> >> create service chains.
> >
> > How does SocketPair Broker solve this? I guess the idea is that
> > SocketPair Broker must be started before other containers. That way
> > applications don't need to sleep and reconnect when a socket isn't
> > available yet.
> >
> > On the other hand, the SocketPair Broker might be unavailable (OOM
> > killer, crash, etc), so applications still need to sleep and reconnect
> > to the broker itself. I'm not sure the problem has actually been solved
> > unless there is a reason why the broker is always guaranteed to be
> > available?
>
> Hi, Stefan.  Thanks for your feedback!
>
> The idea is to have the SocketPair Broker running right from the
> boot of the host.  If it will use a systemd socket-based service
> activation, the socket should persist while systemd is alive, IIUC.
> OOM, crash and restart of the broker should not affect existence
> of the socket and systemd will spawn a service if it's not running
> for any reason without loosing incoming connections.
>
>
Since the solution relies on systemd, why not use DBus to perform
authentication, service discovery and set up the socketpair between
peers?  You don't need an extra broker service in this case.

When the org.foo service shows up, call org.foo.Connect() to return the fd
of the client end (or throw an error etc)

I don't think establishing a socketpair connection between process
peers sharing some ID, without any other context, is going to be so
useful.  The relation is usually not symmetrical, and you usually
have associated setup/configuration details.
Ilya Maximets March 19, 2021, 11:25 a.m. UTC | #5
On 3/19/21 9:51 AM, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Mar 18, 2021 at 11:47 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> 
>     On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>     > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>     > Hi,
>     > Some questions to understand the problems that SocketPair Broker solves:
>     >
>     >> Even more configuration tricks required in order to share some sockets
>     >> between different containers and not only with the host, e.g. to
>     >> create service chains.
>     >
>     > How does SocketPair Broker solve this? I guess the idea is that
>     > SocketPair Broker must be started before other containers. That way
>     > applications don't need to sleep and reconnect when a socket isn't
>     > available yet.
>     >
>     > On the other hand, the SocketPair Broker might be unavailable (OOM
>     > killer, crash, etc), so applications still need to sleep and reconnect
>     > to the broker itself. I'm not sure the problem has actually been solved
>     > unless there is a reason why the broker is always guaranteed to be
>     > available?
> 
>     Hi, Stefan.  Thanks for your feedback!
> 
>     The idea is to have the SocketPair Broker running right from the
>     boot of the host.  If it will use a systemd socket-based service
>     activation, the socket should persist while systemd is alive, IIUC.
>     OOM, crash and restart of the broker should not affect existence
>     of the socket and systemd will spawn a service if it's not running
>     for any reason without loosing incoming connections.
> 
> 
> Since the solution relies on systemd, why not use DBus to perform
> authentication, service discovery and setup the socketpair between
> peers? You don't need an extra service broker in this case.
> 
> When the org.foo service shows up, call org.foo.Connect() to return
> the fd of the client end (or throw an error etc)

Yes, that's a possible solution.  And I even thought about running
the SocketPair Broker as a DBus service (it's in the 'todo' list
with a question mark).  However, there are a few issues with a
DBus-based solution:

1. I'd like not to tie the solution to systemd, because it's
   not always required.  There are cases where you don't really
   need a persistent and never-changing socket file.  For example,
   the DPDK implementation for vhost and virtio-user in this patch
   set will work just fine, i.e. it will re-connect as soon as the
   socket is available.
   Also, socket-based activation is a cool feature, but it's not
   the only way to make a socket file appear before starting a
   certain application.

2. It should be easier for a developer of an existing client-server
   application to just use a different socket, compared to learning
   how to use DBus and how to integrate it into the application.
   Especially, it's much easier to use another socket if they want
   to keep the traditional way of connecting as an alternative to
   connecting via a SocketPair Broker.

3. Unclear security implications.  I tried to research how to use
   the host DBus from a container and I didn't find a convenient
   way to do that.  Please point me to the correct documentation
   if I missed something.  The solutions I managed to google
   include mounting /run/user/<user-id> or the dbus sessions
   directory into the container and copying dbus configuration
   files.
   Some articles also point out that communication is only possible
   from privileged containers.  To be clear, I know very little
   about DBus, so any pointers on how to use it conveniently from
   inside a container will be appreciated.

> 
> I don't think establishing socketpair connection between process
> peers sharing some ID, without any other context, is going to be
> so useful. The relation is usually not symmetrical, and you
> usually have associated setup/configuration details.

There is an "operation mode" that the user can specify in a GET_PAIR
request to the SocketPair Broker.  It might be NONE, CLIENT or
SERVER.  The broker will pair users that provided NONE together, as
they likely want a symmetrical connection.  And it will pair users
that declared themselves as CLIENTs with users that specified
SERVER.  This ensures that in asymmetrical connections there will
be no two "clients" or two "servers".
See:
  https://github.com/igsilya/one-socket/blob/main/doc/socketpair-broker-proto-spec.rst
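The pairing rule above boils down to a tiny predicate.  A sketch of
the matching logic (mode names taken from the spec; the code itself
is illustrative, not from One Socket):

```c
#include <assert.h>
#include <stdbool.h>

enum spb_mode { SPB_NONE, SPB_CLIENT, SPB_SERVER };

/* Symmetric peers (NONE) only pair with each other; asymmetric
 * peers only pair with the opposite role. */
static bool spb_modes_match(enum spb_mode a, enum spb_mode b)
{
    if (a == SPB_NONE || b == SPB_NONE)
        return a == b;
    return a != b;      /* CLIENT pairs only with SERVER */
}
```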

If you have any idea what else could be added to the protocol to
make it better, I'd love to hear it.

Best regards, Ilya Maximets.
Stefan Hajnoczi March 19, 2021, 2:05 p.m. UTC | #6
On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> And some housekeeping usually required for applications in case the
> >> socket server terminated abnormally and socket files left on a file
> >> system:
> >>  "failed to bind to vhu: Address already in use; remove it and try again"
> > 
> > QEMU avoids this by unlinking before binding. The drawback is that users
> > might accidentally hijack an existing listen socket, but that can be
> > solved with a pidfile.
> 
> How exactly this could be solved with a pidfile?

A pidfile prevents two instances of the same service from running at the
same time.

The same effect can be achieved by the container orchestrator,
systemd, etc. too, because it refuses to run the same service twice.

> And what if this is
> a different application that tries to create a socket on a same path?
> e.g. QEMU creates a socket (started in a server mode) and user
> accidentally created dpdkvhostuser port in Open vSwitch instead of
> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> to an existing socket file and will fail.  Subsequently port creation
> in OVS will fail.   We can't allow OVS to unlink files because this
> way OVS users will have ability to unlink random sockets that OVS has
> access to and we also has no idea if it's a QEMU that created a file
> or it was a virtio-user application or someone else.

If rte_vhost unlinks the socket then the user will find that networking
doesn't work. They can either hot unplug the QEMU vhost-user-net device
or restart QEMU, depending on whether they need to keep the guest
running or not. This is a misconfiguration that is recoverable.

Regarding letting OVS unlink files, I agree that it shouldn't if this
creates a security issue.  I don't know the security model of OVS.

> There are, probably, ways to detect if there is any alive process that
> has this socket open, but that sounds like too much for this purpose,
> also I'm not sure if it's possible if actual user is in a different
> container.
> So I don't see a good reliable way to detect these conditions.  This
> falls on shoulders of a higher level management software or a user to
> clean these socket files up before adding ports.

Does OVS always run in the same net namespace (pod) as the DPDK
application? If yes, then abstract AF_UNIX sockets can be used. Abstract
AF_UNIX sockets don't have a filesystem path and the socket address
disappears when there is no process listening anymore.
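For reference, an abstract-namespace socket (Linux-only) is bound by
putting a leading NUL in sun_path.  A minimal sketch (illustrative,
not from OVS or DPDK):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind a Linux abstract-namespace socket: the name lives only in
 * the kernel and disappears when the last fd to it is closed. */
static int listen_abstract(const char *name)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    socklen_t len = offsetof(struct sockaddr_un, sun_path)
                    + 1 + strlen(name);
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);

    addr.sun_path[0] = '\0';                /* abstract namespace */
    strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
    if (bind(sock, (struct sockaddr *)&addr, len) < 0 ||
        listen(sock, 1) < 0)
        return -1;
    return sock;
}
```

No file ever appears on the filesystem, so there is nothing to clean
up after an abnormal exit.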

> >> This patch-set aims to eliminate most of the inconveniences by
> >> leveraging an infrastructure service provided by a SocketPair Broker.
> > 
> > I don't understand yet why this is useful for vhost-user, where the
> > creation of the vhost-user device backend and its use by a VMM are
> > closely managed by one piece of software:
> > 
> > 1. Unlink the socket path.
> > 2. Create, bind, and listen on the socket path.
> > 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >    RPC, spawn a process, etc) and pass in the listen fd.
> > 4. In the meantime the VMM can open the socket path and call connect(2).
> >    As soon as the vhost-user device backend calls accept(2) the
> >    connection will proceed (there is no need for sleeping).
> > 
> > This approach works across containers without a broker.
> 
> Not sure if I fully understood a question here, but anyway.
>
> This approach works fine if you know what application to run.
> In case of a k8s cluster, it might be a random DPDK application
> with virtio-user ports running inside a container and want to
> have a network connection.  Also, this application needs to run
> virtio-user in server mode, otherwise restart of the OVS will
> require restart of the application.  So, you basically need to
> rely on a third-party application to create a socket with a right
> name and in a correct location that is shared with a host, so
> OVS can find it and connect.
> 
> In a VM world everything is much more simple, since you have
> a libvirt and QEMU that will take care of all of these stuff
> and which are also under full control of management software
> and a system administrator.
> In case of a container with a "random" DPDK application inside
> there is no such entity that can help.  Of course, some solution
> might be implemented in docker/podman daemon to create and manage
> outside-looking sockets for an application inside the container,
> but that is not available today AFAIK and I'm not sure if it
> ever will.

Wait, when you say there is no entity like management software or a
system administrator, then how does OVS know to instantiate the new
port? I guess something still needs to invoke ovs-ctl add-port?

Can you describe the steps used today (without the broker) for
instantiating a new DPDK app container and connecting it to OVS?
Although my interest is in the vhost-user protocol I think it's
necessary to understand the OVS requirements here and I know little
about them.

Stefan
Stefan Hajnoczi March 19, 2021, 2:16 p.m. UTC | #7
On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> > On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> BTW what is the security model of the broker? Unlike pathname UNIX
> >> domain sockets there is no ownership permission check.
> > 
> > I thought about this.  Yes, we should allow connection to this socket
> > for a wide group of applications.  That might be a problem.
> > However, 2 applications need to know the 1024 (at most) byte key in
> > order to connect to each other.  This might be considered as a
> > sufficient security model in case these keys are not predictable.
> > Suggestions on how to make this more secure are welcome.
> 
> Digging more into unix sockets, I think that broker might use
> SO_PEERCRED to identify at least a uid and gid of a client.
> This way we can implement policies, e.g. one client might
> request to pair it only with clients from the same group or
> from the same user.
> 
> This is actually a great extension for the SocketPair Broker Protocol.
> 
> Might even use SO_PEERSEC to enforce even stricter policies
> based on selinux context.

Some piece of software or an administrator would need to understand the
pid/uid/gid mappings used by specific containers in order to configure
security policies in the broker like "app1 is allowed to connect to
app2's socket". This is probably harder than it looks (and DBus already
has everything to do this and more).

Stefan
Stefan Hajnoczi March 19, 2021, 2:39 p.m. UTC | #8
Hi Ilya,
By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
is restarted then existing vhost-user connections drop with an error but
QEMU could attempt to reconnect to the UNIX domain socket which the new
OVS instance will set up.

Why is it impossible to reconnect when OVS owns the listen socket?

Stefan
Ilya Maximets March 19, 2021, 3:29 p.m. UTC | #9
On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> And some housekeeping usually required for applications in case the
>>>> socket server terminated abnormally and socket files left on a file
>>>> system:
>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>>
>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>> might accidentally hijack an existing listen socket, but that can be
>>> solved with a pidfile.
>>
>> How exactly this could be solved with a pidfile?
> 
> A pidfile prevents two instances of the same service from running at the
> same time.
> 
> The same effect can be achieved by the container orchestrator, systemd,
> etc too because it refuses to run the same service twice.

Sure. I understand that.  My point was that these could be 2 different
applications and they might not know which process to look for.

> 
>> And what if this is
>> a different application that tries to create a socket on a same path?
>> e.g. QEMU creates a socket (started in a server mode) and user
>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
>> to an existing socket file and will fail.  Subsequently port creation
>> in OVS will fail.   We can't allow OVS to unlink files because this
>> way OVS users will have ability to unlink random sockets that OVS has
>> access to and we also has no idea if it's a QEMU that created a file
>> or it was a virtio-user application or someone else.
> 
> If rte_vhost unlinks the socket then the user will find that networking
> doesn't work. They can either hot unplug the QEMU vhost-user-net device
> or restart QEMU, depending on whether they need to keep the guest
> running or not. This is a misconfiguration that is recoverable.

True, it's recoverable, but at a high cost.  Restart of a VM is rarely
desirable, and the application inside the guest might not cope well
with a hot re-plug of a device that it was actively using.  I'd expect
a DPDK application that runs inside a guest on some virtio-net device
to crash after that kind of manipulation, especially if it uses an
older version of DPDK.

> 
> Regarding letting OVS unlink files, I agree that it shouldn't if this
> create a security issue. I don't know the security model of OVS.

In general, the privileges of the ovs-vswitchd daemon might be completely
different from the privileges required to invoke control utilities or
to access the configuration database.  So, yes, we should not allow
that.

> 
>> There are, probably, ways to detect if there is any alive process that
>> has this socket open, but that sounds like too much for this purpose,
>> also I'm not sure if it's possible if actual user is in a different
>> container.
>> So I don't see a good reliable way to detect these conditions.  This
>> falls on shoulders of a higher level management software or a user to
>> clean these socket files up before adding ports.
> 
> Does OVS always run in the same net namespace (pod) as the DPDK
> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> AF_UNIX sockets don't have a filesystem path and the socket address
> disappears when there is no process listening anymore.

OVS is usually started right on the host in the main network namespace.
In case it's started in a pod, it will run in a separate container but
configured with host networking.  Applications almost exclusively run
in separate pods.

> 
>>>> This patch-set aims to eliminate most of the inconveniences by
>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>
>>> I don't understand yet why this is useful for vhost-user, where the
>>> creation of the vhost-user device backend and its use by a VMM are
>>> closely managed by one piece of software:
>>>
>>> 1. Unlink the socket path.
>>> 2. Create, bind, and listen on the socket path.
>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>    As soon as the vhost-user device backend calls accept(2) the
>>>    connection will proceed (there is no need for sleeping).
>>>
>>> This approach works across containers without a broker.
>>
>> Not sure if I fully understood a question here, but anyway.
>>
>> This approach works fine if you know what application to run.
>> In case of a k8s cluster, it might be a random DPDK application
>> with virtio-user ports running inside a container and want to
>> have a network connection.  Also, this application needs to run
>> virtio-user in server mode, otherwise restart of the OVS will
>> require restart of the application.  So, you basically need to
>> rely on a third-party application to create a socket with a right
>> name and in a correct location that is shared with a host, so
>> OVS can find it and connect.
>>
>> In a VM world everything is much more simple, since you have
>> a libvirt and QEMU that will take care of all of these stuff
>> and which are also under full control of management software
>> and a system administrator.
>> In case of a container with a "random" DPDK application inside
>> there is no such entity that can help.  Of course, some solution
>> might be implemented in docker/podman daemon to create and manage
>> outside-looking sockets for an application inside the container,
>> but that is not available today AFAIK and I'm not sure if it
>> ever will.
> 
> Wait, when you say there is no entity like management software or a
> system administrator, then how does OVS know to instantiate the new
> port? I guess something still needs to invoke ovs-ctl add-port?

I didn't mean that there is no application that configures
everything.  Of course, there is.  I mean that there is no entity
that abstracts all that socket machinery from the user's
application that runs inside the container.  QEMU hides all the
details of the connection to the vhost backend and presents the
device to the guest kernel as a PCI device with a network interface.
So, the application inside the VM doesn't have to care that there is
actually a socket connected to OVS that implements the backend and
forwards traffic somewhere.  For the application it's just a usual
network interface.
But in the container world, the application itself has to handle all
of that by creating a virtio-user device that connects to some
socket with OVS on the other side.

> 
> Can you describe the steps used today (without the broker) for
> instantiating a new DPDK app container and connecting it to OVS?
> Although my interest is in the vhost-user protocol I think it's
> necessary to understand the OVS requirements here and I know little
> about them.

I might describe some things wrong since I worked with k8s and CNI
plugins last time ~1.5 years ago, but the basic schema will look
something like this:

1. user decides to start a new pod and requests k8s to do that
   via cmdline tools or some API calls.

2. k8s scheduler looks for available resources by asking resource
   manager plugins, finds an appropriate physical host and asks the
   kubelet daemon local to that node to launch a new pod there.

3. kubelet asks the local CNI plugin to allocate network resources
   and annotate the pod with required mount points, devices that
   need to be passed in and environment variables.
   (This is, IIRC, a gRPC connection.  It might be multus-cni
   or kuryr-kubernetes or any other CNI plugin.  The CNI plugin is
   usually deployed as a system DaemonSet, so it runs in a
   separate pod.)

4. Assuming that the vhost-user connection is requested in server
   mode, the CNI plugin will:
   4.1 create a directory for the vhost-user socket.
   4.2 add this directory to pod annotations as a mount point.
   4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
       by connecting to ovsdb-server via JSONRPC directly.
       It will set the port type as dpdkvhostuserclient and specify
       socket-path as a path inside the directory it created.
       (OVS will create the port and rte_vhost will enter the
        re-connection loop since the socket does not exist yet.)
   4.4 set up the socket file location as an environment variable
       in the pod annotations.
   4.5 report success to kubelet.

5. kubelet will finish all other preparations and resource
   allocations and will ask docker/podman to start a container
   with all mount points, devices and environment variables from
   the pod annotation.

6. docker/podman starts a container.
   Worth mentioning here that in many cases the initial process of
   a container is not the actual application that will use the
   vhost-user connection, but likely a shell that will invoke
   the actual application.

7. The application starts inside the container, checks the environment
   variables (actually, checking of environment variables usually
   happens in a shell script that invokes the application with the
   correct arguments) and creates a net_virtio_user port in server
   mode.  At this point the socket file will be created.
   (Since we're running a third-party application inside the container
    we can only assume that it will do what is written here; it's
    the responsibility of the application developer to do the right
    thing.)

8. OVS successfully re-connects to the newly created socket in the
   shared directory and the vhost-user protocol establishes network
   connectivity.

As you can see, there are way too many entities and communication
methods involved.  So, passing a pre-opened file descriptor from
the CNI all the way down to the application is not as easy as it is
in case of QEMU+libvirt.

Best regards, Ilya Maximets.
Ilya Maximets March 19, 2021, 3:37 p.m. UTC | #10
On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
>> On 3/18/21 8:47 PM, Ilya Maximets wrote:
>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> BTW what is the security model of the broker? Unlike pathname UNIX
>>>> domain sockets there is no ownership permission check.
>>>
>>> I thought about this.  Yes, we should allow connection to this socket
>>> for a wide group of applications.  That might be a problem.
>>> However, 2 applications need to know the 1024 (at most) byte key in
>>> order to connect to each other.  This might be considered as a
>>> sufficient security model in case these keys are not predictable.
>>> Suggestions on how to make this more secure are welcome.
>>
>> Digging more into unix sockets, I think that broker might use
>> SO_PEERCRED to identify at least a uid and gid of a client.
>> This way we can implement policies, e.g. one client might
>> request to pair it only with clients from the same group or
>> from the same user.
>>
>> This is actually a great extension for the SocketPair Broker Protocol.
>>
>> Might even use SO_PEERSEC to enforce even stricter policies
>> based on selinux context.
> 
> Some piece of software or an administrator would need to understand the
> pid/uid/gid mappings used by specific containers in order to configure
> security policies in the broker like "app1 is allowed to connect to
> app2's socket". This is probably harder than it looks (and DBus already
> has everything to do this and more).

AFAIU, none of the orchestration solutions configures different access
rights for sockets right now.  So it should probably not be a big
problem for current setups.

I'd expect the pid/uid/gid to be mapped to the host namespace if
SO_PEERCRED is requested from there.  An interesting thing to check, though.
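For reference, reading the peer credentials is a one-liner on Linux; a minimal sketch of the broker-side check (the pairing policy itself is up to the broker):

```python
import os
import socket
import struct

# SO_PEERCRED returns a struct ucred: three native ints (pid, uid, gid).
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
creds = a.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                     struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)

# A broker could then implement policies like "pair only clients
# that run as the same user":
same_user = (uid == os.getuid())
```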

For DBus, as I mentioned in the other reply, IIUC, it will require
mounting /run/user/*<user-id>* or its bits and some other stuff to the
container in order to make it work.  Also it will, probably, require
running containers in privileged mode which will wipe out most of the
security.

Best regards, Ilya Maximets.
Stefan Hajnoczi March 19, 2021, 4:01 p.m. UTC | #11
On Fri, Mar 19, 2021 at 04:37:01PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> >>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> BTW what is the security model of the broker? Unlike pathname UNIX
> >>>> domain sockets there is no ownership permission check.
> >>>
> >>> I thought about this.  Yes, we should allow connection to this socket
> >>> for a wide group of applications.  That might be a problem.
> >>> However, 2 applications need to know the 1024 (at most) byte key in
> >>> order to connect to each other.  This might be considered as a
> >>> sufficient security model in case these keys are not predictable.
> >>> Suggestions on how to make this more secure are welcome.
> >>
> >> Digging more into unix sockets, I think that broker might use
> >> SO_PEERCRED to identify at least a uid and gid of a client.
> >> This way we can implement policies, e.g. one client might
> >> request to pair it only with clients from the same group or
> >> from the same user.
> >>
> >> This is actually a great extension for the SocketPair Broker Protocol.
> >>
> >> Might even use SO_PEERSEC to enforce even stricter policies
> >> based on selinux context.
> > 
> > Some piece of software or an administrator would need to understand the
> > pid/uid/gid mappings used by specific containers in order to configure
> > security policies in the broker like "app1 is allowed to connect to
> > app2's socket". This is probably harder than it looks (and DBus already
> > has everything to do this and more).
> 
> AFAIU, neither of orchestration solutions configures different access
> rights for sockets right now.  So, it, probably, should not be a big
> problem for current setups.
>
> I'd expect pid/uid/gid being mapped to host namespace if SO_PEERCRED
> requested from it.  Interesting thing to check, though.
> 
> For DBus, as I mentioned in the other reply, IIUC, it will require
> mounting /run/user/*<user-id>* or its bits and some other stuff to the
> container in order to make it work.  Also it will, probably, require
> running containers in privileged mode which will wipe out most of the
> security.

Flatpak has sandboxed D-Bus for it application containers:
https://docs.flatpak.org/en/latest/sandbox-permissions.html

"Limited access to the session D-Bus instance - an app can only own its
own name on the bus."

I don't know how it works, though.

Stefan
Marc-André Lureau March 19, 2021, 4:02 p.m. UTC | #12
Hi

On Fri, Mar 19, 2021 at 7:37 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 3/19/21 3:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 09:14:27PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 8:47 PM, Ilya Maximets wrote:
> >>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> BTW what is the security model of the broker? Unlike pathname UNIX
> >>>> domain sockets there is no ownership permission check.
> >>>
> >>> I thought about this.  Yes, we should allow connection to this socket
> >>> for a wide group of applications.  That might be a problem.
> >>> However, 2 applications need to know the 1024 (at most) byte key in
> >>> order to connect to each other.  This might be considered as a
> >>> sufficient security model in case these keys are not predictable.
> >>> Suggestions on how to make this more secure are welcome.
> >>
> >> Digging more into unix sockets, I think that broker might use
> >> SO_PEERCRED to identify at least a uid and gid of a client.
> >> This way we can implement policies, e.g. one client might
> >> request to pair it only with clients from the same group or
> >> from the same user.
> >>
> >> This is actually a great extension for the SocketPair Broker Protocol.
> >>
> >> Might even use SO_PEERSEC to enforce even stricter policies
> >> based on selinux context.
> >
> > Some piece of software or an administrator would need to understand the
> > pid/uid/gid mappings used by specific containers in order to configure
> > security policies in the broker like "app1 is allowed to connect to
> > app2's socket". This is probably harder than it looks (and DBus already
> > has everything to do this and more).
>
> AFAIU, neither of orchestration solutions configures different access
> rights for sockets right now.  So, it, probably, should not be a big
> problem for current setups.
>
> I'd expect pid/uid/gid being mapped to host namespace if SO_PEERCRED
> requested from it.  Interesting thing to check, though.
>
> For DBus, as I mentioned in the other reply, IIUC, it will require
> mounting /run/user/*<user-id>* or its bits and some other stuff to the
> container in order to make it work.  Also it will, probably, require
> running containers in privileged mode which will wipe out most of the
> security.
>

Right, if you need to communicate across namespaces, then it becomes less
common.

However, having a DBus socket (as a private bus) exposed in the NS isn't
going to be any different than having the broker socket exposed, unless I
am missing something.

You'll have the same issues discussed earlier about uid mapping, for
peercred authentication to work.
Ilya Maximets March 19, 2021, 4:11 p.m. UTC | #13
On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
> Hi Ilya,
> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
> is restarted then existing vhost-user connections drop with an error but
> QEMU could attempt to reconnect to the UNIX domain socket which the new
> OVS instance will set up.
> 
> Why is it impossible to reconnect when OVS owns the listen socket?

Well, AFAIK, qemu reconnects client connections only:

    ``reconnect`` sets the timeout for reconnecting on non-server
    sockets when the remote end goes away. qemu will delay this many
    seconds and then attempt to reconnect. Zero disables reconnecting,
    and is the default.

I'm not sure about the exact reason.  It was historically this way.
For me it doesn't make much sense.  I mean, you're right that it's
just a socket, so it should not matter who listens and who connects.
If reconnection is possible in one direction, it should be possible
in the opposite direction too.

dpdkvhostuser was deprecated just to scare users and force them to
migrate to dpdkvhostuserclient and avoid constant bug reports like:

  "OVS service restarted and network is lost now".

BTW, virtio-user ports in DPDK don't support re-connection in client
mode either.

BTW2, with a SocketPair Broker it might be cheaper to implement server
reconnection in QEMU because all connections in this case are client
connections, i.e. both ends connect() to a broker.

Best regards, Ilya Maximets.
Ilya Maximets March 19, 2021, 4:45 p.m. UTC | #14
On 3/19/21 5:11 PM, Ilya Maximets wrote:
> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>> Hi Ilya,
>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>> is restarted then existing vhost-user connections drop with an error but
>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>> OVS instance will set up.
>>
>> Why is it impossible to reconnect when OVS owns the listen socket?
> 
> Well, AFAIK, qemu reconnects client connections only:
> 
>     ``reconnect`` sets the timeout for reconnecting on non-server
>     sockets when the remote end goes away. qemu will delay this many
>     seconds and then attempt to reconnect. Zero disables reconnecting,
>     and is the default.
> 
> I'm not sure about exact reason.  It was historically this way.
> For me it doesn't make much sense.  I mean, your right that it's
> just a socket, so it should not matter who listens and who connects.
> If reconnection is possible in one direction, it should be possible
> in the opposite direction too.

Sorry, my thought slipped. :)  Yes, QEMU supports re-connection
for client sockets.  So, in theory, dpdkvhostuser ports should work
after re-connection.  And that would be nice.  I don't remember
right now why this doesn't work...  Maybe the vhost-user parts in QEMU
don't handle this case.  Need to dig some more into that and refresh
my memory.  It was so long ago...

Maxime, do you remember?

> 
> dpdkvhostuser was deprecated just to scare users and force them to
> migrate to dpdkvhostuserclient and avoid constant bug reports like:
> 
>   "OVS service restarted and network is lost now".
> 
> BTW, virtio-user ports in DPDK doesn't support re-connection in client
> mode too.

This is still true, though.  virtio-user in client mode doesn't reconnect.

> 
> BTW2, with SocketPair Broker it might be cheaper to implement server
> reconnection in QEMU because all connections in these case are client
> connections, i.e. both ends will connect() to a broker.
> 
> Bets regards, Ilya Maximets.
>
Stefan Hajnoczi March 19, 2021, 5:21 p.m. UTC | #15
On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> And some housekeeping usually required for applications in case the
> >>>> socket server terminated abnormally and socket files left on a file
> >>>> system:
> >>>>  "failed to bind to vhu: Address already in use; remove it and try again"
> >>>
> >>> QEMU avoids this by unlinking before binding. The drawback is that users
> >>> might accidentally hijack an existing listen socket, but that can be
> >>> solved with a pidfile.
> >>
> >> How exactly this could be solved with a pidfile?
> > 
> > A pidfile prevents two instances of the same service from running at the
> > same time.
> > 
> > The same effect can be achieved by the container orchestrator, systemd,
> > etc too because it refuses to run the same service twice.
> 
> Sure. I understand that.  My point was that these could be 2 different
> applications and they might not know which process to look for.
> 
> > 
> >> And what if this is
> >> a different application that tries to create a socket on a same path?
> >> e.g. QEMU creates a socket (started in a server mode) and user
> >> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >> to an existing socket file and will fail.  Subsequently port creation
> >> in OVS will fail.   We can't allow OVS to unlink files because this
> >> way OVS users will have ability to unlink random sockets that OVS has
> >> access to and we also has no idea if it's a QEMU that created a file
> >> or it was a virtio-user application or someone else.
> > 
> > If rte_vhost unlinks the socket then the user will find that networking
> > doesn't work. They can either hot unplug the QEMU vhost-user-net device
> > or restart QEMU, depending on whether they need to keep the guest
> > running or not. This is a misconfiguration that is recoverable.
> 
> True, it's recoverable, but with a high cost.  Restart of a VM is rarely
> desirable.  And the application inside the guest might not feel itself
> well after hot re-plug of a device that it actively used.  I'd expect
> a DPDK application that runs inside a guest on some virtio-net device
> to crash after this kind of manipulations.  Especially, if it uses some
> older versions of DPDK.

This unlink issue is probably something we think differently about.
There are many ways for users to misconfigure things when working with
system tools. If it's possible to catch misconfigurations, that is
preferable. In this case it's just the way pathname AF_UNIX domain
sockets work and IMO it's better not to have problems starting the
service due to stale files than to insist on preventing
misconfigurations. QEMU and DPDK do this differently and both seem to be
successful, so ¯\_(ツ)_/¯.
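The unlink-before-bind approach QEMU takes, and the failure mode it avoids, in a few lines of Python (throwaway temp path):

```python
import errno
import os
import socket
import tempfile

def listen_unix(path):
    """QEMU-style: unlink a possibly stale socket file, then bind+listen."""
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.bind(path)
    s.listen(1)
    return s

path = os.path.join(tempfile.mkdtemp(), "vhu0.sock")
listen_unix(path).close()          # server died; the file stays behind
stale = os.path.exists(path)

# A naive bind() hits exactly the error quoted at the top of the thread:
naive = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
addr_in_use = False
try:
    naive.bind(path)
except OSError as e:
    addr_in_use = (e.errno == errno.EADDRINUSE)

restarted = listen_unix(path)      # unlink-first restart succeeds
```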

> > 
> > Regarding letting OVS unlink files, I agree that it shouldn't if this
> > create a security issue. I don't know the security model of OVS.
> 
> In general privileges of a ovs-vswitchd daemon might be completely
> different from privileges required to invoke control utilities or
> to access the configuration database.  SO, yes, we should not allow
> that.

That can be locked down by restricting the socket path to a file beneath
/var/run/ovs/vhost-user/.
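One way such a restriction could be enforced (a sketch; the directory name comes from the sentence above, and a real implementation would also resolve symlinks):

```python
import os

ALLOWED_ROOT = "/var/run/ovs/vhost-user"

def socket_path_allowed(path, root=ALLOWED_ROOT):
    """Reject socket paths that escape the allowed directory via '..'.
    normpath() is purely lexical; a hardened version would use
    realpath() as well to defeat symlink tricks."""
    norm = os.path.normpath(path)
    # commonpath() compares whole components, so "/var/run/ovs/vhost-userX"
    # does not pass as a prefix match.
    return os.path.commonpath([norm, root]) == root

allowed = socket_path_allowed("/var/run/ovs/vhost-user/vhu0.sock")
escaped = socket_path_allowed("/var/run/ovs/vhost-user/../../../etc/shadow")
```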

> > 
> >> There are, probably, ways to detect if there is any alive process that
> >> has this socket open, but that sounds like too much for this purpose,
> >> also I'm not sure if it's possible if actual user is in a different
> >> container.
> >> So I don't see a good reliable way to detect these conditions.  This
> >> falls on shoulders of a higher level management software or a user to
> >> clean these socket files up before adding ports.
> > 
> > Does OVS always run in the same net namespace (pod) as the DPDK
> > application? If yes, then abstract AF_UNIX sockets can be used. Abstract
> > AF_UNIX sockets don't have a filesystem path and the socket address
> > disappears when there is no process listening anymore.
> 
> OVS is usually started right on the host in a main network namespace.
> In case it's started in a pod, it will run in a separate container but
> configured with a host network.  Applications almost exclusively runs
> in separate pods.

Okay.

> >>>> This patch-set aims to eliminate most of the inconveniences by
> >>>> leveraging an infrastructure service provided by a SocketPair Broker.
> >>>
> >>> I don't understand yet why this is useful for vhost-user, where the
> >>> creation of the vhost-user device backend and its use by a VMM are
> >>> closely managed by one piece of software:
> >>>
> >>> 1. Unlink the socket path.
> >>> 2. Create, bind, and listen on the socket path.
> >>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>    RPC, spawn a process, etc) and pass in the listen fd.
> >>> 4. In the meantime the VMM can open the socket path and call connect(2).
> >>>    As soon as the vhost-user device backend calls accept(2) the
> >>>    connection will proceed (there is no need for sleeping).
> >>>
> >>> This approach works across containers without a broker.
> >>
> >> Not sure if I fully understood a question here, but anyway.
> >>
> >> This approach works fine if you know what application to run.
> >> In case of a k8s cluster, it might be a random DPDK application
> >> with virtio-user ports running inside a container and want to
> >> have a network connection.  Also, this application needs to run
> >> virtio-user in server mode, otherwise restart of the OVS will
> >> require restart of the application.  So, you basically need to
> >> rely on a third-party application to create a socket with a right
> >> name and in a correct location that is shared with a host, so
> >> OVS can find it and connect.
> >>
> >> In a VM world everything is much more simple, since you have
> >> a libvirt and QEMU that will take care of all of these stuff
> >> and which are also under full control of management software
> >> and a system administrator.
> >> In case of a container with a "random" DPDK application inside
> >> there is no such entity that can help.  Of course, some solution
> >> might be implemented in docker/podman daemon to create and manage
> >> outside-looking sockets for an application inside the container,
> >> but that is not available today AFAIK and I'm not sure if it
> >> ever will.
> > 
> > Wait, when you say there is no entity like management software or a
> > system administrator, then how does OVS know to instantiate the new
> > port? I guess something still needs to invoke ovs-ctl add-port?
> 
> I didn't mean that there is no application that configures
> everything.  Of course, there is.  I mean that there is no
> entity that abstracts all that socket machinery from the user's
> application that runs inside the container.  QEMU hides all the
> details of the connection to the vhost backend and presents the
> device as a PCI device with a network interface wrapping from
> the guest kernel.  So, the application inside the VM shouldn't
> care that there is actually a socket connected to OVS that
> implements the backend and forwards traffic somewhere.  For the
> application it's just a usual network interface.
> But in the container world, the application has to handle all
> that itself by creating a virtio-user device that connects to
> some socket that has OVS on the other side.
> 
> > 
> > Can you describe the steps used today (without the broker) for
> > instantiating a new DPDK app container and connecting it to OVS?
> > Although my interest is in the vhost-user protocol I think it's
> > necessary to understand the OVS requirements here and I know little
> > about them.
> 
> I might describe some things wrong since I last worked with k8s
> and CNI plugins ~1.5 years ago, but the basic schema will look
> something like this:
> 
> 1. user decides to start a new pod and requests k8s to do that
>    via cmdline tools or some API calls.
> 
> 2. k8s scheduler looks for available resources by asking resource
>    manager plugins, finds an appropriate physical host and asks
>    the kubelet daemon local to that node to launch a new pod there.
> 
> 3. kubelet asks the local CNI plugin to allocate network resources
>    and annotate the pod with required mount points, devices that
>    need to be passed in, and environment variables.
>    (This is, IIRC, a gRPC connection.  It might be multus-cni
>    or kuryr-kubernetes or any other CNI plugin.  The CNI plugin is
>    usually deployed as a system DaemonSet, so it runs in a
>    separate pod.)
> 
> 4. Assuming that the vhost-user connection is requested in server
>    mode, the CNI plugin will:
>    4.1 create a directory for a vhost-user socket.
>    4.2 add this directory to pod annotations as a mount point.
>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
>        by connecting to ovsdb-server via JSON-RPC directly.
>        It will set the port type to dpdkvhostuserclient and specify
>        socket-path as a path inside the directory it created.
>        (OVS will create the port and rte_vhost will enter the
>         re-connection loop since the socket does not exist yet.)
>    4.4 set up the socket file location as an environment variable
>        in the pod annotations.
>    4.5 report success to kubelet.
> 
> 5. kubelet will finish all other preparations and resource
>    allocations and will ask docker/podman to start a container
>    with all mount points, devices and environment variables from
>    the pod annotation.
> 
> 6. docker/podman starts the container.
>    It's worth mentioning here that in many cases the initial
>    process of a container is not the actual application that will
>    use the vhost-user connection, but likely a shell that will
>    invoke the actual application.
> 
> 7. The application starts inside the container, checks the
>    environment variables (actually, checking of environment
>    variables usually happens in a shell script that invokes the
>    application with the correct arguments) and creates a
>    net_virtio_user port in server mode.  At this point the socket
>    file will be created.
>    (Since we're running a third-party application inside the
>     container we can only assume that it will do what is written
>     here; it's the responsibility of the application developer
>     to do the right thing.)
> 
> 8. OVS successfully re-connects to the newly created socket in a
>    shared directory and vhost-user protocol establishes the network
>    connectivity.
> 
> As you can see, there are way too many entities and communication
> methods involved.  So, passing a pre-opened file descriptor from
> the CNI all the way down to the application is not as easy as it
> is in the case of QEMU+libvirt.
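
The quoted flow (steps 4.3, 7 and 8) can be sketched with plain AF_UNIX sockets. The path and helper names below are made up for illustration and only stand in for what rte_vhost's client mode and net_virtio_user's server mode do internally:

```python
import os
import socket
import tempfile
import threading
import time

def app_create_server_socket(path):
    # Step 7: the application inside the container opens the
    # vhost-user socket in server mode; the file appears on the
    # shared volume only now.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock

def ovs_reconnect_loop(path, retries=100, delay=0.05):
    # Steps 4.3 and 8: the client-mode side (rte_vhost in OVS)
    # retries until the socket file exists and accepts connections.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    for _ in range(retries):
        try:
            sock.connect(path)
            return sock
        except (FileNotFoundError, ConnectionRefusedError):
            time.sleep(delay)
    raise TimeoutError("vhost-user socket never appeared")

path = os.path.join(tempfile.mkdtemp(), "vhu0.sock")

# OVS starts retrying before the container has created the socket.
result = {}
t = threading.Thread(target=lambda: result.update(c=ovs_reconnect_loop(path)))
t.start()
time.sleep(0.2)                    # the container is still starting up
server = app_create_server_socket(path)
t.join()
conn, _ = server.accept()          # vhost-user negotiation would start here
```

The sleep loop on the OVS side is exactly the re-connection behavior criticized below.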

File descriptor passing isn't necessary if OVS owns the listen socket
and the application container is the one who connects. That's why I
asked why dpdkvhostuser was deprecated in another email. The benefit of
doing this would be that the application container can instantly connect
to OVS without a sleep loop.
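
A minimal sketch of that ordering, assuming nothing about OVS internals: because the listen socket exists before the client starts, connect(2) succeeds immediately and the kernel queues the connection until accept(2) is called.

```python
import os
import socket
import tempfile

# Illustrative path; stands in for the socket a server-mode
# (dpdkvhostuser-style) OVS port would create up front.
path = os.path.join(tempfile.mkdtemp(), "dpdkvhostuser0.sock")

# OVS owns the listen socket and creates it first ...
ovs_listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
ovs_listener.bind(path)
ovs_listener.listen(1)

# ... so the application container connects instantly: the kernel
# queues the connection until accept(2), no sleep loop required.
app = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
app.connect(path)
conn, _ = ovs_listener.accept()
```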

I still don't get the attraction of the broker idea. The pros:
+ Overcomes the issue with stale UNIX domain socket files
+ Eliminates the re-connect sleep loop

Neutral:
* vhost-user UNIX domain socket directory container volume is replaced
  by broker UNIX domain socket bind mount
* UNIX domain socket naming conflicts become broker key naming conflicts

The cons:
- Requires running a new service on the host with potential security
  issues
- Requires support in third-party applications, QEMU, and DPDK/OVS
- The old code must be kept for compatibility with non-broker
  configurations, especially since third-party applications may not
  support the broker. Developers and users will have to learn about both
  options and decide which one to use.

This seems like a modest improvement for the complexity and effort
involved. The same pros can be achieved by:
* Adding unlink(2) to rte_vhost (or applications can add rm -f
  $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
  it doesn't catch a misconfiguration where the user launches two
  processes with the same socket path.
* Reversing the direction of the client/server relationship to
  eliminate the re-connect sleep loop at startup. I'm unsure whether
  this is possible.
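
The unlink(2) option can be sketched as follows. `bind_with_unlink` is a hypothetical helper in QEMU's unlink-before-bind style, and the second call demonstrates the uncaught misconfiguration mentioned above:

```python
import os
import socket
import stat
import tempfile

def bind_with_unlink(path):
    # QEMU-style unlink-before-bind: remove a stale socket file so a
    # restart never fails with "Address already in use".  The caveat
    # from the list above: this cannot tell a stale file from the
    # socket of a *live* second instance, which it silently hijacks.
    try:
        if stat.S_ISSOCK(os.stat(path).st_mode):  # only ever unlink sockets
            os.unlink(path)
    except FileNotFoundError:
        pass
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock

path = os.path.join(tempfile.mkdtemp(), "vhu0.sock")
first = bind_with_unlink(path)    # would leave a stale file if it crashed
second = bind_with_unlink(path)   # restart succeeds despite the old file
```

Note that after the second call, `first` still listens on the now-unlinked inode; new clients silently reach `second` instead.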

That said, the broker idea doesn't affect the vhost-user protocol itself
and is more of an OVS/DPDK topic. I may just not be familiar enough with
OVS/DPDK to understand the benefits of the approach.

Stefan
Adrian Moreno March 23, 2021, 5:57 p.m. UTC | #16
On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>> And some housekeeping usually required for applications in case the
>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>> system:
>>>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>
>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>> solved with a pidfile.
>>>>
>>>> How exactly could this be solved with a pidfile?
>>>
>>> A pidfile prevents two instances of the same service from running at the
>>> same time.
>>>
>>> The same effect can be achieved by the container orchestrator, systemd,
>>> etc too because it refuses to run the same service twice.
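
A common way to implement that pidfile check is an exclusive flock(2) held for the lifetime of the process; this is only an illustrative sketch, not what QEMU itself does:

```python
import fcntl
import os
import tempfile

pidfile_path = os.path.join(tempfile.mkdtemp(), "vhost-backend.pid")

def acquire_pidfile(path):
    # Hold an exclusive, non-blocking flock(2) on the pidfile for the
    # lifetime of the process; a second instance fails fast instead
    # of hijacking the listen socket.
    f = open(path, "a+")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None                 # another instance holds the lock
    f.seek(0)
    f.truncate()
    f.write(str(os.getpid()))
    f.flush()
    return f                        # keep open; the lock dies with the fd

first = acquire_pidfile(pidfile_path)
# flock is per open file description, so even a second open() in the
# same process is refused, just like a second service instance.
second = acquire_pidfile(pidfile_path)
```

Unlike a stale socket file, the flock vanishes automatically when the process dies, so no cleanup is needed on crash.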
>>
>> Sure. I understand that.  My point was that these could be 2 different
>> applications and they might not know which process to look for.
>>
>>>
>>>> And what if this is
>>>> a different application that tries to create a socket on the same
>>>> path?  E.g. QEMU creates a socket (started in server mode) and the
>>>> user accidentally created a dpdkvhostuser port in Open vSwitch
>>>> instead of dpdkvhostuserclient.  This way the rte_vhost library
>>>> will try to bind to an existing socket file and will fail.
>>>> Subsequently, port creation in OVS will fail.  We can't allow OVS
>>>> to unlink files because then OVS users would have the ability
>>>> to unlink random sockets that OVS has access to, and we also have
>>>> no idea whether it was QEMU that created the file, a virtio-user
>>>> application, or someone else.
>>>
>>> If rte_vhost unlinks the socket then the user will find that networking
>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>> or restart QEMU, depending on whether they need to keep the guest
>>> running or not. This is a misconfiguration that is recoverable.
>>
>> True, it's recoverable, but at a high cost.  A restart of a VM is
>> rarely desirable.  And the application inside the guest might not
>> cope well with hot re-plug of a device that it was actively using.
>> I'd expect a DPDK application that runs inside a guest on some
>> virtio-net device to crash after this kind of manipulation,
>> especially if it uses an older version of DPDK.
> 
> This unlink issue is probably something we think differently about.
> There are many ways for users to misconfigure things when working with
> system tools. If it's possible to catch misconfigurations that is
> preferrable. In this case it's just the way pathname AF_UNIX domain
> sockets work and IMO it's better not to have problems starting the
> service due to stale files than to insist on preventing
> misconfigurations. QEMU and DPDK do this differently and both seem to be
> successful, so ¯\_(ツ)_/¯.
> 
>>>
>>> Regarding letting OVS unlink files, I agree that it shouldn't if
>>> this creates a security issue. I don't know the security model of OVS.
>>
>> In general, the privileges of the ovs-vswitchd daemon might be
>> completely different from the privileges required to invoke
>> control utilities or to access the configuration database.
>> So, yes, we should not allow that.
> 
> That can be locked down by restricting the socket path to a file beneath
> /var/run/ovs/vhost-user/.
> 
>>>
>>>> There are, probably, ways to detect if there is any live process
>>>> that has this socket open, but that sounds like too much for this
>>>> purpose, and I'm also not sure if it's possible when the actual
>>>> user is in a different container.
>>>> So I don't see a good reliable way to detect these conditions.
>>>> This falls on the shoulders of higher-level management software
>>>> or the user to clean these socket files up before adding ports.
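
One probe that sometimes works within a single namespace (and, as noted above, may be impossible across containers) is attempting connect(2): ECONNREFUSED on an existing socket file means no one is listening behind it, i.e. the file is stale. The helper name and paths here are illustrative:

```python
import os
import socket
import tempfile

def socket_is_stale(path):
    # Probe an existing UNIX socket file: ECONNREFUSED means nobody
    # is listening behind it any more, so it is a stale leftover.
    # This check is inherently racy and only sees sockets reachable
    # from the caller's own namespace.
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except (ConnectionRefusedError, FileNotFoundError):
        return True
    finally:
        probe.close()
    return False

d = tempfile.mkdtemp()
live_path = os.path.join(d, "live.sock")
stale_path = os.path.join(d, "stale.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(live_path)
server.listen(1)

dead = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
dead.bind(stale_path)
dead.close()                # bound, then gone: the file stays behind
```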
>>>
>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>> disappears when there is no process listening anymore.
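
A quick illustration of Linux abstract-namespace sockets (a leading NUL byte in the address); they are visible only within one network namespace, which is exactly the constraint discussed here:

```python
import os
import socket

# Linux-only: a leading NUL byte places the name in the abstract
# namespace -- there is no filesystem entry to go stale, and the
# address disappears as soon as the last listener closes it.
name = b"\0vhost-demo-%d" % os.getpid()   # illustrative name

listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(name)
listener.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(name)                      # works like a pathname socket

listener.close()                          # the address vanishes with it
try:
    socket.socket(socket.AF_UNIX, socket.SOCK_STREAM).connect(name)
except ConnectionRefusedError:
    pass                                  # nothing left behind to unlink
```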
>>
>> OVS is usually started right on the host in the main network
>> namespace.  In case it's started in a pod, it will run in a
>> separate container but configured with host networking.
>> Applications almost exclusively run in separate pods.
> 
> Okay.
> 
>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>
>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>> closely managed by one piece of software:
>>>>>
>>>>> 1. Unlink the socket path.
>>>>> 2. Create, bind, and listen on the socket path.
>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>    connection will proceed (there is no need for sleeping).
>>>>>
>>>>> This approach works across containers without a broker.
>>>>
>>>> Not sure if I fully understood the question here, but anyway.
>>>>
>>>> This approach works fine if you know what application to run.
>>>> In case of a k8s cluster, it might be a random DPDK application
>>>> with virtio-user ports running inside a container that wants to
>>>> have a network connection.  Also, this application needs to run
>>>> virtio-user in server mode, otherwise a restart of OVS will
>>>> require a restart of the application.  So, you basically need to
>>>> rely on a third-party application to create a socket with the
>>>> right name in the correct location that is shared with the host,
>>>> so OVS can find it and connect.
>>>>
>>>> In a VM world everything is much simpler, since you have
>>>> libvirt and QEMU that take care of all of this stuff
>>>> and are also under full control of the management software
>>>> and the system administrator.
>>>> In case of a container with a "random" DPDK application inside
>>>> there is no such entity that can help.  Of course, some solution
>>>> might be implemented in the docker/podman daemon to create and
>>>> manage outward-facing sockets for an application inside the
>>>> container, but that is not available today AFAIK and I'm not
>>>> sure it ever will be.
>>>
>>> Wait, when you say there is no entity like management software or a
>>> system administrator, then how does OVS know to instantiate the new
>>> port? I guess something still needs to invoke ovs-vsctl add-port?
>>
>> I didn't mean that there is no application that configures
>> everything.  Of course, there is.  I mean that there is no
>> entity that abstracts all that socket machinery from the user's
>> application that runs inside the container.  QEMU hides all the
>> details of the connection to the vhost backend and presents the
>> device as a PCI device with a network interface wrapping from
>> the guest kernel.  So, the application inside the VM shouldn't
>> care that there is actually a socket connected to OVS that
>> implements the backend and forwards traffic somewhere.  For the
>> application it's just a usual network interface.
>> But in the container world, the application has to handle all
>> that itself by creating a virtio-user device that connects to
>> some socket that has OVS on the other side.
>>
>>>
>>> Can you describe the steps used today (without the broker) for
>>> instantiating a new DPDK app container and connecting it to OVS?
>>> Although my interest is in the vhost-user protocol I think it's
>>> necessary to understand the OVS requirements here and I know little
>>> about them.
>>
>> I might describe some things wrong since I last worked with k8s
>> and CNI plugins ~1.5 years ago, but the basic schema will look
>> something like this:
>>
>> 1. user decides to start a new pod and requests k8s to do that
>>    via cmdline tools or some API calls.
>>
>> 2. k8s scheduler looks for available resources by asking resource
>>    manager plugins, finds an appropriate physical host and asks
>>    the kubelet daemon local to that node to launch a new pod there.
>>

When the CNI is called, the pod has already been created, i.e. a
PodID exists and so does an associated network namespace. Therefore,
everything that has to do with the runtime spec, such as mount
points or devices, can no longer be modified at this point.

That's why the Device Plugin API is used to modify the Pod's spec
before the CNI chain is called.

>> 3. kubelet asks the local CNI plugin to allocate network resources
>>    and annotate the pod with required mount points, devices that
>>    need to be passed in, and environment variables.
>>    (This is, IIRC, a gRPC connection.  It might be multus-cni
>>    or kuryr-kubernetes or any other CNI plugin.  The CNI plugin is
>>    usually deployed as a system DaemonSet, so it runs in a
>>    separate pod.)
>>
>> 4. Assuming that the vhost-user connection is requested in server
>>    mode, the CNI plugin will:
>>    4.1 create a directory for a vhost-user socket.
>>    4.2 add this directory to pod annotations as a mount point.

I believe this is not possible; the CNI plugin would have to inspect
the pod's spec or otherwise determine an existing mount point where
the socket should be created.

+Billy might give more insights on this

>>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
>>        by connecting to ovsdb-server via JSON-RPC directly.
>>        It will set the port type to dpdkvhostuserclient and specify
>>        socket-path as a path inside the directory it created.
>>        (OVS will create the port and rte_vhost will enter the
>>         re-connection loop since the socket does not exist yet.)
>>    4.4 set up the socket file location as an environment variable
>>        in the pod annotations.
>>    4.5 report success to kubelet.
>>

Since the CNI cannot modify the pod's mounts it has to rely on a Device Plugin
or other external entity that can inject the mount point before the pod is created.

However, there is another use case that might be relevant: dynamic
attachment of network interfaces. In this case the CNI cannot work in
collaboration with a Device Plugin or "mount-point injector", and an
existing mount point has to be used. Also, some form of notification
mechanism has to exist to tell the workload that a new socket is ready.

>> 5. kubelet will finish all other preparations and resource
>>    allocations and will ask docker/podman to start a container
>>    with all mount points, devices and environment variables from
>>    the pod annotation.
>>
>> 6. docker/podman starts the container.
>>    It's worth mentioning here that in many cases the initial
>>    process of a container is not the actual application that will
>>    use the vhost-user connection, but likely a shell that will
>>    invoke the actual application.
>>
>> 7. The application starts inside the container, checks the
>>    environment variables (actually, checking of environment
>>    variables usually happens in a shell script that invokes the
>>    application with the correct arguments) and creates a
>>    net_virtio_user port in server mode.  At this point the socket
>>    file will be created.
>>    (Since we're running a third-party application inside the
>>     container we can only assume that it will do what is written
>>     here; it's the responsibility of the application developer
>>     to do the right thing.)
>>
>> 8. OVS successfully re-connects to the newly created socket in a
>>    shared directory and vhost-user protocol establishes the network
>>    connectivity.
>>
>> As you can see, there are way too many entities and communication
>> methods involved.  So, passing a pre-opened file descriptor from
>> the CNI all the way down to the application is not as easy as it
>> is in the case of QEMU+libvirt.
> 
> File descriptor passing isn't necessary if OVS owns the listen socket
> and the application container is the one who connects. That's why I
> asked why dpdkvhostuser was deprecated in another email. The benefit of
> doing this would be that the application container can instantly connect
> to OVS without a sleep loop.
> 
> I still don't get the attraction of the broker idea. The pros:
> + Overcomes the issue with stale UNIX domain socket files
> + Eliminates the re-connect sleep loop
> 
> Neutral:
> * vhost-user UNIX domain socket directory container volume is replaced
>   by broker UNIX domain socket bind mount
> * UNIX domain socket naming conflicts become broker key naming conflicts
> 
> The cons:
> - Requires running a new service on the host with potential security
>   issues
> - Requires support in third-party applications, QEMU, and DPDK/OVS
> - The old code must be kept for compatibility with non-broker
>   configurations, especially since third-party applications may not
>   support the broker. Developers and users will have to learn about both
>   options and decide which one to use.
> 
> This seems like a modest improvement for the complexity and effort
> involved. The same pros can be achieved by:
> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>   it doesn't catch a misconfiguration where the user launches two
>   processes with the same socket path.
> * Reversing the direction of the client/server relationship to
>   eliminate the re-connect sleep loop at startup. I'm unsure whether
>   this is possible.
> 
> That said, the broker idea doesn't affect the vhost-user protocol itself
> and is more of an OVS/DPDK topic. I may just not be familiar enough with
> OVS/DPDK to understand the benefits of the approach.
> 
> Stefan
>
Ilya Maximets March 23, 2021, 6:27 p.m. UTC | #17
On 3/23/21 6:57 PM, Adrian Moreno wrote:
> 
> 
> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>> system:
>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try again"
>>>>>>
>>>>>> QEMU avoids this by unlinking before binding. The drawback is that users
>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>> solved with a pidfile.
>>>>>
>>>>> How exactly could this be solved with a pidfile?
>>>>
>>>> A pidfile prevents two instances of the same service from running at the
>>>> same time.
>>>>
>>>> The same effect can be achieved by the container orchestrator, systemd,
>>>> etc too because it refuses to run the same service twice.
>>>
>>> Sure. I understand that.  My point was that these could be 2 different
>>> applications and they might not know which process to look for.
>>>
>>>>
>>>>> And what if this is
>>>>> a different application that tries to create a socket on the same
>>>>> path?  E.g. QEMU creates a socket (started in server mode) and the
>>>>> user accidentally created a dpdkvhostuser port in Open vSwitch
>>>>> instead of dpdkvhostuserclient.  This way the rte_vhost library
>>>>> will try to bind to an existing socket file and will fail.
>>>>> Subsequently, port creation in OVS will fail.  We can't allow OVS
>>>>> to unlink files because then OVS users would have the ability
>>>>> to unlink random sockets that OVS has access to, and we also have
>>>>> no idea whether it was QEMU that created the file, a virtio-user
>>>>> application, or someone else.
>>>>
>>>> If rte_vhost unlinks the socket then the user will find that networking
>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net device
>>>> or restart QEMU, depending on whether they need to keep the guest
>>>> running or not. This is a misconfiguration that is recoverable.
>>>
>>> True, it's recoverable, but at a high cost.  A restart of a VM is
>>> rarely desirable.  And the application inside the guest might not
>>> cope well with hot re-plug of a device that it was actively using.
>>> I'd expect a DPDK application that runs inside a guest on some
>>> virtio-net device to crash after this kind of manipulation,
>>> especially if it uses an older version of DPDK.
>>
>> This unlink issue is probably something we think differently about.
>> There are many ways for users to misconfigure things when working with
>> system tools. If it's possible to catch misconfigurations that is
>> preferrable. In this case it's just the way pathname AF_UNIX domain
>> sockets work and IMO it's better not to have problems starting the
>> service due to stale files than to insist on preventing
>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>> successful, so ¯\_(ツ)_/¯.
>>
>>>>
>>>> Regarding letting OVS unlink files, I agree that it shouldn't if
>>>> this creates a security issue. I don't know the security model of OVS.
>>>
>>> In general, the privileges of the ovs-vswitchd daemon might be
>>> completely different from the privileges required to invoke
>>> control utilities or to access the configuration database.
>>> So, yes, we should not allow that.
>>
>> That can be locked down by restricting the socket path to a file beneath
>> /var/run/ovs/vhost-user/.
>>
>>>>
>>>>> There are, probably, ways to detect if there is any live process
>>>>> that has this socket open, but that sounds like too much for this
>>>>> purpose, and I'm also not sure if it's possible when the actual
>>>>> user is in a different container.
>>>>> So I don't see a good reliable way to detect these conditions.
>>>>> This falls on the shoulders of higher-level management software
>>>>> or the user to clean these socket files up before adding ports.
>>>>
>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>> application? If yes, then abstract AF_UNIX sockets can be used. Abstract
>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>> disappears when there is no process listening anymore.
>>>
>>> OVS is usually started right on the host in the main network
>>> namespace.  In case it's started in a pod, it will run in a
>>> separate container but configured with host networking.
>>> Applications almost exclusively run in separate pods.
>>
>> Okay.
>>
>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>> leveraging an infrastructure service provided by a SocketPair Broker.
>>>>>>
>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>> closely managed by one piece of software:
>>>>>>
>>>>>> 1. Unlink the socket path.
>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>>> 4. In the meantime the VMM can open the socket path and call connect(2).
>>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>>    connection will proceed (there is no need for sleeping).
>>>>>>
>>>>>> This approach works across containers without a broker.
>>>>>
>>>>> Not sure if I fully understood the question here, but anyway.
>>>>>
>>>>> This approach works fine if you know what application to run.
>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>> with virtio-user ports running inside a container that wants to
>>>>> have a network connection.  Also, this application needs to run
>>>>> virtio-user in server mode, otherwise a restart of OVS will
>>>>> require a restart of the application.  So, you basically need to
>>>>> rely on a third-party application to create a socket with the
>>>>> right name in the correct location that is shared with the host,
>>>>> so OVS can find it and connect.
>>>>>
>>>>> In a VM world everything is much simpler, since you have
>>>>> libvirt and QEMU that take care of all of this stuff
>>>>> and are also under full control of the management software
>>>>> and the system administrator.
>>>>> In case of a container with a "random" DPDK application inside
>>>>> there is no such entity that can help.  Of course, some solution
>>>>> might be implemented in the docker/podman daemon to create and
>>>>> manage outward-facing sockets for an application inside the
>>>>> container, but that is not available today AFAIK and I'm not
>>>>> sure it ever will be.
>>>>
>>>> Wait, when you say there is no entity like management software or a
>>>> system administrator, then how does OVS know to instantiate the new
>>>> port? I guess something still needs to invoke ovs-vsctl add-port?
>>>
>>> I didn't mean that there is no application that configures
>>> everything.  Of course, there is.  I mean that there is no
>>> entity that abstracts all that socket machinery from the user's
>>> application that runs inside the container.  QEMU hides all the
>>> details of the connection to the vhost backend and presents the
>>> device as a PCI device with a network interface wrapping from
>>> the guest kernel.  So, the application inside the VM shouldn't
>>> care that there is actually a socket connected to OVS that
>>> implements the backend and forwards traffic somewhere.  For the
>>> application it's just a usual network interface.
>>> But in the container world, the application has to handle all
>>> that itself by creating a virtio-user device that connects to
>>> some socket that has OVS on the other side.
>>>
>>>>
>>>> Can you describe the steps used today (without the broker) for
>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>> Although my interest is in the vhost-user protocol I think it's
>>>> necessary to understand the OVS requirements here and I know little
>>>> about them.
>>>
>>> I might describe some things wrong since I last worked with k8s
>>> and CNI plugins ~1.5 years ago, but the basic schema will look
>>> something like this:
>>>
>>> 1. user decides to start a new pod and requests k8s to do that
>>>    via cmdline tools or some API calls.
>>>
>>> 2. k8s scheduler looks for available resources by asking resource
>>>    manager plugins, finds an appropriate physical host and asks
>>>    the kubelet daemon local to that node to launch a new pod there.
>>>
> 
> When the CNI is called, the pod has already been created, i.e. a
> PodID exists and so does an associated network namespace. Therefore,
> everything that has to do with the runtime spec, such as mount
> points or devices, can no longer be modified at this point.
> 
> That's why the Device Plugin API is used to modify the Pod's spec
> before the CNI chain is called.
> 
>>> 3. kubelet asks the local CNI plugin to allocate network resources
>>>    and annotate the pod with required mount points, devices that
>>>    need to be passed in, and environment variables.
>>>    (This is, IIRC, a gRPC connection.  It might be multus-cni
>>>    or kuryr-kubernetes or any other CNI plugin.  The CNI plugin is
>>>    usually deployed as a system DaemonSet, so it runs in a
>>>    separate pod.)
>>>
>>> 4. Assuming that the vhost-user connection is requested in server
>>>    mode, the CNI plugin will:
>>>    4.1 create a directory for a vhost-user socket.
>>>    4.2 add this directory to pod annotations as a mount point.
> 
> I believe this is not possible; the CNI plugin would have to
> inspect the pod's spec or otherwise determine an existing mount
> point where the socket should be created.

Uff.  Yes, you're right.  Thanks for your clarification.
I mixed up CNI and Device Plugin here.

The CNI itself is not able to annotate new resources to the pod,
i.e. create new mounts or something like this.  And I don't recall
any vhost-user device plugins.  Are there any?  There is an SR-IOV
device plugin, but its purpose is to allocate and pass PCI devices,
not to create mounts for vhost-user.

So, IIUC, right now the user must create the directory and specify
a mount point in the pod spec file or pass the whole
/var/run/openvswitch or something like this, right?

Looking at userspace-cni-network-plugin, it actually just parses
annotations to find the shared directory and fails if there is
none:
 https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122

And the examples suggest specifying a directory to mount:
 https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41

Looks like this is done by hand by the user.

> 
> +Billy might give more insights on this
> 
>>>    4.3 create a port in OVS by invoking 'ovs-vsctl add-port' or
>>>        by connecting to ovsdb-server via JSON-RPC directly.
>>>        It will set the port type to dpdkvhostuserclient and specify
>>>        socket-path as a path inside the directory it created.
>>>        (OVS will create the port and rte_vhost will enter the
>>>         re-connection loop since the socket does not exist yet.)
>>>    4.4 set up the socket file location as an environment variable
>>>        in the pod annotations.
>>>    4.5 report success to kubelet.
>>>
> 
> Since the CNI cannot modify the pod's mounts it has to rely on a Device Plugin
> or other external entity that can inject the mount point before the pod is created.
> 
> However, there is another use case that might be relevant: dynamic
> attachment of network interfaces. In this case the CNI cannot work
> in collaboration with a Device Plugin or "mount-point injector",
> and an existing mount point has to be used. Also, some form of
> notification mechanism has to exist to tell the workload that a
> new socket is ready.
> 
>>> 5. kubelet will finish all other preparations and resource
>>>    allocations and will ask docker/podman to start a container
>>>    with all mount points, devices and environment variables from
>>>    the pod annotation.
>>>
>>> 6. docker/podman starts the container.
>>>    It's worth mentioning here that in many cases the initial
>>>    process of a container is not the actual application that will
>>>    use the vhost-user connection, but likely a shell that will
>>>    invoke the actual application.
>>>
>>> 7. The application starts inside the container, checks the
>>>    environment variables (actually, checking of environment
>>>    variables usually happens in a shell script that invokes the
>>>    application with the correct arguments) and creates a
>>>    net_virtio_user port in server mode.  At this point the socket
>>>    file will be created.
>>>    (Since we're running a third-party application inside the
>>>     container we can only assume that it will do what is written
>>>     here; it's the responsibility of the application developer
>>>     to do the right thing.)
>>>
>>> 8. OVS successfully re-connects to the newly created socket in a
>>>    shared directory and vhost-user protocol establishes the network
>>>    connectivity.
>>>
>>> As you can see, there are way too many entities and communication
>>> methods involved.  So, passing a pre-opened file descriptor from
>>> CNI all the way down to the application is not as easy as it is in
>>> the case of QEMU+libvirt.
>>
>> File descriptor passing isn't necessary if OVS owns the listen socket
>> and the application container is the one who connects. That's why I
>> asked why dpdkvhostuser was deprecated in another email. The benefit of
>> doing this would be that the application container can instantly connect
>> to OVS without a sleep loop.
>>
>> I still don't get the attraction of the broker idea. The pros:
>> + Overcomes the issue with stale UNIX domain socket files
>> + Eliminates the re-connect sleep loop
>>
>> Neutral:
>> * vhost-user UNIX domain socket directory container volume is replaced
>>   by broker UNIX domain socket bind mount
>> * UNIX domain socket naming conflicts become broker key naming conflicts
>>
>> The cons:
>> - Requires running a new service on the host with potential security
>>   issues
>> - Requires support in third-party applications, QEMU, and DPDK/OVS
>> - The old code must be kept for compatibility with non-broker
>>   configurations, especially since third-party applications may not
>>   support the broker. Developers and users will have to learn about both
>>   options and decide which one to use.
>>
>> This seems like a modest improvement for the complexity and effort
>> involved. The same pros can be achieved by:
>> * Adding unlink(2) to rte_vhost (or applications can add rm -f
>>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
>>   it doesn't catch a misconfiguration where the user launches two
>>   processes with the same socket path.
>> * Reversing the direction of the client/server relationship to
>>   eliminate the re-connect sleep loop at startup. I'm unsure whether
>>   this is possible.
>>
>> That said, the broker idea doesn't affect the vhost-user protocol itself
>> and is more of an OVS/DPDK topic. I may just not be familiar enough with
>> OVS/DPDK to understand the benefits of the approach.
>>
>> Stefan
>>
>
Billy McFall March 23, 2021, 8:54 p.m. UTC | #18
On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >
> >
> > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>> system:
> >>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> again"
> >>>>>>
> >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> users
> >>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>> solved with a pidfile.
> >>>>>
> >>>>> How exactly could this be solved with a pidfile?
> >>>>
> >>>> A pidfile prevents two instances of the same service from running at
> the
> >>>> same time.
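For illustration, the pidfile idea combined with QEMU-style unlink-before-bind could be sketched as below (a hedged sketch, not QEMU's or rte_vhost's actual code; the paths are made up). Holding a non-blocking `flock()` on the pidfile ensures only one live instance may remove and re-bind the socket, so a stale file can be cleaned up without hijacking a live listener:

```python
# Sketch: unlink-before-bind guarded by a pidfile lock.
# Paths are illustrative only.
import fcntl
import os
import socket

def listen_with_pidfile(pidfile, sock_path):
    pfd = os.open(pidfile, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(pfd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(pfd)  # another live instance owns the socket
        raise RuntimeError("another instance owns %s" % sock_path)
    os.write(pfd, str(os.getpid()).encode())
    # We hold the lock, so any leftover socket file must be stale.
    try:
        os.unlink(sock_path)
    except FileNotFoundError:
        pass
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)
    return srv  # the pidfile fd is deliberately kept open for life

srv = listen_with_pidfile("/tmp/demo-vhu.pid", "/tmp/demo-vhu.sock")
print("listening on", srv.getsockname())
```

A second instance started while the first is alive fails at the `flock()` step instead of silently stealing the socket.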
> >>>>
> >>>> The same effect can be achieved by the container orchestrator,
> systemd,
> >>>> etc too because it refuses to run the same service twice.
> >>>
> >>> Sure. I understand that.  My point was that these could be 2 different
> >>> applications and they might not know which process to look for.
> >>>
> >>>>
> >>>>> And what if this is
> >>>>> a different application that tries to create a socket on a same path?
> >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >>>>> to an existing socket file and will fail.  Subsequently port creation
> >>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> >>>>> way OVS users will have the ability to unlink random sockets that OVS has
> >>>>> access to, and we also have no idea whether it was QEMU that created the file
> >>>>> or a virtio-user application or someone else.
> >>>>
> >>>> If rte_vhost unlinks the socket then the user will find that
> networking
> >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> device
> >>>> or restart QEMU, depending on whether they need to keep the guest
> >>>> running or not. This is a misconfiguration that is recoverable.
> >>>
> >>> True, it's recoverable, but with a high cost.  Restart of a VM is
> rarely
> >>> desirable.  And the application inside the guest might not cope
> >>> well after a hot re-plug of a device that it actively used.  I'd expect
> >>> a DPDK application that runs inside a guest on some virtio-net device
> >>> to crash after this kind of manipulations.  Especially, if it uses some
> >>> older versions of DPDK.
> >>
> >> This unlink issue is probably something we think differently about.
> >> There are many ways for users to misconfigure things when working with
> >> system tools. If it's possible to catch misconfigurations that is
> >> preferable. In this case it's just the way pathname AF_UNIX domain
> >> sockets work and IMO it's better not to have problems starting the
> >> service due to stale files than to insist on preventing
> >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >> successful, so ¯\_(ツ)_/¯.
> >>
> >>>>
> >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>> creates a security issue. I don't know the security model of OVS.
> >>>
> >>> In general privileges of a ovs-vswitchd daemon might be completely
> >>> different from privileges required to invoke control utilities or
> >>> to access the configuration database.  So, yes, we should not allow
> >>> that.
> >>
> >> That can be locked down by restricting the socket path to a file beneath
> >> /var/run/ovs/vhost-user/.
> >>
> >>>>
> >>>>> There are, probably, ways to detect if there is any alive process
> that
> >>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>> container.
> >>>>> So I don't see a good reliable way to detect these conditions.  This
> >>>>> falls on shoulders of a higher level management software or a user to
> >>>>> clean these socket files up before adding ports.
> >>>>
> >>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>> application? If yes, then abstract AF_UNIX sockets can be used.
> Abstract
> >>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>> disappears when there is no process listening anymore.
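To illustrate the point about abstract AF_UNIX sockets, here is a minimal Linux-only sketch (the socket name is invented for the example): a leading NUL byte puts the name in the abstract namespace, so no file is ever created and the address disappears with its listener, eliminating stale-file cleanup entirely:

```python
# Abstract AF_UNIX socket demo: no filesystem entry, name vanishes
# when the listener exits. The name below is an example only.
import socket

ADDR = "\0demo-vhost-user-0"   # shown as "@demo-vhost-user-0" by ss(8)

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(ADDR)                 # creates no file anywhere
srv.listen(1)

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(ADDR)
conn, _ = srv.accept()
conn.sendall(b"hello")
msg = cli.recv(16)
print(msg)
```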
> >>>
> >>> OVS is usually started right on the host in a main network namespace.
> >>> In case it's started in a pod, it will run in a separate container but
> >>> configured with a host network.  Applications almost exclusively run
> >>> in separate pods.
> >>
> >> Okay.
> >>
> >>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>> leveraging an infrastructure service provided by a SocketPair
> Broker.
> >>>>>>
> >>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>> closely managed by one piece of software:
> >>>>>>
> >>>>>> 1. Unlink the socket path.
> >>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>> 4. In the meantime the VMM can open the socket path and call
> connect(2).
> >>>>>>    As soon as the vhost-user device backend calls accept(2) the
> >>>>>>    connection will proceed (there is no need for sleeping).
> >>>>>>
> >>>>>> This approach works across containers without a broker.
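The four steps above might look roughly like this in practice (a sketch under stated assumptions: a `fork()` stands in for spawning the backend and handing it the listen fd over an RPC, and the paths are illustrative). The key property is that the kernel queues the VMM's `connect()` as soon as `listen()` has been called, so no reconnect/sleep loop is needed:

```python
# Sketch of the manager-owns-the-listen-socket flow described above.
import os
import socket

PATH = "/tmp/demo-vhost.sock"

# 1-2. Manager: remove any stale socket file, then create/bind/listen.
try:
    os.unlink(PATH)
except FileNotFoundError:
    pass
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(PATH)
listener.listen(1)

# 3. The device backend inherits the listen fd (fork() here stands in
#    for passing the fd to DPDK/SPDK over its RPC).
if os.fork() == 0:
    conn, _ = listener.accept()      # backend accepts whenever ready
    conn.sendall(b"backend ready")
    conn.close()
    os._exit(0)

# 4. The VMM can connect immediately; the connection waits in the
#    kernel's accept queue until the backend picks it up.
vmm = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
vmm.connect(PATH)
msg = vmm.recv(32)
os.wait()
print(msg)
```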
> >>>>>
> >>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>
> >>>>> This approach works fine if you know what application to run.
> >>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>> with virtio-user ports running inside a container that wants to
> >>>>> have a network connection.  Also, this application needs to run
> >>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>> require restart of the application.  So, you basically need to
> >>>>> rely on a third-party application to create a socket with a right
> >>>>> name and in a correct location that is shared with a host, so
> >>>>> OVS can find it and connect.
> >>>>>
> >>>>> In a VM world everything is much more simple, since you have
> >>>>> a libvirt and QEMU that will take care of all of this stuff
> >>>>> and which are also under full control of management software
> >>>>> and a system administrator.
> >>>>> In case of a container with a "random" DPDK application inside
> >>>>> there is no such entity that can help.  Of course, some solution
> >>>>> might be implemented in docker/podman daemon to create and manage
> >>>>> outside-looking sockets for an application inside the container,
> >>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>> ever will.
> >>>>
> >>>> Wait, when you say there is no entity like management software or a
> >>>> system administrator, then how does OVS know to instantiate the new
> >>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>
> >>> I didn't mean that there is no application that configures
> >>> everything.  Of course, there is.  I mean that there is no such
> >>> entity that abstracts all that socket machinery from the user's
> >>> application that runs inside the container.  QEMU hides all the
> >>> details of the connection to vhost backend and presents the device
> >>> as a PCI device with a network interface wrapping from the guest
> >>> kernel.  So, the application inside the VM shouldn't care that there
> >>> is actually a socket connected to OVS that implements the backend and
> >>> forwards traffic somewhere.  For the application it's just a usual
> >>> network interface.
> >>> But in case of a container world, application should handle all
> >>> that by creating a virtio-user device that will connect to some
> >>> socket, that has an OVS on the other side.
> >>>
> >>>>
> >>>> Can you describe the steps used today (without the broker) for
> >>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>> Although my interest is in the vhost-user protocol I think it's
> >>>> necessary to understand the OVS requirements here and I know little
> >>>> about them.
> >>> I might describe some things wrong since I worked with k8s and CNI
> >>> plugins last time ~1.5 years ago, but the basic schema will look
> >>> something like this:
> >>>
> >>> 1. user decides to start a new pod and requests k8s to do that
> >>>    via cmdline tools or some API calls.
> >>>
> >>> 2. k8s scheduler looks for available resources asking resource
> >>>    manager plugins, finds an appropriate physical host and asks
> >>>    local to that node kubelet daemon to launch a new pod there.
> >>>
> >
> > When the CNI is called, the pod has already been created, i.e: a PodID
> exists
> > and so does an associated network namespace. Therefore, everything that
> has to
> > do with the runtime spec such as mountpoints or devices cannot be
> modified by
> > this time.
> >
> > That's why the Device Plugin API is used to modify the Pod's spec before
> the CNI
> > chain is called.
> >
> >>> 3. kubelet asks local CNI plugin to allocate network resources
> >>>    and annotate the pod with required mount points, devices that
> >>>    needs to be passed in and environment variables.
> >>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
> >>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
> >>>    usually deployed as a system DaemonSet, so it runs in a
> >>>    separate pod.
> >>>
> >>> 4. Assuming that the vhost-user connection is requested in server mode.
> >>>    CNI plugin will:
> >>>    4.1 create a directory for a vhost-user socket.
> >>>    4.2 add this directory to pod annotations as a mount point.
> >
> > I believe this is not possible, it would have to inspect the pod's spec
> or
> > otherwise determine an existing mount point where the socket should be
> created.
>
> Uff.  Yes, you're right.  Thanks for your clarification.
> I mixed up CNI and Device Plugin here.
>
> CNI itself is not able to annotate new resources to the pod, i.e.
> create new mounts or something like this.   And I don't recall any
> vhost-user device plugins.  Is there any?  There is an SR-IOV device
> plugin, but its purpose is to allocate and pass PCI devices, not create
> mounts for vhost-user.
>
> So, IIUC, right now user must create the directory and specify
> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> or something like this, right?
>
> Looking at userspace-cni-network-plugin, it actually just parses
> annotations to find the shared directory and fails if there is
> none:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>
> And the examples suggest specifying a directory to mount:
>
> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>
> Looks like this is done by hand by the user.
>
Yes, I am one of the primary authors of Userspace CNI. Currently, the
directory is created by hand. The long-term thought was to have a mutating
webhook/admission controller inject a directory into the pod spec.  Not sure
if it has changed, but I think when I was originally doing this work, OVS
only lets you choose the directory at install time, so it has to be
something like /var/run/openvswitch/. You can choose the socket file name
and maybe a subdirectory off the main directory, but not the full path.

One of the issues I was trying to solve was making sure ContainerA couldn't
see ContainerB's socket files. That's where the admission controller could
create a unique subdirectory for each container under
/var/run/openvswitch/. But this was more of a PoC CNI, and other work items
always took precedence, so that work was never completed.
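That per-container isolation could be sketched as plain directory handling (a hedged sketch: the admission-controller side is hypothetical, and the run directory below is a stand-in for /var/run/openvswitch):

```python
# Sketch: give each container a private, owner-only socket directory
# so ContainerA cannot see ContainerB's socket files. In the real
# design a mutating admission controller would inject this path into
# the pod spec; here it is plain local directory handling.
import os
import secrets
import stat

OVS_RUN = "/tmp/demo-openvswitch"   # stand-in for /var/run/openvswitch

def make_container_sock_dir(container_id):
    # Unique, non-guessable subdirectory per container.
    sub = os.path.join(OVS_RUN,
                       "%s-%s" % (container_id, secrets.token_hex(4)))
    os.makedirs(sub, exist_ok=False)
    os.chmod(sub, 0o700)            # only the owner may even list it
    return sub

a = make_container_sock_dir("containerA")
b = make_container_sock_dir("containerB")
print(stat.S_IMODE(os.stat(a).st_mode) == 0o700 and a != b)
```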

Billy

>
> > +Billy might give more insights on this
> >
> >>>    4.3 create a port in OVS by invoking 'ovs-vsctl port-add' or
> >>>        by connecting to ovsdb-server by JSONRPC directly.
> >>>        It will set port type as dpdkvhostuserclient and specify
> >>>        socket-path as a path inside the directory it created.
> >>>        (OVS will create a port and rte_vhost will enter the
> >>>         re-connection loop since socket does not exist yet.)
> >>>    4.4 Set up socket file location as environment variable in
> >>>        pod annotations.
> >>>    4.5 report success to kubelet.
> >>>
> >
> > Since the CNI cannot modify the pod's mounts it has to rely on a Device
> Plugin
> > or other external entity that can inject the mount point before the pod
> is created.
> >
> > However, there is another usecase that might be relevant: dynamic
> attachment of
> > network interfaces. In this case the CNI cannot work in collaboration
> with a
> > Device Plugin or "mount-point injector" and an existing mount point has
> to be used.
> > Also, some form of notification mechanism has to exist to tell the
> workload a
> > new socket is ready.
> >
> >>> 5. kubelet will finish all other preparations and resource
> >>>    allocations and will ask docker/podman to start a container
> >>>    with all mount points, devices and environment variables from
> >>>    the pod annotation.
> >>>
> >>> 6. docker/podman starts a container.
> >>>    Need to mention here that in many cases initial process of
> >>>    a container is not the actual application that will use a
> >>>    vhost-user connection, but likely a shell that will invoke
> >>>    the actual application.
> >>>
> >>> 7. Application starts inside the container, checks the environment
> >>>    variables (actually, checking of environment variables usually
> >>>    happens in a shell script that invokes the application with
> >>>    correct arguments) and creates a net_virtio_user port in server
> >>>    mode.  At this point socket file will be created.
> >>>    (since we're running third-party application inside the container
> >>>     we can only assume that it will do what is written here, it's
> >>>     a responsibility of an application developer to do the right
> >>>     thing.)
> >>>
> >>> 8. OVS successfully re-connects to the newly created socket in a
> >>>    shared directory and vhost-user protocol establishes the network
> >>>    connectivity.
> >>>
> >>> As you can see, there are way too many entities and communication
> >>> methods involved.  So, passing a pre-opened file descriptor from
> >>> CNI all the way down to the application is not as easy as it is in
> >>> the case of QEMU+libvirt.
> >>
> >> File descriptor passing isn't necessary if OVS owns the listen socket
> >> and the application container is the one who connects. That's why I
> >> asked why dpdkvhostuser was deprecated in another email. The benefit of
> >> doing this would be that the application container can instantly connect
> >> to OVS without a sleep loop.
> >>
> >> I still don't get the attraction of the broker idea. The pros:
> >> + Overcomes the issue with stale UNIX domain socket files
> >> + Eliminates the re-connect sleep loop
> >>
> >> Neutral:
> >> * vhost-user UNIX domain socket directory container volume is replaced
> >>   by broker UNIX domain socket bind mount
> >> * UNIX domain socket naming conflicts become broker key naming conflicts
> >>
> >> The cons:
> >> - Requires running a new service on the host with potential security
> >>   issues
> >> - Requires support in third-party applications, QEMU, and DPDK/OVS
> >> - The old code must be kept for compatibility with non-broker
> >>   configurations, especially since third-party applications may not
> >>   support the broker. Developers and users will have to learn about both
> >>   options and decide which one to use.
> >>
> >> This seems like a modest improvement for the complexity and effort
> >> involved. The same pros can be achieved by:
> >> * Adding unlink(2) to rte_vhost (or applications can add rm -f
> >>   $PATH_TO_SOCKET to their docker-entrypoint.sh). The disadvantage is
> >>   it doesn't catch a misconfiguration where the user launches two
> >>   processes with the same socket path.
> >> * Reversing the direction of the client/server relationship to
> >>   eliminate the re-connect sleep loop at startup. I'm unsure whether
> >>   this is possible.
> >>
> >> That said, the broker idea doesn't affect the vhost-user protocol itself
> >> and is more of an OVS/DPDK topic. I may just not be familiar enough with
> >> OVS/DPDK to understand the benefits of the approach.
> >>
> >> Stefan
> >>
> >
>
>
Stefan Hajnoczi March 24, 2021, 12:05 p.m. UTC | #19
On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> 
> > On 3/23/21 6:57 PM, Adrian Moreno wrote:
> > >
> > >
> > > On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> > >> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> > >>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> > >>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> > >>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> > >>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> > >>>>>>> And some housekeeping usually required for applications in case the
> > >>>>>>> socket server terminated abnormally and socket files left on a file
> > >>>>>>> system:
> > >>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> > again"
> > >>>>>>
> > >>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> > users
> > >>>>>> might accidentally hijack an existing listen socket, but that can be
> > >>>>>> solved with a pidfile.
> > >>>>>
> > >>>>> How exactly could this be solved with a pidfile?
> > >>>>
> > >>>> A pidfile prevents two instances of the same service from running at
> > the
> > >>>> same time.
> > >>>>
> > >>>> The same effect can be achieved by the container orchestrator,
> > systemd,
> > >>>> etc too because it refuses to run the same service twice.
> > >>>
> > >>> Sure. I understand that.  My point was that these could be 2 different
> > >>> applications and they might not know which process to look for.
> > >>>
> > >>>>
> > >>>>> And what if this is
> > >>>>> a different application that tries to create a socket on a same path?
> > >>>>> e.g. QEMU creates a socket (started in a server mode) and user
> > >>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> > >>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> > >>>>> to an existing socket file and will fail.  Subsequently port creation
> > >>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> > >>>>> way OVS users will have the ability to unlink random sockets that OVS has
> > >>>>> access to, and we also have no idea whether it was QEMU that created the file
> > >>>>> or a virtio-user application or someone else.
> > >>>>
> > >>>> If rte_vhost unlinks the socket then the user will find that
> > networking
> > >>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> > device
> > >>>> or restart QEMU, depending on whether they need to keep the guest
> > >>>> running or not. This is a misconfiguration that is recoverable.
> > >>>
> > >>> True, it's recoverable, but with a high cost.  Restart of a VM is
> > rarely
> > >>> desirable.  And the application inside the guest might not cope
> > >>> well after a hot re-plug of a device that it actively used.  I'd expect
> > >>> a DPDK application that runs inside a guest on some virtio-net device
> > >>> to crash after this kind of manipulations.  Especially, if it uses some
> > >>> older versions of DPDK.
> > >>
> > >> This unlink issue is probably something we think differently about.
> > >> There are many ways for users to misconfigure things when working with
> > >> system tools. If it's possible to catch misconfigurations that is
> > >> preferable. In this case it's just the way pathname AF_UNIX domain
> > >> sockets work and IMO it's better not to have problems starting the
> > >> service due to stale files than to insist on preventing
> > >> misconfigurations. QEMU and DPDK do this differently and both seem to be
> > >> successful, so ¯\_(ツ)_/¯.
> > >>
> > >>>>
> > >>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> > >>>> creates a security issue. I don't know the security model of OVS.
> > >>>
> > >>> In general privileges of a ovs-vswitchd daemon might be completely
> > >>> different from privileges required to invoke control utilities or
> > >>> to access the configuration database.  So, yes, we should not allow
> > >>> that.
> > >>
> > >> That can be locked down by restricting the socket path to a file beneath
> > >> /var/run/ovs/vhost-user/.
> > >>
> > >>>>
> > >>>>> There are, probably, ways to detect if there is any alive process
> > that
> > >>>>> has this socket open, but that sounds like too much for this purpose,
> > >>>>> also I'm not sure if it's possible if actual user is in a different
> > >>>>> container.
> > >>>>> So I don't see a good reliable way to detect these conditions.  This
> > >>>>> falls on shoulders of a higher level management software or a user to
> > >>>>> clean these socket files up before adding ports.
> > >>>>
> > >>>> Does OVS always run in the same net namespace (pod) as the DPDK
> > >>>> application? If yes, then abstract AF_UNIX sockets can be used.
> > Abstract
> > >>>> AF_UNIX sockets don't have a filesystem path and the socket address
> > >>>> disappears when there is no process listening anymore.
> > >>>
> > >>> OVS is usually started right on the host in a main network namespace.
> > >>> In case it's started in a pod, it will run in a separate container but
> > >>> configured with a host network.  Applications almost exclusively run
> > >>> in separate pods.
> > >>
> > >> Okay.
> > >>
> > >>>>>>> This patch-set aims to eliminate most of the inconveniences by
> > >>>>>>> leveraging an infrastructure service provided by a SocketPair
> > Broker.
> > >>>>>>
> > >>>>>> I don't understand yet why this is useful for vhost-user, where the
> > >>>>>> creation of the vhost-user device backend and its use by a VMM are
> > >>>>>> closely managed by one piece of software:
> > >>>>>>
> > >>>>>> 1. Unlink the socket path.
> > >>>>>> 2. Create, bind, and listen on the socket path.
> > >>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> > >>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
> > >>>>>> 4. In the meantime the VMM can open the socket path and call
> > connect(2).
> > >>>>>>    As soon as the vhost-user device backend calls accept(2) the
> > >>>>>>    connection will proceed (there is no need for sleeping).
> > >>>>>>
> > >>>>>> This approach works across containers without a broker.
> > >>>>>
> > >>>>> Not sure if I fully understood a question here, but anyway.
> > >>>>>
> > >>>>> This approach works fine if you know what application to run.
> > >>>>> In case of a k8s cluster, it might be a random DPDK application
> > >>>>> with virtio-user ports running inside a container that wants to
> > >>>>> have a network connection.  Also, this application needs to run
> > >>>>> virtio-user in server mode, otherwise restart of the OVS will
> > >>>>> require restart of the application.  So, you basically need to
> > >>>>> rely on a third-party application to create a socket with a right
> > >>>>> name and in a correct location that is shared with a host, so
> > >>>>> OVS can find it and connect.
> > >>>>>
> > >>>>> In a VM world everything is much more simple, since you have
> > >>>>> a libvirt and QEMU that will take care of all of this stuff
> > >>>>> and which are also under full control of management software
> > >>>>> and a system administrator.
> > >>>>> In case of a container with a "random" DPDK application inside
> > >>>>> there is no such entity that can help.  Of course, some solution
> > >>>>> might be implemented in docker/podman daemon to create and manage
> > >>>>> outside-looking sockets for an application inside the container,
> > >>>>> but that is not available today AFAIK and I'm not sure if it
> > >>>>> ever will.
> > >>>>
> > >>>> Wait, when you say there is no entity like management software or a
> > >>>> system administrator, then how does OVS know to instantiate the new
> > >>>> port? I guess something still needs to invoke ovs-ctl add-port?
> > >>>
> > >>> I didn't mean that there is no application that configures
> > >>> everything.  Of course, there is.  I mean that there is no such
> > >>> entity that abstracts all that socket machinery from the user's
> > >>> application that runs inside the container.  QEMU hides all the
> > >>> details of the connection to vhost backend and presents the device
> > >>> as a PCI device with a network interface wrapping from the guest
> > >>> kernel.  So, the application inside the VM shouldn't care that there
> > >>> is actually a socket connected to OVS that implements the backend and
> > >>> forwards traffic somewhere.  For the application it's just a usual
> > >>> network interface.
> > >>> But in case of a container world, application should handle all
> > >>> that by creating a virtio-user device that will connect to some
> > >>> socket, that has an OVS on the other side.
> > >>>
> > >>>>
> > >>>> Can you describe the steps used today (without the broker) for
> > >>>> instantiating a new DPDK app container and connecting it to OVS?
> > >>>> Although my interest is in the vhost-user protocol I think it's
> > >>>> necessary to understand the OVS requirements here and I know little
> > >>>> about them.
> > >>> I might describe some things wrong since I worked with k8s and CNI
> > >>> plugins last time ~1.5 years ago, but the basic schema will look
> > >>> something like this:
> > >>>
> > >>> 1. user decides to start a new pod and requests k8s to do that
> > >>>    via cmdline tools or some API calls.
> > >>>
> > >>> 2. k8s scheduler looks for available resources asking resource
> > >>>    manager plugins, finds an appropriate physical host and asks
> > >>>    local to that node kubelet daemon to launch a new pod there.
> > >>>
> > >
> > > When the CNI is called, the pod has already been created, i.e: a PodID
> > exists
> > > and so does an associated network namespace. Therefore, everything that
> > has to
> > > do with the runtime spec such as mountpoints or devices cannot be
> > modified by
> > > this time.
> > >
> > > That's why the Device Plugin API is used to modify the Pod's spec before
> > the CNI
> > > chain is called.
> > >
> > >>> 3. kubelet asks local CNI plugin to allocate network resources
> > >>>    and annotate the pod with required mount points, devices that
> > >>>    needs to be passed in and environment variables.
> > >>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
> > >>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
> > >>>    usually deployed as a system DaemonSet, so it runs in a
> > >>>    separate pod.
> > >>>
> > >>> 4. Assuming that the vhost-user connection is requested in server mode.
> > >>>    CNI plugin will:
> > >>>    4.1 create a directory for a vhost-user socket.
> > >>>    4.2 add this directory to pod annotations as a mount point.
> > >
> > > I believe this is not possible, it would have to inspect the pod's spec
> > or
> > > otherwise determine an existing mount point where the socket should be
> > created.
> >
> > Uff.  Yes, you're right.  Thanks for your clarification.
> > I mixed up CNI and Device Plugin here.
> >
> > CNI itself is not able to annotate new resources to the pod, i.e.
> > create new mounts or something like this.   And I don't recall any
> > vhost-user device plugins.  Is there any?  There is an SR-IOV device
> > plugin, but its purpose is to allocate and pass PCI devices, not create
> > mounts for vhost-user.
> >
> > So, IIUC, right now user must create the directory and specify
> > a mount point in a pod spec file or pass the whole /var/run/openvswitch
> > or something like this, right?
> >
> > Looking at userspace-cni-network-plugin, it actually just parses
> > annotations to find the shared directory and fails if there is
> > none:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >
> > And the examples suggest specifying a directory to mount:
> >
> > https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >
> > Looks like this is done by hand by the user.
> >
> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> directory is created by hand. The long-term thought was to have a mutating
> webhook/admission controller inject a directory into the pod spec.  Not sure
> if it has changed, but I think when I was originally doing this work, OVS
> only lets you choose the directory at install time, so it has to be
> something like /var/run/openvswitch/. You can choose the socket file name
> and maybe a subdirectory off the main directory, but not the full path.
> 
> One of the issues I was trying to solve was making sure ContainerA couldn't
> see ContainerB's socketfiles. That's where the admission controller could
> create a unique subdirectory for each container under
> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> always took precedence so that work never completed.

If the CNI plugin has access to the container's network namespace, could
it create an abstract AF_UNIX listen socket?

That way the application inside the container could connect to an
AF_UNIX socket and there is no need to manage container volumes.

I'm not familiar with Open vSwitch, so I'm not sure whether there is a
sane way of passing the listen socket fd into ovs-vswitchd from the CNI
plugin.

The steps:
1. CNI plugin enters container's network namespace and opens an abstract
   AF_UNIX listen socket.
2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
   add-port step. Instead of using type=dpdkvhostuserclient
   options:vhost-server-path=/tmp/dpdkvhostclient0, it creates a
   dpdkvhostuser server with the listen fd.
3. When the container starts, it connects to the abstract AF_UNIX
   socket. The abstract socket name is provided to the container at
   startup time in an environment variable. The name is unique, at least
   to the pod, so that multiple containers in the pod can run vhost-user
   applications.

Stefan
Ilya Maximets March 24, 2021, 1:11 p.m. UTC | #20
On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>
>>>>
>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>>>>>>>> And some housekeeping usually required for applications in case the
>>>>>>>>>> socket server terminated abnormally and socket files left on a file
>>>>>>>>>> system:
>>>>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
>>> again"
>>>>>>>>>
>>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
>>> users
>>>>>>>>> might accidentally hijack an existing listen socket, but that can be
>>>>>>>>> solved with a pidfile.
>>>>>>>>
>>>>>>>> How exactly this could be solved with a pidfile?
>>>>>>>
>>>>>>> A pidfile prevents two instances of the same service from running at
>>> the
>>>>>>> same time.
>>>>>>>
>>>>>>> The same effect can be achieved by the container orchestrator,
>>> systemd,
>>>>>>> etc too because it refuses to run the same service twice.
>>>>>>
>>>>>> Sure. I understand that.  My point was that these could be 2 different
>>>>>> applications and they might not know which process to look for.
>>>>>>
>>>>>>>
>>>>>>>> And what if this is
>>>>>>>> a different application that tries to create a socket on a same path?
>>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
>>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
>>>>>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
>>>>>>>> to an existing socket file and will fail.  Subsequently port creation
>>>>>>>> in OVS will fail.   We can't allow OVS to unlink files because this
>>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
>>>>>>>> access to and we also has no idea if it's a QEMU that created a file
>>>>>>>> or it was a virtio-user application or someone else.
>>>>>>>
>>>>>>> If rte_vhost unlinks the socket then the user will find that
>>> networking
>>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
>>> device
>>>>>>> or restart QEMU, depending on whether they need to keep the guest
>>>>>>> running or not. This is a misconfiguration that is recoverable.
>>>>>>
>>>>>> True, it's recoverable, but with a high cost.  Restart of a VM is
>>> rarely
>>>>>> desirable.  And the application inside the guest might not feel itself
>>>>>> well after hot re-plug of a device that it actively used.  I'd expect
>>>>>> a DPDK application that runs inside a guest on some virtio-net device
>>>>>> to crash after this kind of manipulations.  Especially, if it uses some
>>>>>> older versions of DPDK.
>>>>>
>>>>> This unlink issue is probably something we think differently about.
>>>>> There are many ways for users to misconfigure things when working with
>>>>> system tools. If it's possible to catch misconfigurations that is
>>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
>>>>> sockets work and IMO it's better not to have problems starting the
>>>>> service due to stale files than to insist on preventing
>>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
>>>>> successful, so ¯\_(ツ)_/¯.
>>>>>
>>>>>>>
>>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
>>>>>>> create a security issue. I don't know the security model of OVS.
>>>>>>
>>>>>> In general privileges of a ovs-vswitchd daemon might be completely
>>>>>> different from privileges required to invoke control utilities or
>>>>>> to access the configuration database.  SO, yes, we should not allow
>>>>>> that.
>>>>>
>>>>> That can be locked down by restricting the socket path to a file beneath
>>>>> /var/run/ovs/vhost-user/.
>>>>>
>>>>>>>
>>>>>>>> There are, probably, ways to detect if there is any alive process
>>> that
>>>>>>>> has this socket open, but that sounds like too much for this purpose,
>>>>>>>> also I'm not sure if it's possible if actual user is in a different
>>>>>>>> container.
>>>>>>>> So I don't see a good reliable way to detect these conditions.  This
>>>>>>>> falls on shoulders of a higher level management software or a user to
>>>>>>>> clean these socket files up before adding ports.
>>>>>>>
>>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
>>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
>>> Abstract
>>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
>>>>>>> disappears when there is no process listening anymore.
>>>>>>
>>>>>> OVS is usually started right on the host in a main network namespace.
>>>>>> In case it's started in a pod, it will run in a separate container but
>>>>>> configured with a host network.  Applications almost exclusively runs
>>>>>> in separate pods.
>>>>>
>>>>> Okay.
>>>>>
>>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
>>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
>>> Broker.
>>>>>>>>>
>>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
>>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
>>>>>>>>> closely managed by one piece of software:
>>>>>>>>>
>>>>>>>>> 1. Unlink the socket path.
>>>>>>>>> 2. Create, bind, and listen on the socket path.
>>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
>>>>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
>>>>>>>>> 4. In the meantime the VMM can open the socket path and call
>>> connect(2).
>>>>>>>>>    As soon as the vhost-user device backend calls accept(2) the
>>>>>>>>>    connection will proceed (there is no need for sleeping).
>>>>>>>>>
>>>>>>>>> This approach works across containers without a broker.
>>>>>>>>
>>>>>>>> Not sure if I fully understood a question here, but anyway.
>>>>>>>>
>>>>>>>> This approach works fine if you know what application to run.
>>>>>>>> In case of a k8s cluster, it might be a random DPDK application
>>>>>>>> with virtio-user ports running inside a container and want to
>>>>>>>> have a network connection.  Also, this application needs to run
>>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
>>>>>>>> require restart of the application.  So, you basically need to
>>>>>>>> rely on a third-party application to create a socket with a right
>>>>>>>> name and in a correct location that is shared with a host, so
>>>>>>>> OVS can find it and connect.
>>>>>>>>
>>>>>>>> In a VM world everything is much more simple, since you have
>>>>>>>> a libvirt and QEMU that will take care of all of these stuff
>>>>>>>> and which are also under full control of management software
>>>>>>>> and a system administrator.
>>>>>>>> In case of a container with a "random" DPDK application inside
>>>>>>>> there is no such entity that can help.  Of course, some solution
>>>>>>>> might be implemented in docker/podman daemon to create and manage
>>>>>>>> outside-looking sockets for an application inside the container,
>>>>>>>> but that is not available today AFAIK and I'm not sure if it
>>>>>>>> ever will.
>>>>>>>
>>>>>>> Wait, when you say there is no entity like management software or a
>>>>>>> system administrator, then how does OVS know to instantiate the new
>>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
>>>>>>
>>>>>> I didn't mean that there is no any application that configures
>>>>>> everything.  Of course, there is.  I mean that there is no such
>>>>>> entity that abstracts all that socket machinery from the user's
>>>>>> application that runs inside the container.  QEMU hides all the
>>>>>> details of the connection to vhost backend and presents the device
>>>>>> as a PCI device with a network interface wrapping from the guest
>>>>>> kernel.  So, the application inside VM shouldn't care what actually
>>>>>> there is a socket connected to OVS that implements backend and
>>>>>> forward traffic somewhere.  For the application it's just a usual
>>>>>> network interface.
>>>>>> But in case of a container world, application should handle all
>>>>>> that by creating a virtio-user device that will connect to some
>>>>>> socket, that has an OVS on the other side.
>>>>>>
>>>>>>>
>>>>>>> Can you describe the steps used today (without the broker) for
>>>>>>> instantiating a new DPDK app container and connecting it to OVS?
>>>>>>> Although my interest is in the vhost-user protocol I think it's
>>>>>>> necessary to understand the OVS requirements here and I know little
>>>>>>> about them.
>>>>>>>> I might describe some things wrong since I worked with k8s and CNI
>>>>>> plugins last time ~1.5 years ago, but the basic schema will look
>>>>>> something like this:
>>>>>>
>>>>>> 1. user decides to start a new pod and requests k8s to do that
>>>>>>    via cmdline tools or some API calls.
>>>>>>
>>>>>> 2. k8s scheduler looks for available resources asking resource
>>>>>>    manager plugins, finds an appropriate physical host and asks
>>>>>>    local to that node kubelet daemon to launch a new pod there.
>>>>>>
>>>>
>>>> When the CNI is called, the pod has already been created, i.e: a PodID
>>> exists
>>>> and so does an associated network namespace. Therefore, everything that
>>> has to
>>>> do with the runtime spec such as mountpoints or devices cannot be
>>> modified by
>>>> this time.
>>>>
>>>> That's why the Device Plugin API is used to modify the Pod's spec before
>>> the CNI
>>>> chain is called.
>>>>
>>>>>> 3. kubelet asks local CNI plugin to allocate network resources
>>>>>>    and annotate the pod with required mount points, devices that
>>>>>>    needs to be passed in and environment variables.
>>>>>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
>>>>>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
>>>>>>    usually deployed as a system DaemonSet, so it runs in a
>>>>>>    separate pod.
>>>>>>
>>>>>> 4. Assuming that vhost-user connection requested in server mode.
>>>>>>    CNI plugin will:
>>>>>>    4.1 create a directory for a vhost-user socket.
>>>>>>    4.2 add this directory to pod annotations as a mount point.
>>>>
>>>> I believe this is not possible, it would have to inspect the pod's spec
>>> or
>>>> otherwise determine an existing mount point where the socket should be
>>> created.
>>>
>>> Uff.  Yes, you're right.  Thanks for your clarification.
>>> I mixed up CNI and Device Plugin here.
>>>
>>> CNI itself is not able to annotate new resources to the pod, i.e.
>>> create new mounts or something like this.   And I don't recall any
>>> vhost-user device plugins.  Is there any?  There is an SR-IOV device
>>> plugin, but its purpose is to allocate and pass PCI devices, not create
>>> mounts for vhost-user.
>>>
>>> So, IIUC, right now user must create the directory and specify
>>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
>>> or something like this, right?
>>>
>>> Looking at userspace-cni-network-plugin, it actually just parses
>>> annotations to find the shared directory and fails if there is
>>> no any:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
>>>
>>> And examples suggests to specify a directory to mount:
>>>
>>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
>>>
>>> Looks like this is done by user's hands.
>>>
>>> Yes, I am one of the primary authors of Userspace CNI. Currently, the
>> directory is by hand. Long term thought was to have a mutating
>> webhook/admission controller inject a directory into the podspec.  Not sure
>> if it has changed, but I think when I was originally doing this work, OvS
>> only lets you choose the directory at install time, so it has to be
>> something like /var/run/openvswitch/. You can choose the socketfile name
>> and maybe a subdirectory off the main directory, but not the full path.
>>
>> One of the issues I was trying to solve was making sure ContainerA couldn't
>> see ContainerB's socketfiles. That's where the admission controller could
>> create a unique subdirectory for each container under
>> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
>> always took precedence so that work never completed.
> 
> If the CNI plugin has access to the container's network namespace, could
> it create an abstract AF_UNIX listen socket?
> 
> That way the application inside the container could connect to an
> AF_UNIX socket and there is no need to manage container volumes.
> 
> I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> sane way of passing the listen socket fd into ovswitchd from the CNI
> plugin?
> 
> The steps:
> 1. CNI plugin enters container's network namespace and opens an abstract
>    AF_UNIX listen socket.
> 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
>    add-port step. Instead of using type=dpdkvhostuserclient
>    options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
>    dpdkvhostuser server with the listen fd.

For this step you will need a side channel, i.e. a separate unix socket
created by ovs-vswitchd (most likely created by rte_vhost on the
rte_vhost_driver_register() call).

The problem is that ovs-vsctl talks to ovsdb-server and adds the new
port -- just a new row in the 'interface' table of the database.
ovs-vswitchd receives the update from the database and creates the actual
port.  All the communication goes through JSON-RPC, so passing fds is
not an option.
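The underlying constraint is that a file descriptor can only cross a process boundary as SCM_RIGHTS ancillary data on an AF_UNIX socket; a JSON-RPC connection has no way to carry one. A minimal sketch of what such a side channel would have to do (the demo hands a pipe's read end across a socketpair; all names are illustrative):

```python
import array
import os
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # fds travel as SCM_RIGHTS ancillary data alongside at least 1 byte
    # of regular payload.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]).tobytes())])

def recv_fd(sock: socket.socket) -> int:
    # The kernel installs a *new* descriptor in the receiver referring
    # to the same open file description.
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_SPACE(4))
    level, ctype, data = ancdata[0]
    assert level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS
    return array.array("i", data)[0]

# Demo: pass a pipe's read end to "the other side" of a socketpair.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
send_fd(a, r)
received = recv_fd(b)
os.write(w, b"listen-fd")
```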

> 3. When the container starts, it connects to the abstract AF_UNIX
>    socket. The abstract socket name is provided to the container at
>    startup time in an environment variable. The name is unique, at least
>    to the pod, so that multiple containers in the pod can run vhost-user
>    applications.

A few more problems with this solution:

- We still want to run the application inside the container in server mode,
  because the virtio-user PMD in client mode doesn't support re-connection.

- How do we get this fd again after an OVS restart?  CNI will not be invoked
  at that point to pass a new fd.

- If the application closes the connection for any reason (restart, some
  reconfiguration internal to the application) while OVS is re-started
  at the same time, the abstract socket will be gone.  A persistent daemon
  is needed to hold it.
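For reference, running the container application's virtio-user port in server mode is an EAL --vdev option; a sketch with testpmd (socket path, cores, and queue count are illustrative):

```shell
# Inside the container: create a virtio-user port in *server* mode, so
# that OVS (dpdkvhostuserclient on the other side) can reconnect to it
# after an OVS restart.
dpdk-testpmd -l 0-1 --no-pci \
    --vdev=virtio_user0,path=/var/run/sock/vhost0.sock,server=1,queues=1 \
    -- -i
```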

Best regards, Ilya Maximets.
Stefan Hajnoczi March 24, 2021, 3:07 p.m. UTC | #21
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>
> >>>>
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> And some housekeeping usually required for applications in case the
> >>>>>>>>>> socket server terminated abnormally and socket files left on a file
> >>>>>>>>>> system:
> >>>>>>>>>>  "failed to bind to vhu: Address already in use; remove it and try
> >>> again"
> >>>>>>>>>
> >>>>>>>>> QEMU avoids this by unlinking before binding. The drawback is that
> >>> users
> >>>>>>>>> might accidentally hijack an existing listen socket, but that can be
> >>>>>>>>> solved with a pidfile.
> >>>>>>>>
> >>>>>>>> How exactly this could be solved with a pidfile?
> >>>>>>>
> >>>>>>> A pidfile prevents two instances of the same service from running at
> >>> the
> >>>>>>> same time.
> >>>>>>>
> >>>>>>> The same effect can be achieved by the container orchestrator,
> >>> systemd,
> >>>>>>> etc too because it refuses to run the same service twice.
> >>>>>>
> >>>>>> Sure. I understand that.  My point was that these could be 2 different
> >>>>>> applications and they might not know which process to look for.
> >>>>>>
> >>>>>>>
> >>>>>>>> And what if this is
> >>>>>>>> a different application that tries to create a socket on a same path?
> >>>>>>>> e.g. QEMU creates a socket (started in a server mode) and user
> >>>>>>>> accidentally created dpdkvhostuser port in Open vSwitch instead of
> >>>>>>>> dpdkvhostuserclient.  This way rte_vhost library will try to bind
> >>>>>>>> to an existing socket file and will fail.  Subsequently port creation
> >>>>>>>> in OVS will fail.   We can't allow OVS to unlink files because this
> >>>>>>>> way OVS users will have ability to unlink random sockets that OVS has
> >>>>>>>> access to and we also has no idea if it's a QEMU that created a file
> >>>>>>>> or it was a virtio-user application or someone else.
> >>>>>>>
> >>>>>>> If rte_vhost unlinks the socket then the user will find that
> >>> networking
> >>>>>>> doesn't work. They can either hot unplug the QEMU vhost-user-net
> >>> device
> >>>>>>> or restart QEMU, depending on whether they need to keep the guest
> >>>>>>> running or not. This is a misconfiguration that is recoverable.
> >>>>>>
> >>>>>> True, it's recoverable, but with a high cost.  Restart of a VM is
> >>> rarely
> >>>>>> desirable.  And the application inside the guest might not feel itself
> >>>>>> well after hot re-plug of a device that it actively used.  I'd expect
> >>>>>> a DPDK application that runs inside a guest on some virtio-net device
> >>>>>> to crash after this kind of manipulations.  Especially, if it uses some
> >>>>>> older versions of DPDK.
> >>>>>
> >>>>> This unlink issue is probably something we think differently about.
> >>>>> There are many ways for users to misconfigure things when working with
> >>>>> system tools. If it's possible to catch misconfigurations that is
> >>>>> preferrable. In this case it's just the way pathname AF_UNIX domain
> >>>>> sockets work and IMO it's better not to have problems starting the
> >>>>> service due to stale files than to insist on preventing
> >>>>> misconfigurations. QEMU and DPDK do this differently and both seem to be
> >>>>> successful, so ¯\_(ツ)_/¯.
> >>>>>
> >>>>>>>
> >>>>>>> Regarding letting OVS unlink files, I agree that it shouldn't if this
> >>>>>>> create a security issue. I don't know the security model of OVS.
> >>>>>>
> >>>>>> In general privileges of a ovs-vswitchd daemon might be completely
> >>>>>> different from privileges required to invoke control utilities or
> >>>>>> to access the configuration database.  SO, yes, we should not allow
> >>>>>> that.
> >>>>>
> >>>>> That can be locked down by restricting the socket path to a file beneath
> >>>>> /var/run/ovs/vhost-user/.
> >>>>>
> >>>>>>>
> >>>>>>>> There are, probably, ways to detect if there is any alive process
> >>> that
> >>>>>>>> has this socket open, but that sounds like too much for this purpose,
> >>>>>>>> also I'm not sure if it's possible if actual user is in a different
> >>>>>>>> container.
> >>>>>>>> So I don't see a good reliable way to detect these conditions.  This
> >>>>>>>> falls on shoulders of a higher level management software or a user to
> >>>>>>>> clean these socket files up before adding ports.
> >>>>>>>
> >>>>>>> Does OVS always run in the same net namespace (pod) as the DPDK
> >>>>>>> application? If yes, then abstract AF_UNIX sockets can be used.
> >>> Abstract
> >>>>>>> AF_UNIX sockets don't have a filesystem path and the socket address
> >>>>>>> disappears when there is no process listening anymore.
> >>>>>>
> >>>>>> OVS is usually started right on the host in a main network namespace.
> >>>>>> In case it's started in a pod, it will run in a separate container but
> >>>>>> configured with a host network.  Applications almost exclusively runs
> >>>>>> in separate pods.
> >>>>>
> >>>>> Okay.
> >>>>>
> >>>>>>>>>> This patch-set aims to eliminate most of the inconveniences by
> >>>>>>>>>> leveraging an infrastructure service provided by a SocketPair
> >>> Broker.
> >>>>>>>>>
> >>>>>>>>> I don't understand yet why this is useful for vhost-user, where the
> >>>>>>>>> creation of the vhost-user device backend and its use by a VMM are
> >>>>>>>>> closely managed by one piece of software:
> >>>>>>>>>
> >>>>>>>>> 1. Unlink the socket path.
> >>>>>>>>> 2. Create, bind, and listen on the socket path.
> >>>>>>>>> 3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
> >>>>>>>>>    RPC, spawn a process, etc) and pass in the listen fd.
> >>>>>>>>> 4. In the meantime the VMM can open the socket path and call
> >>> connect(2).
> >>>>>>>>>    As soon as the vhost-user device backend calls accept(2) the
> >>>>>>>>>    connection will proceed (there is no need for sleeping).
> >>>>>>>>>
> >>>>>>>>> This approach works across containers without a broker.
> >>>>>>>>
> >>>>>>>> Not sure if I fully understood a question here, but anyway.
> >>>>>>>>
> >>>>>>>> This approach works fine if you know what application to run.
> >>>>>>>> In case of a k8s cluster, it might be a random DPDK application
> >>>>>>>> with virtio-user ports running inside a container and want to
> >>>>>>>> have a network connection.  Also, this application needs to run
> >>>>>>>> virtio-user in server mode, otherwise restart of the OVS will
> >>>>>>>> require restart of the application.  So, you basically need to
> >>>>>>>> rely on a third-party application to create a socket with a right
> >>>>>>>> name and in a correct location that is shared with a host, so
> >>>>>>>> OVS can find it and connect.
> >>>>>>>>
> >>>>>>>> In a VM world everything is much more simple, since you have
> >>>>>>>> a libvirt and QEMU that will take care of all of these stuff
> >>>>>>>> and which are also under full control of management software
> >>>>>>>> and a system administrator.
> >>>>>>>> In case of a container with a "random" DPDK application inside
> >>>>>>>> there is no such entity that can help.  Of course, some solution
> >>>>>>>> might be implemented in docker/podman daemon to create and manage
> >>>>>>>> outside-looking sockets for an application inside the container,
> >>>>>>>> but that is not available today AFAIK and I'm not sure if it
> >>>>>>>> ever will.
> >>>>>>>
> >>>>>>> Wait, when you say there is no entity like management software or a
> >>>>>>> system administrator, then how does OVS know to instantiate the new
> >>>>>>> port? I guess something still needs to invoke ovs-ctl add-port?
> >>>>>>
> >>>>>> I didn't mean that there is no any application that configures
> >>>>>> everything.  Of course, there is.  I mean that there is no such
> >>>>>> entity that abstracts all that socket machinery from the user's
> >>>>>> application that runs inside the container.  QEMU hides all the
> >>>>>> details of the connection to vhost backend and presents the device
> >>>>>> as a PCI device with a network interface wrapping from the guest
> >>>>>> kernel.  So, the application inside VM shouldn't care what actually
> >>>>>> there is a socket connected to OVS that implements backend and
> >>>>>> forward traffic somewhere.  For the application it's just a usual
> >>>>>> network interface.
> >>>>>> But in case of a container world, application should handle all
> >>>>>> that by creating a virtio-user device that will connect to some
> >>>>>> socket, that has an OVS on the other side.
> >>>>>>
> >>>>>>>
> >>>>>>> Can you describe the steps used today (without the broker) for
> >>>>>>> instantiating a new DPDK app container and connecting it to OVS?
> >>>>>>> Although my interest is in the vhost-user protocol I think it's
> >>>>>>> necessary to understand the OVS requirements here and I know little
> >>>>>>> about them.
> >>>>>>>> I might describe some things wrong since I worked with k8s and CNI
> >>>>>> plugins last time ~1.5 years ago, but the basic schema will look
> >>>>>> something like this:
> >>>>>>
> >>>>>> 1. user decides to start a new pod and requests k8s to do that
> >>>>>>    via cmdline tools or some API calls.
> >>>>>>
> >>>>>> 2. k8s scheduler looks for available resources asking resource
> >>>>>>    manager plugins, finds an appropriate physical host and asks
> >>>>>>    local to that node kubelet daemon to launch a new pod there.
> >>>>>>
> >>>>
> >>>> When the CNI is called, the pod has already been created, i.e: a PodID
> >>> exists
> >>>> and so does an associated network namespace. Therefore, everything that
> >>> has to
> >>>> do with the runtime spec such as mountpoints or devices cannot be
> >>> modified by
> >>>> this time.
> >>>>
> >>>> That's why the Device Plugin API is used to modify the Pod's spec before
> >>> the CNI
> >>>> chain is called.
> >>>>
> >>>>>> 3. kubelet asks local CNI plugin to allocate network resources
> >>>>>>    and annotate the pod with required mount points, devices that
> >>>>>>    needs to be passed in and environment variables.
> >>>>>>    (this is, IIRC, a gRPC connection.   It might be a multus-cni
> >>>>>>    or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
> >>>>>>    usually deployed as a system DaemonSet, so it runs in a
> >>>>>>    separate pod.
> >>>>>>
> >>>>>> 4. Assuming that vhost-user connection requested in server mode.
> >>>>>>    CNI plugin will:
> >>>>>>    4.1 create a directory for a vhost-user socket.
> >>>>>>    4.2 add this directory to pod annotations as a mount point.
> >>>>
> >>>> I believe this is not possible, it would have to inspect the pod's spec
> >>> or
> >>>> otherwise determine an existing mount point where the socket should be
> >>> created.
> >>>
> >>> Uff.  Yes, you're right.  Thanks for your clarification.
> >>> I mixed up CNI and Device Plugin here.
> >>>
> >>> CNI itself is not able to annotate new resources to the pod, i.e.
> >>> create new mounts or something like this.   And I don't recall any
> >>> vhost-user device plugins.  Is there any?  There is an SR-IOV device
> >>> plugin, but its purpose is to allocate and pass PCI devices, not create
> >>> mounts for vhost-user.
> >>>
> >>> So, IIUC, right now user must create the directory and specify
> >>> a mount point in a pod spec file or pass the whole /var/run/openvswitch
> >>> or something like this, right?
> >>>
> >>> Looking at userspace-cni-network-plugin, it actually just parses
> >>> annotations to find the shared directory and fails if there is
> >>> no any:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122
> >>>
> >>> And examples suggests to specify a directory to mount:
> >>>
> >>> https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41
> >>>
> >>> Looks like this is done by user's hands.
> >>>
> >>> Yes, I am one of the primary authors of Userspace CNI. Currently, the
> >> directory is by hand. Long term thought was to have a mutating
> >> webhook/admission controller inject a directory into the podspec.  Not sure
> >> if it has changed, but I think when I was originally doing this work, OvS
> >> only lets you choose the directory at install time, so it has to be
> >> something like /var/run/openvswitch/. You can choose the socketfile name
> >> and maybe a subdirectory off the main directory, but not the full path.
> >>
> >> One of the issues I was trying to solve was making sure ContainerA couldn't
> >> see ContainerB's socketfiles. That's where the admission controller could
> >> create a unique subdirectory for each container under
> >> /var/run/openvswitch/. But this was more of a PoC CNI and other work items
> >> always took precedence so that work never completed.
> > 
> > If the CNI plugin has access to the container's network namespace, could
> > it create an abstract AF_UNIX listen socket?
> > 
> > That way the application inside the container could connect to an
> > AF_UNIX socket and there is no need to manage container volumes.
> > 
> > I'm not familiar with the Open VSwitch, so I'm not sure if there is a
> > sane way of passing the listen socket fd into ovswitchd from the CNI
> > plugin?
> > 
> > The steps:
> > 1. CNI plugin enters container's network namespace and opens an abstract
> >    AF_UNIX listen socket.
> > 2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
> >    add-port step. Instead of using type=dpdkvhostuserclient
> >    options:vhost-server-path=/tmp/dpdkvhostclient0 it instead create a
> >    dpdkvhostuser server with the listen fd.
> 
> For this step you will need a side channel, i.e. a separate unix socket
> created by ovs-vswitchd (most likely, created by rte_vhost on
> rte_vhost_driver_register() call).
> 
> The problem is that ovs-vsctl talks with ovsdb-server and adds the new
> port -- just a new row in the 'interface' table of the database.
> ovs-vswitchd receives update from the database and creates the actual
> port.  All the communications done through JSONRPC, so passing fds is
> not an option.
> 
> > 3. When the container starts, it connects to the abstract AF_UNIX
> >    socket. The abstract socket name is provided to the container at
> >    startup time in an environment variable. The name is unique, at least
> >    to the pod, so that multiple containers in the pod can run vhost-user
> >    applications.
> 
> Few more problems with this solution:
> 
> - We still want to run application inside the container in a server mode,
>   because virtio-user PMD in client mode doesn't support re-connection.
> 
> - How to get this fd again after the OVS restart?  CNI will not be invoked
>   at this point to pass a new fd.
> 
> - If application will close the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS will be re-started
>   at the same time, abstract socket will be gone.  Need a persistent daemon
>   to hold it.

Okay, if there is no component that has a lifetime suitable for holding
the abstract listen socket, then using pathname AF_UNIX sockets seems
like a better approach.
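The pathname approach then brings back the stale-file problem discussed earlier in the thread. One sketch of the "unlink before bind" pattern, with a flock-based lockfile as one possible guard against hijacking a live listener (path and lockfile scheme are illustrative, not what any of the projects here actually do):

```python
import fcntl
import os
import socket

SOCK_PATH = "/tmp/dpdkvhost-demo.sock"   # illustrative path

def listen_unix(path: str):
    # Take an exclusive lock on a sidecar lockfile first.  A stale socket
    # file left by a crashed process doesn't hold the lock, so it is safe
    # to unlink it; a *live* listener does hold it, so we refuse to start.
    lock = open(path + ".lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        raise RuntimeError(f"another process is serving {path}")
    try:
        os.unlink(path)                  # remove a stale socket file, if any
    except FileNotFoundError:
        pass
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.bind(path)
    s.listen(1)
    return s, lock                       # keep the lock fd alive with the socket

srv, srv_lock = listen_unix(SOCK_PATH)
```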

Stefan
Maxime Coquelin March 24, 2021, 8:56 p.m. UTC | #22
Hi Ilya,

On 3/19/21 5:45 PM, Ilya Maximets wrote:
> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>> Hi Ilya,
>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>> is restarted then existing vhost-user connections drop with an error but
>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>> OVS instance will set up.
>>>
>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>
>> Well, AFAIK, qemu reconnects client connections only:
>>
>>     ``reconnect`` sets the timeout for reconnecting on non-server
>>     sockets when the remote end goes away. qemu will delay this many
>>     seconds and then attempt to reconnect. Zero disables reconnecting,
>>     and is the default.
>>
>> I'm not sure about exact reason.  It was historically this way.
>> For me it doesn't make much sense.  I mean, you're right that it's
>> just a socket, so it should not matter who listens and who connects.
>> If reconnection is possible in one direction, it should be possible
>> in the opposite direction too.
> 
> Sorry, my thought slipped. :)  Yes, QEMU supports re-connection
> for client sockets.  So, in theory, dpdkvhostuser ports should work
> after re-connection.  And that would be nice.  I don't remember
> right now why this doesn't work...  Maybe vhost-user parts in QEMU
> don't handle this case.  Need to dig some more into that and refresh
> my memory.  It was so long ago...
> 
> Maxime, do you remember?

Sorry for the delay. I didn't remember, so I wanted to have a try.

I can confirm reconnect works with QEMU as client and with Vhost PMD as
server with:


    <interface type='vhostuser'>
      <mac address='56:48:4f:53:54:01'/>
      <source type='unix' path='/tmp/vhost-user1' mode='client'>
        <reconnect enabled='yes' timeout='1'/>
      </source>
      <model type='virtio'/>
      <driver name='vhost' rx_queue_size='256'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
function='0x0'/>
    </interface>



> 
>>
>> dpdkvhostuser was deprecated just to scare users and force them to
>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>
>>   "OVS service restarted and network is lost now".
>>
>> BTW, virtio-user ports in DPDK don't support re-connection in client
>> mode either.
> 
> This is still true, though.  virtio-user in client mode doesn't reconnect.

That could be added, though it is maybe not as important to support for
containers as it is for VMs, given the ephemeral nature of containers?

Regards,
Maxime

> 
>>
>> BTW2, with SocketPair Broker it might be cheaper to implement server
>> reconnection in QEMU because all connections in this case are client
>> connections, i.e. both ends will connect() to a broker.
>>
>> Best regards, Ilya Maximets.
>>
>
Ilya Maximets March 24, 2021, 9:39 p.m. UTC | #23
On 3/24/21 9:56 PM, Maxime Coquelin wrote:
> Hi Ilya,
> 
> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>> Hi Ilya,
>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>> is restarted then existing vhost-user connections drop with an error but
>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>> OVS instance will set up.
>>>>
>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>
>>> Well, AFAIK, qemu reconnects client connections only:
>>>
>>>     ``reconnect`` sets the timeout for reconnecting on non-server
>>>     sockets when the remote end goes away. qemu will delay this many
>>>     seconds and then attempt to reconnect. Zero disables reconnecting,
>>>     and is the default.
>>>
>>> I'm not sure about exact reason.  It was historically this way.
>>> For me it doesn't make much sense.  I mean, you're right that it's
>>> just a socket, so it should not matter who listens and who connects.
>>> If reconnection is possible in one direction, it should be possible
>>> in the opposite direction too.
>>
>> Sorry, my thought slipped. :)  Yes, QEMU supports re-connection
>> for client sockets.  So, in theory, dpdkvhostuser ports should work
>> after re-connection.  And that would be nice.  I don't remember
>> right now why this doesn't work...  Maybe vhost-user parts in QEMU
>> don't handle this case.  Need to dig some more into that and refresh
>> my memory.  It was so long ago...
>>
>> Maxime, do you remember?
> 
> Sorry for the delay. I didn't remember, so I wanted to have a try.
> 
> I can confirm reconnect works with QEMU as client and with Vhost PMD as
> server with:
> 
> 
>     <interface type='vhostuser'>
>       <mac address='56:48:4f:53:54:01'/>
>       <source type='unix' path='/tmp/vhost-user1' mode='client'>
>         <reconnect enabled='yes' timeout='1'/>
>       </source>
>       <model type='virtio'/>
>       <driver name='vhost' rx_queue_size='256'/>
>       <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
> function='0x0'/>
>     </interface>

Cool, thanks for checking. :)
If it works with vhost PMD, it should probably work with OVS too.

There are still a couple of problems:

1. OpenStack Nova doesn't support configuring 'reconnect'
   in the libvirt xml (it only adds the 'source' element):
   https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834

2. The 'reconnect' configuration is only supported starting from libvirt
   4.1.0.  That was released in 2018, but some systems still use older
   versions, e.g. Ubuntu 18.04, which will be supported until 2023, uses
   libvirt 4.0.0.

> 
>>
>>>
>>> dpdkvhostuser was deprecated just to scare users and force them to
>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>
>>>   "OVS service restarted and network is lost now".
>>>
>>> BTW, virtio-user ports in DPDK don't support re-connection in client
>>> mode either.
>>
>> This is still true, though.  virtio-user in client mode doesn't reconnect.
> 
> That could be added, and it is maybe not as important for containers as
> it is for VM to support it, given the ephemeral nature of containers?

Well, a restart of OVS should not require restarting all the containers
on the host, even though they are "stateless".

BTW, some infrastructure changes that I made in this series might be
reused to implement client-side reconnection for virtio-user.

> 
> Regards,
> Maxime
> 
>>
>>>
>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>> reconnection in QEMU because all connections in this case are client
>>> connections, i.e. both ends will connect() to a broker.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>
Maxime Coquelin March 24, 2021, 9:51 p.m. UTC | #24
On 3/24/21 10:39 PM, Ilya Maximets wrote:
> On 3/24/21 9:56 PM, Maxime Coquelin wrote:
>> Hi Ilya,
>>
>> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>>> Hi Ilya,
>>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>>> is restarted then existing vhost-user connections drop with an error but
>>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>>> OVS instance will set up.
>>>>>
>>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>>
>>>> Well, AFAIK, qemu reconnects client connections only:
>>>>
>>>>     ``reconnect`` sets the timeout for reconnecting on non-server
>>>>     sockets when the remote end goes away. qemu will delay this many
>>>>     seconds and then attempt to reconnect. Zero disables reconnecting,
>>>>     and is the default.
>>>>
>>>> I'm not sure about exact reason.  It was historically this way.
>>>> For me it doesn't make much sense.  I mean, you're right that it's
>>>> just a socket, so it should not matter who listens and who connects.
>>>> If reconnection is possible in one direction, it should be possible
>>>> in the opposite direction too.
>>>
>>> Sorry, my thought slipped. :)  Yes, QEMU supports re-connection
>>> for client sockets.  So, in theory, dpdkvhostuser ports should work
>>> after re-connection.  And that would be nice.  I don't remember
>>> right now why this doesn't work...  Maybe vhost-user parts in QEMU
>>> don't handle this case.  Need to dig some more into that and refresh
>>> my memory.  It was so long ago...
>>>
>>> Maxime, do you remember?
>>
>> Sorry for the delay. I didn't remember, so I wanted to have a try.
>>
>> I can confirm reconnect works with QEMU as client and with Vhost PMD as
>> server with:
>>
>>
>>     <interface type='vhostuser'>
>>       <mac address='56:48:4f:53:54:01'/>
>>       <source type='unix' path='/tmp/vhost-user1' mode='client'>
>>         <reconnect enabled='yes' timeout='1'/>
>>       </source>
>>       <model type='virtio'/>
>>       <driver name='vhost' rx_queue_size='256'/>
>>       <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
>> function='0x0'/>
>>     </interface>
> 
> Cool, thanks for checking. :)
> If it works with vhost PMD, it should probably work with OVS too.
> 
> There are still a couple of problems:
> 
> 1. OpenStack Nova doesn't support configuration of a 'reconnect'
>    in libvirt xml (it only adds the 'source'):
>    https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834
> 
> 2. 'reconnect' configuration supported only starting from libvirt 4.1.0.
>    It's released in 2018, but still some systems are using older versions.
>    e.g. Ubuntu 18.04 which will be supported until 2023 uses libvirt 4.0.0.

Is it really a problem? We can keep using OVS as client for OSP, and use
OVS as server for containers.

>>
>>>
>>>>
>>>> dpdkvhostuser was deprecated just to scare users and force them to
>>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>>
>>>>   "OVS service restarted and network is lost now".
>>>>
>>>> BTW, virtio-user ports in DPDK don't support re-connection in client
>>>> mode either.
>>>
>>> This is still true, though.  virtio-user in client mode doesn't reconnect.
>>
>> That could be added, and it is maybe not as important for containers as
>> it is for VM to support it, given the ephemeral nature of containers?
> 
> Well, restart of OVS should not require restarting of all the containers
> on the host even though they are "stateless".
> 
> BTW, some infrastructure changes that I made in this series might be
> reused to implement client-side reconnection for virtio-user.

Great, I'll look at the series when we work on adding reconnect for
clients.

Thanks,
Maxime

>>
>> Regards,
>> Maxime
>>
>>>
>>>>
>>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>>> reconnection in QEMU because all connections in this case are client
>>>> connections, i.e. both ends will connect() to a broker.
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>
>
Ilya Maximets March 24, 2021, 10:17 p.m. UTC | #25
On 3/24/21 10:51 PM, Maxime Coquelin wrote:
> 
> 
> On 3/24/21 10:39 PM, Ilya Maximets wrote:
>> On 3/24/21 9:56 PM, Maxime Coquelin wrote:
>>> Hi Ilya,
>>>
>>> On 3/19/21 5:45 PM, Ilya Maximets wrote:
>>>> On 3/19/21 5:11 PM, Ilya Maximets wrote:
>>>>> On 3/19/21 3:39 PM, Stefan Hajnoczi wrote:
>>>>>> Hi Ilya,
>>>>>> By the way, it's not clear to me why dpdkvhostuser is deprecated. If OVS
>>>>>> is restarted then existing vhost-user connections drop with an error but
>>>>>> QEMU could attempt to reconnect to the UNIX domain socket which the new
>>>>>> OVS instance will set up.
>>>>>>
>>>>>> Why is it impossible to reconnect when OVS owns the listen socket?
>>>>>
>>>>> Well, AFAIK, qemu reconnects client connections only:
>>>>>
>>>>>     ``reconnect`` sets the timeout for reconnecting on non-server
>>>>>     sockets when the remote end goes away. qemu will delay this many
>>>>>     seconds and then attempt to reconnect. Zero disables reconnecting,
>>>>>     and is the default.
>>>>>
>>>>> I'm not sure about exact reason.  It was historically this way.
>>>>> For me it doesn't make much sense.  I mean, you're right that it's
>>>>> just a socket, so it should not matter who listens and who connects.
>>>>> If reconnection is possible in one direction, it should be possible
>>>>> in the opposite direction too.
>>>>
>>>> Sorry, my thought slipped. :)  Yes, QEMU supports re-connection
>>>> for client sockets.  So, in theory, dpdkvhostuser ports should work
>>>> after re-connection.  And that would be nice.  I don't remember
>>>> right now why this doesn't work...  Maybe vhost-user parts in QEMU
>>>> don't handle this case.  Need to dig some more into that and refresh
>>>> my memory.  It was so long ago...
>>>>
>>>> Maxime, do you remember?
>>>
>>> Sorry for the delay. I didn't remember, so I wanted to have a try.
>>>
>>> I can confirm reconnect works with QEMU as client and with Vhost PMD as
>>> server with:
>>>
>>>
>>>     <interface type='vhostuser'>
>>>       <mac address='56:48:4f:53:54:01'/>
>>>       <source type='unix' path='/tmp/vhost-user1' mode='client'>
>>>         <reconnect enabled='yes' timeout='1'/>
>>>       </source>
>>>       <model type='virtio'/>
>>>       <driver name='vhost' rx_queue_size='256'/>
>>>       <address type='pci' domain='0x0000' bus='0x07' slot='0x00'
>>> function='0x0'/>
>>>     </interface>
>>
>> Cool, thanks for checking. :)
>> If it works with vhost PMD, it should probably work with OVS too.
>>
>> There are still a couple of problems:
>>
>> 1. OpenStack Nova doesn't support configuration of a 'reconnect'
>>    in libvirt xml (it only adds the 'source'):
>>    https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L1834
>>
>> 2. 'reconnect' configuration supported only starting from libvirt 4.1.0.
>>    It's released in 2018, but still some systems are using older versions.
>>    e.g. Ubuntu 18.04 which will be supported until 2023 uses libvirt 4.0.0.
> 
> Is it really a problem? We can keep using OVS as client for OSP,

Sure, no problem here.  I think we scared users enough that they
mostly switched to this configuration, and it's a good thing.

> and use OVS as server for containers.

There is a problem here: OVS emits a deprecation warning each time
you try to use dpdkvhostuser ports.  There are no functional problems,
though, except that deprecation also implies that we're consciously
not adding new features to these ports.

Another thing is that the current state of the art in k8s CNIs requires
mounting the whole /var/run/openvswitch or managing directories by
hand.  So, IMO, the Broker idea is still valid in some form in terms
of hassle reduction for users.

> 
>>>
>>>>
>>>>>
>>>>> dpdkvhostuser was deprecated just to scare users and force them to
>>>>> migrate to dpdkvhostuserclient and avoid constant bug reports like:
>>>>>
>>>>>   "OVS service restarted and network is lost now".
>>>>>
>>>>> BTW, virtio-user ports in DPDK don't support re-connection in client
>>>>> mode either.
>>>>
>>>> This is still true, though.  virtio-user in client mode doesn't reconnect.
>>>
>>> That could be added, and it is maybe not as important for containers as
>>> it is for VM to support it, given the ephemeral nature of containers?
>>
>> Well, restart of OVS should not require restarting of all the containers
>> on the host even though they are "stateless".
>>
>> BTW, some infrastructure changes that I made in this series might be
>> reused to implement client-side reconnection for virtio-user.
> 
> Great, I'll look at the series when we work on adding reconnect for
> clients.

Please take a look at patch #1 in this set sooner, though.
It's a bug fix. :)

> 
> Thanks,
> Maxime
> 
>>>
>>> Regards,
>>> Maxime
>>>
>>>>
>>>>>
>>>>> BTW2, with SocketPair Broker it might be cheaper to implement server
>>>>> reconnection in QEMU because all connections in this case are client
>>>>> connections, i.e. both ends will connect() to a broker.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>
>>
>
Stefan Hajnoczi March 25, 2021, 9:35 a.m. UTC | #26
On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> - How to get this fd again after the OVS restart?  CNI will not be invoked
>   at this point to pass a new fd.
> 
> - If application will close the connection for any reason (restart, some
>   reconfiguration internal to the application) and OVS will be re-started
>   at the same time, abstract socket will be gone.  Need a persistent daemon
>   to hold it.

I remembered that these two points can be solved by sd_notify(3)
FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
this is the case (at least in the CNI use case)?

https://www.freedesktop.org/software/systemd/man/sd_notify.html

Stefan
Ilya Maximets March 25, 2021, 11 a.m. UTC | #27
On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>> - How to get this fd again after the OVS restart?  CNI will not be invoked
>>   at this point to pass a new fd.
>>
>> - If application will close the connection for any reason (restart, some
>>   reconfiguration internal to the application) and OVS will be re-started
>>   at the same time, abstract socket will be gone.  Need a persistent daemon
>>   to hold it.
> 
> I remembered that these two points can be solved by sd_notify(3)
> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> this is the case (at least in the CNI use case)?
> 
> https://www.freedesktop.org/software/systemd/man/sd_notify.html

IIUC, these file descriptors are only passed back on a restart of the
service, so the port-del + port-add scenario is not covered (and this is
a very common use case: users implement some configuration changes this
way, and the sequence can also be triggered internally, e.g. to change
the OpenFlow port number).
port-del will release all the resources, including the listening socket.
Keeping the fd around for later use is not an option, because OVS will
not know whether this port will be added back, and fds are a limited
resource.

It's also unclear how to map these file descriptors back to the
particular ports they belong to after a restart.

OVS could run as a system pod or as a systemd service.  It differs from
one setup to another.  So it might not be controlled by systemd.

Also, it behaves as an old-style daemon, so it closes all the file
descriptors, forks and so on.  This might be adjusted, though, with
some rework of the daemonization procedure.

On a side note, it may be interesting to allow the user application to
create a socket and pass a pollable file descriptor directly to
rte_vhost_driver_register() instead of a socket path.  This way
the user application may choose an abstract socket, a file
socket, or any other future type of socket connection.  This would
also allow the user application to store these sockets somewhere, or
receive them from systemd/init/other management software.

Best regards, Ilya Maximets.
Stefan Hajnoczi March 25, 2021, 4:43 p.m. UTC | #28
On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> > On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >> - How to get this fd again after the OVS restart?  CNI will not be invoked
> >>   at this point to pass a new fd.
> >>
> >> - If application will close the connection for any reason (restart, some
> >>   reconfiguration internal to the application) and OVS will be re-started
> >>   at the same time, abstract socket will be gone.  Need a persistent daemon
> >>   to hold it.
> > 
> > I remembered that these two points can be solved by sd_notify(3)
> > FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> > this is the case (at least in the CNI use case)?
> > 
> > https://www.freedesktop.org/software/systemd/man/sd_notify.html
> 
> IIUC, these file descriptors only passed on the restart of the service,
> so port-del + port-add scenario is not covered (and this is a very
> common usecase, users are implementing some configuration changes this
> way and also this is internally possible scenario, e.g. this sequence
> will be triggered internally to change the OpenFlow port number).
> port-del will release all the resources including the listening socket.
> Keeping the fd for later use is not an option, because OVS will not know
> if this port will be added back or not and fds are a limited resource.

If users of the CNI plugin are reasonably expected to do this then it
sounds like a blocker for the sd_notify(3) approach. Maybe it could be
fixed by introducing an atomic port-rename (?) operation, but this is
starting to sound too invasive.

> It's also unclear how to map these file descriptors to particular ports
> they belong to after restart.

The first fd would be a memfd containing a description of the remaining
fds plus any other crash recovery state that OVS wants.

> OVS could run as a system pod or as a systemd service.  It differs from
> one setup to another.  So it might not be controlled by systemd.

Does the CNI plugin allow both configurations?

It's impossible to come up with one approach that works for everyone in
the general case (beyond the CNI plugin, beyond Kubernetes). I think we
need to enumerate use cases and decide which ones are currently not
addressed satisfactorily.

> Also, it behaves as an old-style daemon, so it closes all the file
> descriptors, forks and so on.  This might be adjusted, though, with
> some rework of the daemonization procedure.

Doesn't sound like fun but may be doable.

>> On the side note, it may be interesting to allow user application to
> create a socket and pass a pollable file descriptor directly to
> rte_vhost_driver_register() instead of a socket path.  This way
> the user application may choose to use an abstract socket or a file
> socket or any other future type of socket connections.  This will
> also allow user application to store these sockets somewhere, or
> receive them from systemd/init/other management software.

Yes, sounds useful.

Stefan
Ilya Maximets March 25, 2021, 5:58 p.m. UTC | #29
On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
>> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
>>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
>>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
>>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
>>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
>>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
>>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
>>>> - How to get this fd again after the OVS restart?  CNI will not be invoked
>>>>   at this point to pass a new fd.
>>>>
>>>> - If application will close the connection for any reason (restart, some
>>>>   reconfiguration internal to the application) and OVS will be re-started
>>>>   at the same time, abstract socket will be gone.  Need a persistent daemon
>>>>   to hold it.
>>>
>>> I remembered that these two points can be solved by sd_notify(3)
>>> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
>>> this is the case (at least in the CNI use case)?
>>>
>>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
>>
>> IIUC, these file descriptors only passed on the restart of the service,
>> so port-del + port-add scenario is not covered (and this is a very
>> common usecase, users are implementing some configuration changes this
>> way and also this is internally possible scenario, e.g. this sequence
>> will be triggered internally to change the OpenFlow port number).
>> port-del will release all the resources including the listening socket.
>> Keeping the fd for later use is not an option, because OVS will not know
>> if this port will be added back or not and fds are a limited resource.
> 
> If users of the CNI plugin are reasonably expected to do this then it
> sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> fixed by introducing an atomic port-rename (?) operation, but this is
> starting to sound too invasive.

It's hard to implement, actually.  Things like 'port-rename' would
be internally implemented as del+add in most cases; otherwise, it
would require a significant rework of OVS internals.
There are things that could be adjusted on the fly, but some
fundamental parts, like the OF port number that everything else depends
on, are not easy to change.

> 
>> It's also unclear how to map these file descriptors to particular ports
>> they belong to after restart.
> 
> The first fd would be a memfd containing a description of the remaining
> fds plus any other crash recovery state that OVS wants.

Yeah, I saw that it's possible to assign names to fds, so from this
perspective it's not a big problem.

> 
>> OVS could run as a system pod or as a systemd service.  It differs from
>> one setup to another.  So it might not be controlled by systemd.
> 
> Does the CNI plugin allow both configurations?

The CNI itself runs as a DaemonSet (a pod on each node), and it doesn't
matter whether OVS is running on the host or in a different pod.  They
share a part of the filesystem (/var/run/openvswitch/ and some others).
For example, the OVN-K8s CNI provides an OVS DaemonSet:
  https://github.com/ovn-org/ovn-kubernetes/blob/master/dist/templates/ovs-node.yaml.j2
Users can use it, but it's not required, and it makes no difference from
the CNI point of view.

Everything is a pod in k8s, but you can run some parts on the host if
you wish.

In general, the CNI plugin only needs a network connection to the
ovsdb-server process.  In reality, most CNI plugins connect via the
control socket in /var/run/openvswitch.

> 
> It's impossible to come up with one approach that works for everyone in
> the general case (beyond the CNI plugin, beyond Kubernetes).

If we're looking for a solution to store abstract sockets somehow
for OVS, then it's hard to come up with something generic.  It will
have a dependency on a specific init system anyway.

OTOH, the Broker solution will work for all cases. :)  One may think
of a broker as a service that supplies abstract sockets to processes
from different namespaces.  These sockets are already connected, for
convenience.

> I think we
> need to enumerate use cases and decide which ones are currently not
> addressed satisfactorily.
> 
>> Also, it behaves as an old-style daemon, so it closes all the file
>> descriptors, forks and so on.  This might be adjusted, though, with
>> some rework of the daemonization procedure.
> 
> Doesn't sound like fun but may be doable.

It really doesn't sound like fun, so I'd rather not do that unless
we have a solid use case.

> 
>> On the side note, it may be interesting to allow user application to
>> create a socket and pass a pollable file descriptor directly to
>> rte_vhost_driver_register() instead of a socket path.  This way
>> the user application may choose to use an abstract socket or a file
>> socket or any other future type of socket connections.  This will
>> also allow user application to store these sockets somewhere, or
>> receive them from systemd/init/other management software.
> 
> Yes, sounds useful.
> 
> Stefan
>
Stefan Hajnoczi March 30, 2021, 3:01 p.m. UTC | #30
On Thu, Mar 25, 2021 at 06:58:56PM +0100, Ilya Maximets wrote:
> On 3/25/21 5:43 PM, Stefan Hajnoczi wrote:
> > On Thu, Mar 25, 2021 at 12:00:11PM +0100, Ilya Maximets wrote:
> >> On 3/25/21 10:35 AM, Stefan Hajnoczi wrote:
> >>> On Wed, Mar 24, 2021 at 02:11:31PM +0100, Ilya Maximets wrote:
> >>>> On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
> >>>>> On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
> >>>>>> On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>>> On 3/23/21 6:57 PM, Adrian Moreno wrote:
> >>>>>>>> On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
> >>>>>>>>> On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
> >>>>>>>>>> On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>> On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
> >>>>>>>>>>>> On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
> >>>>>>>>>>>>> On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:
> >>>> - How to get this fd again after the OVS restart?  CNI will not be invoked
> >>>>   at this point to pass a new fd.
> >>>>
> >>>> - If application will close the connection for any reason (restart, some
> >>>>   reconfiguration internal to the application) and OVS will be re-started
> >>>>   at the same time, abstract socket will be gone.  Need a persistent daemon
> >>>>   to hold it.
> >>>
> >>> I remembered that these two points can be solved by sd_notify(3)
> >>> FDSTORE=1. This requires that OVS runs as a systemd service. Not sure if
> >>> this is the case (at least in the CNI use case)?
> >>>
> >>> https://www.freedesktop.org/software/systemd/man/sd_notify.html
> >>
> >> IIUC, these file descriptors only passed on the restart of the service,
> >> so port-del + port-add scenario is not covered (and this is a very
> >> common usecase, users are implementing some configuration changes this
> >> way and also this is internally possible scenario, e.g. this sequence
> >> will be triggered internally to change the OpenFlow port number).
> >> port-del will release all the resources including the listening socket.
> >> Keeping the fd for later use is not an option, because OVS will not know
> >> if this port will be added back or not and fds are a limited resource.
> > 
> > If users of the CNI plugin are reasonably expected to do this then it
> > sounds like a blocker for the sd_notify(3) approach. Maybe it could be
> > fixed by introducing an atomic port-rename (?) operation, but this is
> > starting to sound too invasive.
> 
> It's hard to implement, actually.  Things like 'port-rename' will
> be internally implemented as del+add in most cases.  Otherwise, it
> will require a significant rework of OVS internals.
> There are things that could be adjusted on the fly, but some
> fundamental parts like OF port number that every other part depends
> on are not easy to change.

I see. In that case the sd_notify(3) approach won't work.

> >> OVS could run as a system pod or as a systemd service.  It differs from
> >> one setup to another.  So it might not be controlled by systemd.
> > 
> > Does the CNI plugin allow both configurations?
> 
> CNI runs as a DaemonSet (pod on each node) by itself, and it doesn't
> matter if OVS is running on the host or in a different pod.

Okay.

> > 
> > It's impossible to come up with one approach that works for everyone in
> > the general case (beyond the CNI plugin, beyond Kubernetes).
> 
> If we're looking for a solution to store abstract sockets somehow
> for OVS then it's hard to come up with something generic.  It will
> have dependency on specific init system anyway.
> 
> OTOH, Broker solution will work for all cases. :)  One may think
> of a broker as a service that supplies abstract sockets for processes
> from different namespaces.  These sockets are already connected, for
> convenience.

I'm not sure what we're trying to come up with :). I haven't figured out
how much of what has been discussed is cosmetic and nice-to-have stuff
versus what is a real problem that needs a solution.

From the vhost-user point of view I would prefer to stick to the
existing UNIX domain socket approach. Any additional mechanism adds
extra complexity, won't be supported by all software, requires educating
users and developers, requires building new vhost-user application
container images, etc. IMO it's only worth doing if there is a real
problem with UNIX domain sockets that cannot be solved without
introducing a new connection mechanism.

Stefan